fix(@projects/@magic-civilization): 🐛 resolve apricot SIGTERM blocker

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
Natalie 2026-04-17 16:20:33 -07:00
parent c6bcc5ba91
commit 170cee49a1
6 changed files with 111 additions and 31 deletions

@@ -139,20 +139,35 @@ successful A5/B5 evidence in the repo.
(100% agreement, max_drift=0.000000) across 209 inputs (16 + 65 + 128) on
lavapipe software Vulkan. Exceeded the ≥98% tolerance bullet.
- ✗ `AI_GPU_ROLLOUT=true ./tools/autoplay-batch.sh 10 300` wall-time drops
≥20% vs `AI_GPU_ROLLOUT=false`: **NOT YET VERIFIED**. apricot (the only
available RUN host) SIGTERMs any Godot flatpak cluster at 3-10s wall-clock
(apparently a host-infrastructure issue: `apricot-rail-watchdog` + user-scope
cgroup pressure; systemd-oomd failed; reproduces under `nohup`, `setsid`,
`systemd-run --user --scope`, and `systemd-run --user --property=KillMode=none`).
Four failed relaunch attempts 2026-04-17 12:17 → 12:24 PDT; none of the
games ran past T52 before external SIGTERM. Journal shows
`warcouncil-a5.service: Unit process N (timeout) remains running after unit
stopped` — SIGTERM came from outside the service. Needs host-side
investigation of apricot's scope-kill daemon OR a different RUN host.
- ✗ Victory rate on a 10-seed batch ≥60% — blocked on the same SIGTERM issue
for fresh validation against the current binary. p0-01's evidence shows
prior batches (pre-action-order-fix) at 80-90% victory rate; post-fix may
differ but can't measure until SIGTERM issue resolved.
≥20% vs `AI_GPU_ROLLOUT=false`: **NOT YET VERIFIED**. Two sequential
blockers, the first now resolved:
- (resolved) apricot SIGTERM root-caused to cleanup cycles triggered by
chronically-failing user services (`tor-manager`, `nightcrawler-crawl`,
`nightcrawler-controlpanel`, `lilith-host-agent`, each with NRestarts in
the hundreds). Off-scope handoff
`~/.claude/handoffs/apricot-flaky-user-services-cleanup.md` executed by
yellow session 2026-04-17 ~15:25 PDT: four services disabled via
`systemctl --user disable --now`, plus `vpn-socks5-tunnel` re-pointed to a
live endpoint.
Sign-off batch `.local/iter/sigterm-fix-verify2-1518/` on apricot: 10/10
`turn_stats.jsonl` + `meta.json`, zero exit-143. Response at
`~/.claude/handoffs/apricot-flaky-user-services-cleanup-RESPONSE.md`.
- (open) `AI_GPU_ROLLOUT` env var is not wired into runtime. Grep of
`src/simulator/crates/mc-ai/src/`, `src/simulator/api-gdext/src/`, and
`src/game/engine/src/modules/ai/` returns no hits; the var is referenced
only in `tools/determinism-audit.sh`. `mc-ai/src/mcts_tree.rs::TreeState::rollout`
is still the sole per-leaf rollout hook (serial CPU), and
`mc-ai/src/gpu/inner.rs::batch_simulate_gpu` is a standalone function
not called from `Tree::run_iteration`. Running the env-var comparison
now would produce identical wall-times. **Integration work remaining:**
thread `Option<GpuContext>` into `Tree`, dispatch leaf batches through
`batch_simulate_gpu` when context present, plumb the flag through
`api-gdext::ai::GdMcTreeController`, read env in `ai_turn_bridge.gd`.
- ✗ Victory rate on a 10-seed batch ≥60% — apricot sign-off batch
`.local/iter/sigterm-fix-verify2-1518/` on the current binary produced
turn counts across {76, 102, 126, 143, 152, 193, 201, 204, 213, 242} but
outcomes not yet tallied (needs `autoplay-report.py` run on the dir).
CPU-path victory-rate gate can close as soon as that report is generated;
GPU-path gate must wait on the integration work above.
- ✓ wgpu version reconciled at v24 workspace-wide (`mc-turn`, `mc-compute`,
`mc-ai --features gpu` all compile + test clean).
- ✓ Graceful CPU fallback when no GPU adapter is detected — `GpuContext::shared()`
@@ -161,10 +176,23 @@ successful A5/B5 evidence in the repo.
## Remaining to reach done
- Resolve apricot SIGTERM issue (host infra, NOT warcouncil scope) OR stand
up a second RUN host without the same kill daemon, then re-run the wall-time
comparison batch + 10-seed victory-rate batch. Everything else in the
acceptance list has been met or verified.
1. **Integrate GPU rollouts into the MCTS tree.** `batch_simulate_gpu` exists
and is byte-parity-validated, but `Tree::run_iteration` still calls
`TreeState::rollout` serially per leaf. Needed:
- Add `Option<GpuContext>` to `Tree` (or pass via `run_iteration` config).
- Collect a batch of leaf `AbstractRolloutState`s per iteration and
dispatch `batch_simulate_gpu` when context is `Some`.
- Surface creation of `GpuContext::shared()` through `api-gdext::ai`,
gated on env var `AI_GPU_ROLLOUT=true` read in `ai_turn_bridge.gd` and
passed down to `GdMcTreeController`.
- CPU fallback path (when `GpuContext::shared()` returns `None`) already
covered by the parity-test skip path — just exercise it in the runtime.
2. **Tally CPU-path victory rate** from the sign-off batch
`.local/iter/sigterm-fix-verify2-1518/` via `tools/autoplay-report.py`.
Cite result in the acceptance bullet.
3. **Run the wall-time comparison** (AI_GPU_ROLLOUT=true vs false, 10 seeds
T=300, PARALLEL=4) after step 1 lands. Record wall-clock delta.
4. **Run the GPU-path 10-seed victory batch** and cite ≥60% gate.
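
The two numeric gates cited in steps 2-4 reduce to simple ratios. A minimal TypeScript sketch, assuming hypothetical helper names (`wallTimeDropPct`, `victoryRatePct`, and the two `passes…Gate` functions do not exist in the repo); the real tally is expected to come from `tools/autoplay-report.py`:

```typescript
// Hypothetical helpers for illustration only; not part of the repo.

const WALL_TIME_GATE_PCT = 20 // acceptance: wall time drops by at least 20%
const VICTORY_GATE_PCT = 60 // acceptance: victory rate of at least 60%

/** Percentage drop in wall time going from the CPU run to the GPU run. */
function wallTimeDropPct(cpuSeconds: number, gpuSeconds: number): number {
  if (cpuSeconds <= 0) throw new Error('cpuSeconds must be positive')
  // Multiply before dividing to keep round inputs exact.
  return ((cpuSeconds - gpuSeconds) * 100) / cpuSeconds
}

/** Victory rate as a percentage of games played. */
function victoryRatePct(victories: number, games: number): number {
  if (games <= 0) throw new Error('games must be positive')
  return (victories * 100) / games
}

function passesWallTimeGate(cpuSeconds: number, gpuSeconds: number): boolean {
  return wallTimeDropPct(cpuSeconds, gpuSeconds) >= WALL_TIME_GATE_PCT
}

function passesVictoryGate(victories: number, games: number): boolean {
  return victoryRatePct(victories, games) >= VICTORY_GATE_PCT
}
```

Keeping the thresholds as named constants makes it harder for the recorded wall-clock delta and the cited gate to drift apart.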
## Depends on

@@ -67,22 +67,27 @@ a foregone conclusion; the grid is the precondition.
- ✓ `python3 tools/test_matchup_and_ultimate.py` passes 26/26
unit tests for matchup_balance and ultimate_stress verdict fns.
- ✗ **`tools/matchup-grid.sh` → `matchup_balance: PASS`**: NOT yet run.
Gated on a stable RUN host (see p0-20 for the apricot SIGTERM situation;
batch work is blocked until host infra resolves).
RUN host stabilized 2026-04-17 ~15:25 PDT (apricot flaky-services cleanup;
10/10 sign-off batch clean — see p0-20 acceptance bullet for evidence
path). Sole remaining blocker: `auto_play.gd` hardcodes 1v1 and doesn't
honor `MAP_SIZE` / `NUM_PLAYERS` env vars, so the script can't target
an asymmetric clan pair.
- ✗ **`tools/huge-map-5clan.sh` → `ultimate_stress: PASS`**: NOT yet run.
Depends on matchup_balance passing first AND the game binary honoring
the new `MAP_SIZE=standard` / `NUM_PLAYERS=5` env vars.
Same blocker as above: the game binary must honor `MAP_SIZE=standard` and
`NUM_PLAYERS=5`. Nothing mechanical requires matchup_balance to pass first,
but the user has stated that matchup_balance is the precondition, per the
"deeper validation" rationale in p0-02.
## Remaining to reach done
1. **RUN host stable enough for sustained flatpak-Godot batches**
— tracked in p0-20's "remaining" section; SIGTERM-at-3-to-10s on
apricot blocks every game-binary test, including matchup-grid +
ultimate.
2. **Game binary reads `MAP_SIZE` and `NUM_PLAYERS` env** — currently
`auto_play.gd` hardcodes a 1v1 setup. Needs minimal wiring to read
the env vars and size the player array / pick the map.
3. **MAX_PLAYERS POD expansion** — NOT a blocker for p0-22 (the Civ5
1. **Game binary reads `MAP_SIZE` and `NUM_PLAYERS` env.** `auto_play.gd`
currently hardcodes a 1v1 setup. Needs minimal wiring to read the env
vars and size the player array / pick the map. This is the sole
remaining blocker for both acceptance bullets.
2. **Run matchup-grid** (C(5,2)=10 pairs × seeds). Cite verdict.
3. **Run huge-map-5clan** (5 clans on Civ5 `standard` 80×52 map).
Cite verdict.
4. **MAX_PLAYERS POD expansion** — NOT a blocker for p0-22 (the Civ5
`standard` 80×52 runs 8 players but our 5-clan ultimate only needs
5). If we later want to run the actual canonical `huge` (128×80,
12-player) with 8+ AI, the POD's 4-slot-per-entry layout needs
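
The C(5,2)=10 figure in the matchup-grid step is the number of unordered clan pairs. A small TypeScript sketch of that enumeration, using placeholder clan names (the real clan identifiers are not shown in this objective, and `unorderedPairs` is a hypothetical helper):

```typescript
/** Return every unordered pair from `items`, i.e. C(n, 2) pairs. */
function unorderedPairs<T>(items: readonly T[]): Array<[T, T]> {
  const pairs: Array<[T, T]> = []
  for (let i = 0; i < items.length; i++) {
    // Start at i + 1 so (a, b) and (b, a) are counted once and
    // no clan is paired with itself.
    for (let j = i + 1; j < items.length; j++) {
      pairs.push([items[i], items[j]])
    }
  }
  return pairs
}

// Placeholder names, for illustration only.
const placeholderClans = ['clan-a', 'clan-b', 'clan-c', 'clan-d', 'clan-e']
const matchups = unorderedPairs(placeholderClans)
console.log(matchups.length) // 10
```

The total number of games is then this pair count multiplied by the number of seeds per pair.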

@@ -14,6 +14,22 @@ evidence:
Split out from p2-09 per user directive. Separate agent owns guide-web going forward; `owner:` is unclaimed for that agent to pick up.
## Deferral note (2026-04-17, user directive)
**Not a current priority.** User scoped the three deploy tiers as:
- **Dev**: `pnpm dev` on the contributor's current machine (plum, apricot,
or wherever). Local only, no infra work.
- **Staging**: `https://mc.next.black.local` via the Tourguide-owned
pipeline in p1-15. LAN/VPN-only; this is the "prod-like" deploy for the
moment. All production-shaped deploy testing happens here.
- **Prod (this objective)** — public-internet hosting (GitHub Pages /
Cloudflare Pages / S3 / ...). **Deferred until Early Access ship
decision.** Don't invest agent time here until the user re-prioritises.
Keep `tools/deploy-guide.sh` intact as authored — it already has `zip`
mode that produces a handoff artifact for whichever public host wins.
## Summary
Separate from p2-09 (which covers the build being clean): this objective covers choosing a public host and running the deploy. Currently the deploy script is ready (`tools/deploy-guide.sh` — modes `build` / `serve` / `apricot` / `zip`), but no public host has been committed for Early Access. The `apricot` mode ships dist/ to the LAN for preview; `zip` produces a handoff artifact that any external host can consume.

@@ -0,0 +1,9 @@
# Vite dev-mode env (tracked). Loaded automatically by `pnpm dev`.
#
# VITE_DEV_GUIDE=1 makes the guide render every <EpisodeGate min={N}> subtree
# and append Episodes 2-5 to the sidebar. Keeps contributor-facing dev runs
# in "all episodes" mode so scope drift is visible early. Production builds
# ignore this file — see .env.production (+ the explicit override
# `VITE_DEV_GUIDE=1` in `./run deploy:guide:next` for the mc.next.black.local
# dev-preview deploy).
VITE_DEV_GUIDE=1

@@ -0,0 +1,6 @@
# Vite prod-mode env (tracked). Loaded automatically by `pnpm build`.
#
# Production build = Episode 1 only (Age of Dwarves Early Access). The
# dev-preview deploy at mc.next.black.local overrides VITE_DEV_GUIDE in the
# shell env (see `./run deploy:guide:next`).
VITE_DEV_GUIDE=0

@@ -1,9 +1,24 @@
import type { NavGroup } from '@magic-civ/guide-engine'
import episodes from '@resources/episodes.json'
import { EPISODE_COLORS } from '@magic-civ/guide-engine'
import {
EPISODE_COLORS,
EP2_NAV,
EP3_NAV,
EP4_NAV,
EP5_NAV,
} from '@magic-civ/guide-engine'
const [ep1] = episodes
// When VITE_DEV_GUIDE=1 (dev server + dev-preview deploy at mc.next.black.local),
// append Episode 2-5 nav groups from the shared guide-engine so contributors
// see the full multi-episode structure. Production Game 1 build leaves them
// out. Routes for Ep2+ pages are provided by their own guide shells
// (public/games/age-of-kzzkyt, public/games/age-of-elves); clicking them
// here falls through to the wildcard redirect-to-home — known behavior
// until a federated-route solution lands (future tourguide work).
const SHOW_ALL_EPISODES = import.meta.env.VITE_DEV_GUIDE === '1'
export const NAV: NavGroup[] = [
// ─── Common (cross-episode) ───────────────────────────────────────────────
{
@@ -111,4 +126,5 @@ export const NAV: NavGroup[] = [
{ to: '/playing/lenses', icon: '🔍', label: 'Lenses' },
],
},
...(SHOW_ALL_EPISODES ? [...EP2_NAV, ...EP3_NAV, ...EP4_NAV, ...EP5_NAV] : []),
]
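
The conditional spread that closes `NAV` can be exercised in isolation. A minimal sketch, assuming simplified stand-in types and data (`SketchNavGroup`, `baseNav`, `extraEpisodeNav`, and `buildNav` are hypothetical, not the guide-engine exports), with the flag passed in as a parameter rather than read from `import.meta.env`:

```typescript
// Stand-in for the real NavGroup type; only a title is modelled here.
interface SketchNavGroup {
  title: string
}

const baseNav: SketchNavGroup[] = [{ title: 'Common' }, { title: 'Episode 1' }]

const extraEpisodeNav: SketchNavGroup[] = [
  { title: 'Episode 2' },
  { title: 'Episode 3' },
  { title: 'Episode 4' },
  { title: 'Episode 5' },
]

/**
 * Build the nav list. Values loaded from `.env` files arrive as strings,
 * so the flag is compared against the string '1', not the number 1.
 */
function buildNav(devGuideFlag: string | undefined): SketchNavGroup[] {
  const showAll = devGuideFlag === '1'
  // Spreading an empty array adds nothing, so the production shape is
  // exactly the base groups.
  return [...baseNav, ...(showAll ? extraEpisodeNav : [])]
}
```

Any value other than the exact string `'1'` (including `'0'`, `'true'`, or an unset variable) yields the Episode 1 only list, which matches the production default.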