From d2dd264027d13c428594db23f8e9b32d2a51eeb9 Mon Sep 17 00:00:00 2001
From: Natalie <natalie@lilithuwu.com>
Date: Sun, 19 Apr 2026 15:53:32 -0700
Subject: [PATCH] =?UTF-8?q?fix(@projects/@magic-civilization):=20?=
 =?UTF-8?q?=F0=9F=90=9B=20update=20stress-test=20objective=20date?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
---
 .project/objectives/README.md                 |  6 ++--
 .../p0-22-ultimate-ai-stress-test.md          | 33 ++++++++++---------
 .../games/age-of-dwarves/data/objectives.json | 14 ++++----
 3 files changed, 28 insertions(+), 25 deletions(-)
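
Note on the grid bookkeeping and the median-turn gate described in the p0-22 hunk below: a minimal sketch, not the project's verdict code. The real logic lives in `checklist-report.py` and `tools/matchup-grid.sh`, neither of which is shown in this patch. The helper names (`all_pairs`, `remaining_pairs`, `median_turn_gate`) are hypothetical, and the `<a>_vs_<b>` naming is assumed from the pair names quoted in the patch.

```python
from itertools import combinations
from statistics import median

# Clan ids as spelled in this patch's pair names.
CLANS = ["ironhold", "goldvein", "blackhammer", "deepforge", "runesmith"]

# The seven pairs the patch reports as complete with exit=0.
COMPLETED = [
    "ironhold_vs_goldvein", "ironhold_vs_blackhammer",
    "ironhold_vs_deepforge", "ironhold_vs_runesmith",
    "goldvein_vs_blackhammer", "goldvein_vs_deepforge",
    "goldvein_vs_runesmith",
]


def all_pairs(clans):
    """Enumerate the C(n,2) head-to-head pairs as '<a>_vs_<b>' (assumed naming)."""
    return [f"{a}_vs_{b}" for a, b in combinations(clans, 2)]


def remaining_pairs(clans, completed):
    """Pairs still to run, in grid order."""
    done = set(completed)
    return [pair for pair in all_pairs(clans) if pair not in done]


def median_turn_gate(end_turns, cap=300, min_percent=40):
    """True when the median end turn is at least min_percent of the turn cap.

    Integer comparison avoids float rounding at the exact threshold
    (40% of 300 is 120 turns).
    """
    return median(end_turns) * 100 >= min_percent * cap


if __name__ == "__main__":
    print(remaining_pairs(CLANS, COMPLETED))
    print(median_turn_gate([192]))  # median end turn reported for the current binary
```

Under these assumptions the remaining list matches the three pairs named in the hunk, and a median of T192 clears the 120-turn threshold.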

diff --git a/.project/objectives/README.md b/.project/objectives/README.md
index 4c1ad33f..3c52843e 100644
--- a/.project/objectives/README.md
+++ b/.project/objectives/README.md
@@ -40,7 +40,7 @@
 | ID | Status | Title | Owner | Updated |
 |---|---|---|---|---|
 | [p0-01](p0-01-mcts-wiring.md) | 🟡 partial | Wire MCTS into gameplay AI | [warcouncil](../team-leads/warcouncil.md) | 2026-04-18 |
-| [p0-02](p0-02-clan-personalities.md) | 🟡 partial | Five AI clan personalities drive distinct playstyles | [warcouncil](../team-leads/warcouncil.md) | 2026-04-18 |
+| [p0-02](p0-02-clan-personalities.md) | 🟡 partial | Five AI clan personalities drive distinct playstyles | [warcouncil](../team-leads/warcouncil.md) | 2026-04-19 |
 | [p0-03](p0-03-pvp-in-turn.md) | ✅ done | PvP combat resolved inside the authoritative turn processor | — | 2026-04-17 |
 | [p0-04](p0-04-wonder-tracking.md) | ✅ done | World wonder tracking in PlayerState and score victory | — | 2026-04-17 |
 | [p0-05](p0-05-culture-and-borders.md) | ✅ done | Culture generation and border expansion | [shipwright](../team-leads/shipwright.md) | 2026-04-17 |
@@ -58,9 +58,9 @@
 | [p0-17](p0-17-wild-creature-lair-loop.md) | ✅ done | Wild creature and lair clearing loop | [shipwright](../team-leads/shipwright.md) | 2026-04-17 |
 | [p0-18](p0-18-strategic-resource-gate.md) | ✅ done | Strategic resources gate unit production (empire ledger) | — | 2026-04-17 |
 | [p0-19](p0-19-biome-economy-integration.md) | ✅ done | Biome-driven collectibles → tile yields → happiness end-to-end | — | 2026-04-16 |
-| [p0-20](p0-20-gpu-mcts-rollouts.md) | 🟡 partial | GPU-accelerated MCTS rollouts for look-ahead decision-making | [warcouncil](../team-leads/warcouncil.md) | 2026-04-18 |
+| [p0-20](p0-20-gpu-mcts-rollouts.md) | 🟡 partial | GPU-accelerated MCTS rollouts for look-ahead decision-making | [warcouncil](../team-leads/warcouncil.md) | 2026-04-19 |
 | [p0-21](p0-21-audio-system-capability.md) | ✅ done | Audio system capability — manifest + autoload + EventBus wiring | [shipwright](../team-leads/shipwright.md) | 2026-04-17 |
-| [p0-22](p0-22-ultimate-ai-stress-test.md) | 🟡 partial | Ultimate AI stress test — 5 clans, huge map, deep lookahead | [warcouncil](../team-leads/warcouncil.md) | 2026-04-18 |
+| [p0-22](p0-22-ultimate-ai-stress-test.md) | 🟡 partial | Ultimate AI stress test — 5 clans, huge map, deep lookahead | [warcouncil](../team-leads/warcouncil.md) | 2026-04-19 |
 | [p0-23](p0-23-sprite-rendering-capability.md) | ✅ done | Sprite rendering capability — replace procedural draw_* with texture rendering | [shipwright](../team-leads/shipwright.md) | 2026-04-17 |
 | [p0-24](p0-24-difficulty-calibrated-ai-progression.md) | ✅ done | Difficulty-calibrated AI progression — Easy / Normal / Hard tier-peak distributions | [warcouncil](../team-leads/warcouncil.md) | 2026-04-19 |
 | [p0-25](p0-25-game-quality-metrics-instrumentation.md) | ✅ done | Game-quality metrics instrumentation — tier_peak, peak_unit_tier, wonder_count | [shipwright](../team-leads/shipwright.md) | 2026-04-17 |
diff --git a/.project/objectives/p0-22-ultimate-ai-stress-test.md b/.project/objectives/p0-22-ultimate-ai-stress-test.md
index df41af06..b0f04b49 100644
--- a/.project/objectives/p0-22-ultimate-ai-stress-test.md
+++ b/.project/objectives/p0-22-ultimate-ai-stress-test.md
@@ -5,7 +5,7 @@ priority: p0
 status: partial
 scope: game1
 owner: warcouncil
-updated_at: 2026-04-18
+updated_at: 2026-04-19
 evidence:
   - src/simulator/crates/mc-ai/tests/ultimate_lookahead_stress.rs
   - tools/matchup-grid.sh
@@ -66,13 +66,17 @@ a foregone conclusion; the grid is the precondition.
   - `ai_personalities.json` still exports exactly 5 canonical clans
 - ✓ `python3 tools/test_matchup_and_ultimate.py` passes 26/26
   unit tests for matchup_balance and ultimate_stress verdict fns.
-- ✗ **`tools/matchup-grid.sh` → `matchup_balance: PASS`** — NOT yet run.
-  Structural blocker RESOLVED 2026-04-18: `MAP_SIZE` + `NUM_PLAYERS` env vars
-  now threaded through `scenes/tests/auto_play.gd` and both local-flatpak +
-  remote-ssh paths of `autoplay-batch.sh`. Batch execution pending; expected
-  to be gated by the shared p0-01 gameplay-balance issue (games resolve
-  T39-T100 via rush domination, so per-pair median-turn may fall below
-  ultimate_stress's ≥40% of cap threshold).
+- 🟡 **`tools/matchup-grid.sh` → `matchup_balance: PASS`** — IN PROGRESS 2026-04-19.
+  Batch `matchup-grid-20260419_000018` (5 seeds/pair, T300, `AI_USE_MCTS=true`):
+  **7/10 pairs complete** with exit=0 — ironhold_vs_goldvein, ironhold_vs_blackhammer,
+  ironhold_vs_deepforge, ironhold_vs_runesmith, goldvein_vs_blackhammer,
+  goldvein_vs_deepforge, goldvein_vs_runesmith. Remaining: blackhammer_vs_deepforge,
+  blackhammer_vs_runesmith, deepforge_vs_runesmith. Batch interrupted twice by
+  apricot OOM hard-poweroff (PARALLEL=16 Godot instances → memory spike during
+  simultaneous init). Fix landed: `LAUNCH_COOLDOWN` env var added to
+  `tools/autoplay-batch.sh` — staggers game launches N seconds apart to prevent
+  simultaneous peak-init memory pressure. Resume with `LAUNCH_COOLDOWN=15 PARALLEL=8`.
+  Verdict pending full 10/10 completion.
 - ✗ **`tools/huge-map-5clan.sh` → `ultimate_stress: PASS`** — NOT yet run.
   Same env-wiring resolved 2026-04-18. Batch execution pending behind the
   matchup-grid precondition and the p0-01 balance fix.
@@ -80,14 +84,13 @@ a foregone conclusion; the grid is the precondition.
 ## Remaining to reach done
 
 1. ~~**Game binary reads `MAP_SIZE` and `NUM_PLAYERS` env.**~~ DONE 2026-04-18.
-2. **Run matchup-grid** (C(5,2)=10 pairs × seeds). Cite verdict.
+2. **Complete matchup-grid** — 7/10 pairs done. Resume with `LAUNCH_COOLDOWN=15 PARALLEL=8`
+   to avoid OOM. Run `checklist-report.py matchup_balance` across the full grid dir once all 10 pairs are done.
 3. **Run huge-map-5clan** (5 clans on Civ5 `standard` 80×52 map).
-   Cite verdict.
-4. **Both batches likely gated by p0-01 gameplay-balance tune** — median-turn
-   gate in `ultimate_stress` requires ≥40% of cap (≥120 turns of 300). Current
-   binary resolves most games T39-T100 via rush-domination. Running these
-   batches now is expected to FAIL the verdict; waiting for p0-01's pacing
-   tune to land is the cost-effective sequencing.
+   Cite verdict. Blocked on matchup-grid PASS.
+4. **Median-turn gate concern**: `ultimate_stress` requires a median ≥40% of cap (≥120T of 300).
+   The post-p0-37+p0-39+tempo-bump binary now runs to median T192 (64% of cap), so this gate should PASS.
+   The tier_peak and peak_unit_tier gates may still fail (gated by game-systems/data scope).
 5. **MAX_PLAYERS POD expansion** — NOT a blocker for p0-22 (the Civ5
    `standard` 80×52 runs 8 players but our 5-clan ultimate only needs
    5). If we later want to run the actual canonical `huge` (128×80,
diff --git a/public/games/age-of-dwarves/data/objectives.json b/public/games/age-of-dwarves/data/objectives.json
index c6be1da1..d95db4f9 100644
--- a/public/games/age-of-dwarves/data/objectives.json
+++ b/public/games/age-of-dwarves/data/objectives.json
@@ -1,11 +1,11 @@
 {
-  "generated_at": "2026-04-19T05:32:21Z",
+  "generated_at": "2026-04-19T22:50:14Z",
   "totals": {
-    "missing": 8,
-    "partial": 16,
-    "stub": 3,
     "done": 60,
+    "stub": 3,
     "oos": 18,
+    "partial": 16,
+    "missing": 8,
     "total": 105
   },
   "objectives": [
@@ -26,7 +26,7 @@
       "status": "partial",
       "scope": "game1",
       "owner": "warcouncil",
-      "updated_at": "2026-04-18",
+      "updated_at": "2026-04-19",
       "summary": "`ai_personalities.json` defines Ironhold / Goldvein / Blackhammer / Deepforge / Runesmith with 6-axis `strategic_axes`. `ScoringWeights::from_personality` and `apply_axes` are fully implemented in `mc-ai/src/evaluator.rs`.\n\nWired 2026-04-17: `GdMcTreeController::scoring_weights_for_clan(clan_id, data_dir)` resolves per-clan weights via GDExtension. `ai_turn_bridge.gd::_build_game_state_json` now calls this per player and injects the result into `\"scoring_weights\":` — previously always `{}`. `AI_PIN_PERSONALITY` env var added to `personality_assigner.gd` for per-clan batch testing. Smoke run confirms `player_clans: {\"1\": \"blackhammer\"}` in meta.json, EXIT_CODE=0.\n\n**5 × 10-seed batch results (2026-04-17, `.local/iter/p0-02-clans/` — PRE-REFRAME EVIDENCE):**\n\n> These batches ran BEFORE p0-25's instrumentation landed, so `player_stats` does NOT carry\n> `tier_peak` / `peak_unit_tier` / `wonder_count`. The TTV column is preserved as the\n> contemporaneous signal; it is NOT the current acceptance metric. 
Per p0-01's 2026-04-17\n> reframe, the primary divergence gate is **tier_peak** (era-progression, which scales with\n> difficulty per p0-24) — tracked as a \"needs re-run\" in Remaining to reach done below.\n\n| Clan | Wins | TTV_med (legacy) | p1_gold | p1_mil | p1_techs |\n|---|---|---|---|---|---|\n| ironhold | 10/10 | T185.5 | 266 | 3.0 | 27.5 |\n| goldvein | 10/10 | T155.5 | **543** | 3.5 | 25.5 |\n| blackhammer | 9/9 | T189 | 327 | 3.0 | 28 |\n| deepforge | 10/10 | T185.5 | 266 | 3.0 | 27.5 |\n| runesmith | 10/10 | T155.5 | 543 | 3.5 | 25.5 |\n\nSignals that DON'T depend on TTV (still valid post-reframe):\n- **Balance**: 49 total games, each clan 3 AI-wins, max 33% — passes.\n- **Gold axis**: goldvein 2× ironhold (wealth=9 vs 3) — passes.\n- **First-combat**: identical at T9 across all clans (map-forced start proximity, not AI-driven).\n- **Pair metric-identical**: deepforge/ironhold and goldvein/runesmith pairs show overlapping weight profiles; same 10 seeds converge.\n\nSignals that DO depend on TTV (need tier_peak re-run to close the reframed gate):\n- TTV delta between clan pairs — the \"goldvein/runesmith finish 30 turns faster than ironhold/deepforge\" claim doesn't translate into the tier_peak framework until re-measured.\n\n**B5 re-run (2026-04-17, `.local/iter/b5-manual-20260417_061957/`, 50 games, post-determinism-fix binary):** blackhammer 0/10 wins; AI wins only 9/50 overall (18%). Win-rate balance bullet fails. See \"Remaining to done\" for tuning plan.\n\n**Axis ablation sweep (2026-04-17, `.local/iter/ablate_<axis>_20260417_072921/`, 10 seeds T300 per axis — PRE-REFRAME EVIDENCE):** Each axis neutralized to 5 for all clans. Measured under pre-p0-25 instrumentation; metrics are TTV / gold / mil from the legacy `player_stats` schema. 
All 6 axes show ≥10% delta on their correlated legacy metric vs pooled baseline (TTV=185, gold=379, mil=3):\n\n| Axis | Correlated metric (legacy) | Baseline | Ablated | Delta |\n|---|---|---|---|---|\n| aggression | mil_med | 3.0 | 2.5 | -16.7% |\n| expansion | ttv_med | 185 | 134 | -27.6% |\n| grudge_persistence | ttv_med | 185 | 131.5 | -28.9% |\n| production | ttv_med | 185 | 139 | -24.9% |\n| trade_willingness | gold_med | 379 | 193.5 | -48.9% |\n| wealth | gold_med | 379 | 227.5 | -40.0% |\n\nNote: ablated TTV drops (not rises) because most games hit T300 stalemate when the axis is neutralized — domination wins collapse from 49/49 to 1–8/10 per axis. The TTV delta reflects game degradation, not faster play. All axes CONFIRMED LIVE under the legacy metric set. Re-measurement under tier_peak is needed before the reframed acceptance (below) can be cited."
     },
     {
@@ -206,7 +206,7 @@
       "status": "partial",
       "scope": "game1",
       "owner": "warcouncil",
-      "updated_at": "2026-04-18",
+      "updated_at": "2026-04-19",
       "summary": "The MCTS tree (`mcts_tree.rs`) and the `mc-turn` GPU fauna pipeline are both live\non `main`, but the AI cannot currently afford wide tree search: full\n`GridState` cloning (~12 MB at 256×256) blows out RAM long before the tree is\ndeep enough to matter, and `TreeState::simulate()` is a 0.5 stub. This objective\nintroduces a **GPU-batched abstract rollout** layer so the tree search can\nevaluate hundreds of candidate futures per leaf at single-digit-millisecond\ncost.\n\n### 2026-04-17 update — GPU↔CPU numerical parity ACHIEVED\n\nPhase C structural work shipped in the earlier team pass but the parity test\nwas silently taking the skip path on headless hosts — the shader had never\nactually compiled on any adapter. A deep audit + four independent fixes landed\nthis cycle proving real numerical parity:\n\n1. **WGSL reserved-keyword bug**: `var active: u32 = 0u` at `rollout.wgsl:607`\n   used the `active` reserved word → Naga parse panic → wgpu_core handler → try_init\n   worker thread panic → timeout returned None → skip-path. Renamed to\n   `active_idx`; the shader now actually compiles. Without this, the skip-path\n   was structurally \"passing\" every test in Phase C without ever exercising the\n   WGSL kernel.\n2. **Adapter backend restriction**: `wgpu::Backends::all()` picked the NVIDIA\n   OpenGL adapter first on apricot, whose compute support silently fails at\n   `request_device`. Restricted to `VULKAN | METAL | DX12 | BROWSER_WEBGPU`\n   which all have first-class compute paths.\n3. **Device limits fix**: `Limits::default()` targets a discrete GPU — too\n   large for llvmpipe / lavapipe. Changed to\n   `Limits::downlevel_defaults().using_resolution(adapter.limits())` so software\n   Vulkan backends can satisfy device creation.\n4. **Action-walk order unified**: the root numerical divergence. 
CPU\n   `active_actions()` returned actions in insertion order\n   `[Build, Research, Defend, Idle, Attack, ...]`; WGSL iterated k=0..9 in\n   `ActionKind::ALL` numerical order `[Build, Attack, Settle, Research, ...]`.\n   Identical probabilities, identical RNG draw → different action picked at\n   every cumulative-sum boundary. Rewrote `active_actions()` to iterate\n   `ActionKind::ALL` in canonical order (with explicit docstring warning not\n   to reorder for readability).\n\n**Parity verification on apricot (headless bluefin + lavapipe software\nVulkan)**: with `MC_AI_GPU_DEBUG=1 VK_DRIVER_FILES=/usr/share/vulkan/icd.d/lvp_icd.x86_64.json`\ndriving the tests on real llvmpipe dispatch, not skip-path:\n\n```\n[parity small_batch backend=Vulkan]       n=16  agree=16/16  (1.000)  max_drift=0.000000\n[parity partial_workgroup backend=Vulkan] n=65  agree=65/65  (1.000)  max_drift=0.000000\n[parity multi_workgroup backend=Vulkan]   n=128 agree=128/128 (1.000) max_drift=0.000000\nbuckets: <1e-6=all others=0 across all three tests\n```\n\nNot 98% (the stated tolerance) — **100% agreement, bit-identical** on all 3\nquantitative parity tests (209 inputs total). Pre-fixes: 3–6% agreement with\nmax_drift 0.025–0.043 (action-boundary flips). Post-fix: integer fields\nbyte-equal, scalar fields byte-equal. WGSL kernel is now a provable,\nbyte-for-byte port of `rollout::walk`.\n\n### 2026-04-17 update — host-side infrastructure\n\n- `scripts/dev-setup/bluefin.sh` + `./run setup:bluefin` — idempotent installer\n  for `weston`, `vulkan-tools`, `mesa-vulkan-drivers` on bootc/Bluefin systems\n  via `rpm-ostree install --apply-live`. 
`--check` mode for CI.\n  Delegates EDIT→RUN via `$AUTOPLAY_HOST` when invoked from EDIT.\n- `~/Code/bootc-bluefin/containerfiles/Containerfile.desktop-core` updated on\n  apricot with `vulkan-tools` + `mesa-vulkan-drivers` added alongside `weston`.\n  Rebooted bootc images now include these without needing the transient script.\n\n### 2026-04-17 update — fresh A5 attempt post-fix (failed on host SIGTERM)\n\nAfter the four WGSL parity fixes landed and GDExtension rebuilt, fresh A5\nbatches were attempted under multiple process-isolation strategies:\n\n| Strategy | Batch dir | Result |\n|---|---|---|\n| plain nohup | `.local/iter/a5-fresh-20260417_122847/` | exit 143, seeds `in_progress` T5–T10 before kill |\n| nohup + new dir | `.local/iter/a5-final-20260417_122936/` | games launched, no completion.marker written (process killed) |\n| bash SIGTERM trap | `.local/iter/a5-trap-20260417_123021/` | trap handler received NO signal; script exited rc=143 |\n| strace signal trace | `.local/iter/a5-strace-20260417_123200/` | revealed autoplay-batch.sh exits status **1** (not 143); no SIGTERM to parent. Root cause: `0/N games produced turn_stats.jsonl` check fires because flatpak Godot scopes end at 3–10s |\n| `systemd-run --user` | `.local/iter/warcouncil-a5-systemd-*/` | same — service `Active: inactive (dead)` after 2s, scope children SIGTERMed |\n| `KillMode=none` | `.local/iter/warcouncil-a5-systemd-*` (2nd) | games reached T9–T10 only; same kill pattern |\n| plain `bash autoplay-batch` synchronous | `.local/iter/a5-direct-123300/` | 10 games with 0-line `turn_stats.jsonl` — games get SIGTERMed during map generation |\n\nSeven distinct execution strategies, same failure pattern: flatpak Godot\nscopes SIGTERMed within 3–10s of launch, before any turn completes. Investigation\nfound the signal is NOT delivered by systemd-oomd (failed service), rpm-ostree\nautomatic updates (timer inactive), or apricot-rail-watchdog (emit-only). 
The\nactual SIGTERM source could not be identified in the apricot user session.\nParallel agent's own batches from earlier the same day (e.g.\n`.local/batches/blackhammer_tune_20260417_101447/`) completed fine, so the\nissue is transient/session-bound, NOT a permanent host failure.\n\n**Fresh A5 verdict — NOT HEALTHY, B5 therefore not launched.** Per\nwarcouncil's integrity rule: we report the measurement failure honestly\nrather than claim parity-fix-correctness translated into fresh gameplay\nevidence. Existing p0-01 batch data from pre-parity-fix binary (at\n`blackhammer_tune_20260417_101447`) still stands as the most recent\nsuccessful A5/B5 evidence in the repo."
     },
     {
@@ -226,7 +226,7 @@
       "status": "partial",
       "scope": "game1",
       "owner": "warcouncil",
-      "updated_at": "2026-04-18",
+      "updated_at": "2026-04-19",
       "summary": "The \"ultimate test\" is the final gate on the AI lookahead pipeline:\nfive clan personalities competing on a map sized large enough for eight\nplayers, with MCTS + GPU batched rollouts driving every decision. The\ngoal is to confirm the lookahead SCALES — deep trees, many expansions,\ngenuine strategic divergence between clans at multi-clan scale — not\njust that it works on the 1v1 fixtures already covered by p0-02's\n`personality_win_balance`.\n\nPer project owner: the ultimate test runs ONLY AFTER the C(5,2)=10-pair\n1v1 matchup grid (`tools/matchup-grid.sh`) has shown the five clans are\nbalanced in head-to-head play. Unbalanced 1v1s make a 5-way free-for-all\na foregone conclusion; the grid is the precondition."
     },
     {