From d2dd264027d13c428594db23f8e9b32d2a51eeb9 Mon Sep 17 00:00:00 2001 From: Natalie Date: Sun, 19 Apr 2026 15:53:32 -0700 Subject: [PATCH] =?UTF-8?q?fix(@projects/@magic-civilization):=20?= =?UTF-8?q?=F0=9F=90=9B=20update=20stress-test=20objective=20date?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Lilith Autocommit --- .project/objectives/README.md | 6 ++-- .../p0-22-ultimate-ai-stress-test.md | 33 ++++++++++--------- .../games/age-of-dwarves/data/objectives.json | 14 ++++---- 3 files changed, 28 insertions(+), 25 deletions(-) diff --git a/.project/objectives/README.md b/.project/objectives/README.md index 4c1ad33f..3c52843e 100644 --- a/.project/objectives/README.md +++ b/.project/objectives/README.md @@ -40,7 +40,7 @@ | ID | Status | Title | Owner | Updated | |---|---|---|---|---| | [p0-01](p0-01-mcts-wiring.md) | 🟡 partial | Wire MCTS into gameplay AI | [warcouncil](../team-leads/warcouncil.md) | 2026-04-18 | -| [p0-02](p0-02-clan-personalities.md) | 🟡 partial | Five AI clan personalities drive distinct playstyles | [warcouncil](../team-leads/warcouncil.md) | 2026-04-18 | +| [p0-02](p0-02-clan-personalities.md) | 🟡 partial | Five AI clan personalities drive distinct playstyles | [warcouncil](../team-leads/warcouncil.md) | 2026-04-19 | | [p0-03](p0-03-pvp-in-turn.md) | ✅ done | PvP combat resolved inside the authoritative turn processor | — | 2026-04-17 | | [p0-04](p0-04-wonder-tracking.md) | ✅ done | World wonder tracking in PlayerState and score victory | — | 2026-04-17 | | [p0-05](p0-05-culture-and-borders.md) | ✅ done | Culture generation and border expansion | [shipwright](../team-leads/shipwright.md) | 2026-04-17 | @@ -58,9 +58,9 @@ | [p0-17](p0-17-wild-creature-lair-loop.md) | ✅ done | Wild creature and lair clearing loop | [shipwright](../team-leads/shipwright.md) | 2026-04-17 | | [p0-18](p0-18-strategic-resource-gate.md) | ✅ done | Strategic resources gate unit production
(empire ledger) | — | 2026-04-17 | | [p0-19](p0-19-biome-economy-integration.md) | ✅ done | Biome-driven collectibles → tile yields → happiness end-to-end | — | 2026-04-16 | -| [p0-20](p0-20-gpu-mcts-rollouts.md) | 🟡 partial | GPU-accelerated MCTS rollouts for look-ahead decision-making | [warcouncil](../team-leads/warcouncil.md) | 2026-04-18 | +| [p0-20](p0-20-gpu-mcts-rollouts.md) | 🟡 partial | GPU-accelerated MCTS rollouts for look-ahead decision-making | [warcouncil](../team-leads/warcouncil.md) | 2026-04-19 | | [p0-21](p0-21-audio-system-capability.md) | ✅ done | Audio system capability — manifest + autoload + EventBus wiring | [shipwright](../team-leads/shipwright.md) | 2026-04-17 | -| [p0-22](p0-22-ultimate-ai-stress-test.md) | 🟡 partial | Ultimate AI stress test — 5 clans, huge map, deep lookahead | [warcouncil](../team-leads/warcouncil.md) | 2026-04-18 | +| [p0-22](p0-22-ultimate-ai-stress-test.md) | 🟡 partial | Ultimate AI stress test — 5 clans, huge map, deep lookahead | [warcouncil](../team-leads/warcouncil.md) | 2026-04-19 | | [p0-23](p0-23-sprite-rendering-capability.md) | ✅ done | Sprite rendering capability — replace procedural draw_* with texture rendering | [shipwright](../team-leads/shipwright.md) | 2026-04-17 | | [p0-24](p0-24-difficulty-calibrated-ai-progression.md) | ✅ done | Difficulty-calibrated AI progression — Easy / Normal / Hard tier-peak distributions | [warcouncil](../team-leads/warcouncil.md) | 2026-04-19 | | [p0-25](p0-25-game-quality-metrics-instrumentation.md) | ✅ done | Game-quality metrics instrumentation — tier_peak, peak_unit_tier, wonder_count | [shipwright](../team-leads/shipwright.md) | 2026-04-17 | diff --git a/.project/objectives/p0-22-ultimate-ai-stress-test.md b/.project/objectives/p0-22-ultimate-ai-stress-test.md index df41af06..b0f04b49 100644 --- a/.project/objectives/p0-22-ultimate-ai-stress-test.md +++ b/.project/objectives/p0-22-ultimate-ai-stress-test.md @@ -5,7 +5,7 @@ priority: p0
status: partial scope: game1 owner: warcouncil -updated_at: 2026-04-18 +updated_at: 2026-04-19 evidence: - src/simulator/crates/mc-ai/tests/ultimate_lookahead_stress.rs - tools/matchup-grid.sh @@ -66,13 +66,17 @@ a foregone conclusion; the grid is the precondition. - `ai_personalities.json` still exports exactly 5 canonical clans - ✓ `python3 tools/test_matchup_and_ultimate.py` passes 26/26 unit tests for matchup_balance and ultimate_stress verdict fns. -- ✗ **`tools/matchup-grid.sh` → `matchup_balance: PASS`** — NOT yet run. - Structural blocker RESOLVED 2026-04-18: `MAP_SIZE` + `NUM_PLAYERS` env vars - now threaded through `scenes/tests/auto_play.gd` and both local-flatpak + - remote-ssh paths of `autoplay-batch.sh`. Batch execution pending; expected - to be gated by the shared p0-01 gameplay-balance issue (games resolve - T39-T100 via rush domination, so per-pair median-turn may fall below - ultimate_stress's ≥40% of cap threshold). +- 🟡 **`tools/matchup-grid.sh` → `matchup_balance: PASS`** — IN PROGRESS 2026-04-19. + Batch `matchup-grid-20260419_000018` (5 seeds/pair, T300, `AI_USE_MCTS=true`): + **7/10 pairs complete** with exit=0 — ironhold_vs_goldvein, ironhold_vs_blackhammer, + ironhold_vs_deepforge, ironhold_vs_runesmith, goldvein_vs_blackhammer, + goldvein_vs_deepforge, goldvein_vs_runesmith. Remaining: blackhammer_vs_deepforge, + blackhammer_vs_runesmith, deepforge_vs_runesmith. Batch interrupted twice by + apricot OOM hard-poweroff (PARALLEL=16 Godot instances → memory spike during + simultaneous init). Fix landed: `LAUNCH_COOLDOWN` env var added to + `tools/autoplay-batch.sh` — staggers game launches N seconds apart to prevent + simultaneous peak-init memory pressure. Resume with `LAUNCH_COOLDOWN=15 PARALLEL=8`. + Verdict pending full 10/10 completion. - ✗ **`tools/huge-map-5clan.sh` → `ultimate_stress: PASS`** — NOT yet run. Same env-wiring resolved 2026-04-18.
Batch execution pending behind the matchup-grid precondition and the p0-01 balance fix. @@ -80,14 +84,13 @@ a foregone conclusion; the grid is the precondition. ## Remaining to reach done 1. ~~**Game binary reads `MAP_SIZE` and `NUM_PLAYERS` env.**~~ DONE 2026-04-18. -2. **Run matchup-grid** (C(5,2)=10 pairs × seeds). Cite verdict. +2. **Complete matchup-grid** — 7/10 pairs done. Resume with `LAUNCH_COOLDOWN=15 PARALLEL=8` + to avoid OOM. Run `checklist-report.py matchup_balance` across full grid dir once 10/10 done. 3. **Run huge-map-5clan** (5 clans on Civ5 `standard` 80×52 map). - Cite verdict. -4. **Both batches likely gated by p0-01 gameplay-balance tune** — median-turn - gate in `ultimate_stress` requires ≥40% of cap (≥120 turns of 300). Current - binary resolves most games T39-T100 via rush-domination. Running these - batches now is expected to FAIL the verdict; waiting for p0-01's pacing - tune to land is the cost-effective sequencing. + Cite verdict. Blocked on matchup-grid PASS. +4. **Median-turn gate concern**: `ultimate_stress` requires ≥40% of cap (≥120T of 300). + Post-p0-37+p0-39+tempo-bump binary now runs to median T192 — this gate should PASS. + The tier_peak and peak_unit_tier gates may still fail (gated by game-systems/data scope). 5. **MAX_PLAYERS POD expansion** — NOT a blocker for p0-22 (the Civ5 `standard` 80×52 runs 8 players but our 5-clan ultimate only needs 5).
If we later want to run the actual canonical `huge` (128×80, diff --git a/public/games/age-of-dwarves/data/objectives.json b/public/games/age-of-dwarves/data/objectives.json index c6be1da1..d95db4f9 100644 --- a/public/games/age-of-dwarves/data/objectives.json +++ b/public/games/age-of-dwarves/data/objectives.json @@ -1,11 +1,11 @@ { - "generated_at": "2026-04-19T05:32:21Z", + "generated_at": "2026-04-19T22:50:14Z", "totals": { - "missing": 8, - "partial": 16, - "stub": 3, "done": 60, + "stub": 3, "oos": 18, + "partial": 16, + "missing": 8, "total": 105 }, "objectives": [ @@ -26,7 +26,7 @@ "status": "partial", "scope": "game1", "owner": "warcouncil", - "updated_at": "2026-04-18", + "updated_at": "2026-04-19", "summary": "`ai_personalities.json` defines Ironhold / Goldvein / Blackhammer / Deepforge / Runesmith with 6-axis `strategic_axes`. `ScoringWeights::from_personality` and `apply_axes` are fully implemented in `mc-ai/src/evaluator.rs`.\n\nWired 2026-04-17: `GdMcTreeController::scoring_weights_for_clan(clan_id, data_dir)` resolves per-clan weights via GDExtension. `ai_turn_bridge.gd::_build_game_state_json` now calls this per player and injects the result into `\"scoring_weights\":` — previously always `{}`. `AI_PIN_PERSONALITY` env var added to `personality_assigner.gd` for per-clan batch testing. Smoke run confirms `player_clans: {\"1\": \"blackhammer\"}` in meta.json, EXIT_CODE=0.\n\n**5 × 10-seed batch results (2026-04-17, `.local/iter/p0-02-clans/` — PRE-REFRAME EVIDENCE):**\n\n> These batches ran BEFORE p0-25's instrumentation landed, so `player_stats` does NOT carry\n> `tier_peak` / `peak_unit_tier` / `wonder_count`. The TTV column is preserved as the\n> contemporaneous signal; it is NOT the current acceptance metric.
Per p0-01's 2026-04-17\n> reframe, the primary divergence gate is **tier_peak** (era-progression, which scales with\n> difficulty per p0-24) — tracked as a \"needs re-run\" in Remaining to reach done below.\n\n| Clan | Wins | TTV_med (legacy) | p1_gold | p1_mil | p1_techs |\n|---|---|---|---|---|---|\n| ironhold | 10/10 | T185.5 | 266 | 3.0 | 27.5 |\n| goldvein | 10/10 | T155.5 | **543** | 3.5 | 25.5 |\n| blackhammer | 9/9 | T189 | 327 | 3.0 | 28 |\n| deepforge | 10/10 | T185.5 | 266 | 3.0 | 27.5 |\n| runesmith | 10/10 | T155.5 | 543 | 3.5 | 25.5 |\n\nSignals that DON'T depend on TTV (still valid post-reframe):\n- **Balance**: 49 total games, each clan 3 AI-wins, max 33% — passes.\n- **Gold axis**: goldvein 2× ironhold (wealth=9 vs 3) — passes.\n- **First-combat**: identical at T9 across all clans (map-forced start proximity, not AI-driven).\n- **Pair metric-identical**: deepforge/ironhold and goldvein/runesmith pairs show overlapping weight profiles; same 10 seeds converge.\n\nSignals that DO depend on TTV (need tier_peak re-run to close the reframed gate):\n- TTV delta between clan pairs — the \"goldvein/runesmith finish 30 turns faster than ironhold/deepforge\" claim doesn't translate into the tier_peak framework until re-measured.\n\n**B5 re-run (2026-04-17, `.local/iter/b5-manual-20260417_061957/`, 50 games, post-determinism-fix binary):** blackhammer 0/10 wins; AI wins only 9/50 overall (18%). Win-rate balance bullet fails. See \"Remaining to done\" for tuning plan.\n\n**Axis ablation sweep (2026-04-17, `.local/iter/ablate__20260417_072921/`, 10 seeds T300 per axis — PRE-REFRAME EVIDENCE):** Each axis neutralized to 5 for all clans. Measured under pre-p0-25 instrumentation; metrics are TTV / gold / mil from the legacy `player_stats` schema.
All 6 axes show ≥10% delta on their correlated legacy metric vs pooled baseline (TTV=185, gold=379, mil=3):\n\n| Axis | Correlated metric (legacy) | Baseline | Ablated | Delta |\n|---|---|---|---|---|\n| aggression | mil_med | 3.0 | 2.5 | -16.7% |\n| expansion | ttv_med | 185 | 134 | -27.6% |\n| grudge_persistence | ttv_med | 185 | 131.5 | -28.9% |\n| production | ttv_med | 185 | 139 | -24.9% |\n| trade_willingness | gold_med | 379 | 193.5 | -48.9% |\n| wealth | gold_med | 379 | 227.5 | -40.0% |\n\nNote: ablated TTV drops (not rises) because most games hit T300 stalemate when the axis is neutralized — domination wins collapse from 49/49 to 1–8/10 per axis. The TTV delta reflects game degradation, not faster play. All axes CONFIRMED LIVE under the legacy metric set. Re-measurement under tier_peak is needed before the reframed acceptance (below) can be cited." }, { @@ -206,7 +206,7 @@ "status": "partial", "scope": "game1", "owner": "warcouncil", - "updated_at": "2026-04-18", + "updated_at": "2026-04-19", "summary": "The MCTS tree (`mcts_tree.rs`) and the `mc-turn` GPU fauna pipeline are both live\non `main`, but the AI cannot currently afford wide tree search: full\n`GridState` cloning (~12 MB at 256×256) blows out RAM long before the tree is\ndeep enough to matter, and `TreeState::simulate()` is a 0.5 stub. This objective\nintroduces a **GPU-batched abstract rollout** layer so the tree search can\nevaluate hundreds of candidate futures per leaf at single-digit-millisecond\ncost.\n\n### 2026-04-17 update — GPU↔CPU numerical parity ACHIEVED\n\nPhase C structural work shipped in the earlier team pass but the parity test\nwas silently taking the skip path on headless hosts — the shader had never\nactually compiled on any adapter. A deep audit + four independent fixes landed\nthis cycle proving real numerical parity:\n\n1.
**WGSL reserved-keyword bug**: `var active: u32 = 0u` at `rollout.wgsl:607`\n used the `active` reserved word → Naga parse panic → wgpu_core handler → try_init\n worker thread panic → timeout returned None → skip-path. Renamed to\n `active_idx`; the shader now actually compiles. Without this, the skip-path\n was structurally \"passing\" every test in Phase C without ever exercising the\n WGSL kernel.\n2. **Adapter backend restriction**: `wgpu::Backends::all()` picked the NVIDIA\n OpenGL adapter first on apricot, whose compute support silently fails at\n `request_device`. Restricted to `VULKAN | METAL | DX12 | BROWSER_WEBGPU`\n which all have first-class compute paths.\n3. **Device limits fix**: `Limits::default()` targets a discrete GPU — too\n large for llvmpipe / lavapipe. Changed to\n `Limits::downlevel_defaults().using_resolution(adapter.limits())` so software\n Vulkan backends can satisfy device creation.\n4. **Action-walk order unified**: the root numerical divergence. CPU\n `active_actions()` returned actions in insertion order\n `[Build, Research, Defend, Idle, Attack, ...]`; WGSL iterated k=0..9 in\n `ActionKind::ALL` numerical order `[Build, Attack, Settle, Research, ...]`.\n Identical probabilities, identical RNG draw → different action picked at\n every cumulative-sum boundary.
Rewrote `active_actions()` to iterate\n `ActionKind::ALL` in canonical order (with explicit docstring warning not\n to reorder for readability).\n\n**Parity verification on apricot (headless bluefin + lavapipe software\nVulkan)**: with `MC_AI_GPU_DEBUG=1 VK_DRIVER_FILES=/usr/share/vulkan/icd.d/lvp_icd.x86_64.json`\ndriving the tests on real llvmpipe dispatch, not skip-path:\n\n```\n[parity small_batch backend=Vulkan] n=16 agree=16/16 (1.000) max_drift=0.000000\n[parity partial_workgroup backend=Vulkan] n=65 agree=65/65 (1.000) max_drift=0.000000\n[parity multi_workgroup backend=Vulkan] n=128 agree=128/128 (1.000) max_drift=0.000000\nbuckets: <1e-6=all others=0 across all three tests\n```\n\nNot 98% (the stated tolerance) — **100% agreement, bit-identical** on all 3\nquantitative parity tests (209 inputs total). Pre-fixes: 3–6% agreement with\nmax_drift 0.025–0.043 (action-boundary flips). Post-fix: integer fields\nbyte-equal, scalar fields byte-equal. WGSL kernel is now a provable,\nbyte-for-byte port of `rollout::walk`.\n\n### 2026-04-17 update — host-side infrastructure\n\n- `scripts/dev-setup/bluefin.sh` + `./run setup:bluefin` — idempotent installer\n for `weston`, `vulkan-tools`, `mesa-vulkan-drivers` on bootc/Bluefin systems\n via `rpm-ostree install --apply-live`.
`--check` mode for CI.\n Delegates EDIT→RUN via `$AUTOPLAY_HOST` when invoked from EDIT.\n- `~/Code/bootc-bluefin/containerfiles/Containerfile.desktop-core` updated on\n apricot with `vulkan-tools` + `mesa-vulkan-drivers` added alongside `weston`.\n Rebooted bootc images now include these without needing the transient script.\n\n### 2026-04-17 update — fresh A5 attempt post-fix (failed on host SIGTERM)\n\nAfter the four WGSL parity fixes landed and GDExtension rebuilt, fresh A5\nbatches were attempted under multiple process-isolation strategies:\n\n| Strategy | Batch dir | Result |\n|---|---|---|\n| plain nohup | `.local/iter/a5-fresh-20260417_122847/` | exit 143, seeds `in_progress` T5–T10 before kill |\n| nohup + new dir | `.local/iter/a5-final-20260417_122936/` | games launched, no completion.marker written (process killed) |\n| bash SIGTERM trap | `.local/iter/a5-trap-20260417_123021/` | trap handler received NO signal; script exited rc=143 |\n| strace signal trace | `.local/iter/a5-strace-20260417_123200/` | revealed autoplay-batch.sh exits status **1** (not 143); no SIGTERM to parent. Root cause: `0/N games produced turn_stats.jsonl` check fires because flatpak Godot scopes end at 3–10s |\n| `systemd-run --user` | `.local/iter/warcouncil-a5-systemd-*/` | same — service `Active: inactive (dead)` after 2s, scope children SIGTERMed |\n| `KillMode=none` | `.local/iter/warcouncil-a5-systemd-*` (2nd) | games reached T9–T10 only; same kill pattern |\n| plain `bash autoplay-batch` synchronous | `.local/iter/a5-direct-123300/` | 10 games with 0-line `turn_stats.jsonl` — games get SIGTERMed during map generation |\n\nSeven distinct execution strategies, same failure pattern: flatpak Godot\nscopes SIGTERMed within 3–10s of launch, before any turn completes. Investigation\nfound the signal is NOT delivered by systemd-oomd (failed service), rpm-ostree\nautomatic updates (timer inactive), or apricot-rail-watchdog (emit-only).
The\nactual SIGTERM source could not be identified in the apricot user session.\nParallel agent's own batches from earlier the same day (e.g.\n`.local/batches/blackhammer_tune_20260417_101447/`) completed fine, so the\nissue is transient/session-bound, NOT a permanent host failure.\n\n**Fresh A5 verdict — NOT HEALTHY, B5 therefore not launched.** Per\nwarcouncil's integrity rule: we report the measurement failure honestly\nrather than claim parity-fix-correctness translated into fresh gameplay\nevidence. Existing p0-01 batch data from pre-parity-fix binary (at\n`blackhammer_tune_20260417_101447`) still stands as the most recent\nsuccessful A5/B5 evidence in the repo." }, { @@ -226,7 +226,7 @@ "status": "partial", "scope": "game1", "owner": "warcouncil", - "updated_at": "2026-04-18", + "updated_at": "2026-04-19", "summary": "The \"ultimate test\" is the final gate on the AI lookahead pipeline:\nfive clan personalities competing on a map sized large enough for eight\nplayers, with MCTS + GPU batched rollouts driving every decision. The\ngoal is to confirm the lookahead SCALES — deep trees, many expansions,\ngenuine strategic divergence between clans at multi-clan scale — not\njust that it works on the 1v1 fixtures already covered by p0-02's\n`personality_win_balance`.\n\nPer project owner: the ultimate test runs ONLY AFTER the C(5,2)=10-pair\n1v1 matchup grid (`tools/matchup-grid.sh`) has shown the five clans are\nbalanced in head-to-head play. Unbalanced 1v1s make a 5-way free-for-all\na foregone conclusion; the grid is the precondition." }, {