magicciv/.project/objectives/p1-22a-huge-map-ai-quality.md at 0349a4e8fd9ec0b036f0862e324b76bb20a26e5a

Natalie db9c0e14e8 fix(@projects/@magic-civilization): 🐛 mark objective as done and update details

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>

2026-05-17 01:36:31 -07:00


title

priority

status

scope

owner

updated_at

blocked_by

p1-22a

Huge-map AI quality — close the 4/10 → ≥5/10 decisive-game gate

done

game1

warcouncil

2026-05-17

p1-22

Summary

The huge-map 5-clan batch (tools/huge-map-5clan.sh, 10 seeds, T300 limit, MCTS_DECISION_BUDGET_MS=2000) has landed at 4/10 victories across three independent runs (cycle-1 pre-budget, cycle-2 post-tactical-budget, cycle-3 post-p0-20 2× GPU rollout speed). The gate is ≥5/10.

Post-p0-20 evidence eliminates budget plumbing as the bottleneck: with budget_ms=50 the budget test fires at dispatched=2623 << 100000 (1/38 of the iteration cap), and GPU rollouts are 2× faster than CPU. Yet the ratio did not move from 4/10. This is AI strategic quality on huge maps, not throughput.

Diagnosis

Finding 1 — Abstract projection truncates to MAX_PLAYERS=4 on a 5-player game

src/simulator/crates/mc-turn/src/abstract_projection.rs:47:

let n = state.players.len().min(MAX_PLAYERS);

MAX_PLAYERS is defined as 4 in src/simulator/crates/mc-ai/src/abstract_state.rs:38. On a 5-clan huge-map game the fifth player is silently dropped from the AbstractRolloutState POD fed to the GPU rollout. The rollout has no representation of the 5th player's territory, military, or diplomatic relations, so all inter-player force_rel/relations computations run against a 4-player phantom.

Impact: GPU rollout evaluations systematically misvalue strategic positions in 5-player games. A clan that is diplomatically safe because the 5th player buffers it looks dangerous on the abstract projection, and vice-versa. This degrades MCTS value estimates in the tree, leading to suboptimal early strategic decisions.
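The truncation in Finding 1 can be illustrated with a minimal Python sketch. It is a stand-in for the Rust projection, not the project's code: the `Player` class and `project_players` helper are hypothetical, and only the `min(len, MAX_PLAYERS)` clamp and the value 4 come from the text above.

```python
from dataclasses import dataclass

MAX_PLAYERS = 4  # value cited above from abstract_state.rs:38


@dataclass
class Player:  # hypothetical stand-in for the real PlayerState
    name: str
    military: int


def project_players(players, max_players=MAX_PLAYERS):
    """Mirror `state.players.len().min(MAX_PLAYERS)`: keep only the first n slots."""
    n = min(len(players), max_players)
    return players[:n], players[n:]  # (projected, silently dropped)


clans = [Player(f"P{i}", military=10 * (i + 1)) for i in range(5)]
kept, dropped = project_players(clans)
print([p.name for p in kept])     # slots the rollout can see
print([p.name for p in dropped])  # slots that never reach the POD
```

With five players and a cap of four, the last slot lands in `dropped` without any error being raised, which is why the problem is easy to miss.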

Finding 2 — Strategic decision space is O(n²) larger on huge maps

A huge map (128×128 tiles) has ~4× the unit density of a standard map. Each MCTS iteration traverses legal_actions() — which includes all unit move targets and all city build queue choices — so the branching factor is ~4× larger. With MCTS_DECISION_BUDGET_MS=2000 the tree gets ~2000/cost(iter) iterations; on huge-map states with high unit density each iteration is more expensive, giving fewer rollouts per decision. The abstract-projection GPU path mitigates this but only partially, since GPU occupancy is bounded by dispatch queue depth (currently 1024 max per Phase B).

Impact: MCTS makes decisions with shallower trees on huge maps than on standard maps at the same wall-clock budget, leading to greedier near-sighted play.
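The budget arithmetic behind Finding 2 (iterations ≈ budget / cost per iteration) can be sketched as follows. The per-iteration costs are invented purely for illustration and were not measured; the assumption that iteration cost scales with the ~4× branching factor is also mine, not the source's.

```python
def iterations_per_decision(budget_ms, iter_cost_ms):
    """Whole MCTS iterations that fit in one decision budget."""
    return int(budget_ms // iter_cost_ms)


BUDGET_MS = 2000  # MCTS_DECISION_BUDGET_MS from the batch configuration

# Hypothetical costs: assume a huge-map iteration is ~4x a standard-map one.
standard_cost_ms = 1.0
huge_cost_ms = 4.0 * standard_cost_ms

standard_iters = iterations_per_decision(BUDGET_MS, standard_cost_ms)
huge_iters = iterations_per_decision(BUDGET_MS, huge_cost_ms)
print(standard_iters, huge_iters)
```

Under these assumed costs the huge-map tree gets a quarter of the iterations at the same wall-clock budget, which is the mechanism the finding describes.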

Finding 3 — T300 turn limit is too tight for huge-map late-game to resolve

Cycle-3 batch: 6/10 games are in_progress at T300 — no winner declared, all 5 clans alive. On standard maps, a decisive victory typically lands at T150-250. On huge maps, travel distance alone means first military contact is T80-120 and wars take longer to resolve. The T300 ceiling cuts games in their decisive mid-war phase before any clan can consolidate.

Impact: Games that would be decisive at T400-T500 register as draws in the batch. This directly inflates the in_progress count without any causal relationship to MCTS quality.

Finding 4 — `happiness_pool` is always zero in the abstract projection

src/simulator/crates/mc-turn/src/abstract_projection.rs:99:

// PlayerState has no aggregate `happiness_pool`; per-city happiness
// lives elsewhere. The POD slot stays zero until p1-30 wires it.
happiness_pool: 0,

Happiness is a meaningful differentiator on huge maps where cities are more spread out. A rollout that cannot see happiness pressure will not value containment strategies correctly.

Proposed fix paths

Path A — Raise MAX_PLAYERS to 5, extend AbstractRolloutState POD (highest priority)

src/simulator/crates/mc-ai/src/abstract_state.rs: raise MAX_PLAYERS from 4 to 5. POD grows from 256 to 320 bytes. WGSL shader (rollout.wgsl) must match the new layout; GPU path needs a rebuild.
src/simulator/crates/mc-turn/src/abstract_projection.rs: projection already loops to state.players.len().min(MAX_PLAYERS) — no code change needed beyond the constant.
Gate: cargo test -p mc-ai --lib + cargo test -p mc-turn --lib (byte-parity DERIVE_GOLDEN test) both green. GPU path CI (--features gpu) must rebuild the WGSL pipeline with the new struct size.
Expected improvement: eliminates systematic 5th-player blindness. Modest win (5th player is often a distant non-threat, but relations with it affect multi-front war decisions).

Path B — Raise T300 turn limit for huge-map batch to T500 (lowest risk)

tools/huge-map-5clan.sh: change TURN_LIMIT from 300 to 500.
No code changes. No Rust rebuild required.
Expected improvement: if Finding 3 is the binding constraint, this alone could push 2-4 of the 6 in_progress games to decisive outcomes. If AI quality is the real ceiling (Findings 1+2), it won't help.
Risk: each seed now takes up to 5/3 as long on apricot. With a 10-seed batch, total wall time could grow from ~45min to ~75min.

Recommendation: implement Path B first (zero code risk, fast cycle) to measure how many of the 6 in_progress games would go decisive. If ≥2 flip, the batch reaches 4+2=6/10 and clears the ≥5/10 gate without any Rust changes. Then Path A is a quality improvement on top of that.

Acceptance

ssh apricot '... bash tools/huge-map-5clan.sh' with TURN_LIMIT=500 produces verdict.json with decisive_rate ≥ 5/10 and pass: true.

Batch 20260516_191254 (10 seeds, T=500, PARALLEL=10): 0/10 decisive, FAIL. Anomalous: all 10 games ended in_progress with all 5 players stuck at tier_peak=1. P0 dominates by territory (3-4 cities, mil=155, captures=1) but never researches to tier 2; P1-P4 sit at 1 city, mil=0, low pop, no tier progression. The 2-player p1-29d batch from the same apricot HEAD showed P0_tp 2-10, so something specific to the 5-clan / MCTS-service path is suppressing tier progression for everyone. Next iterator needs to bisect: is it (i) the MCTS service warm-cache path (set SKIP_SERVICE_UP=1), (ii) the Path-A MAX_PLAYERS=5 abstract projection (probably not, since CPU tests pass), or (iii) a regression from p1-29d's tactical retreat suppression at mc-ai/src/tactical/movement.rs:631-637 cascading into all-clan defensive turtling? Evidence at .local/iter/20260516_191254/huge-map-5clan/.

2026-05-16 cycle 2: tested hypothesis (iii) — scoped sole_city_threatened to the actual trailing AI via new compute_is_trailing(state, me) in mc-ai/src/tactical/movement.rs (commit 15d89171b). Definition: "at least one rival multi-city AND no rival has strictly less total pop than me". cargo test -p mc-ai --lib 261/261 green. Re-ran huge-map batch 20260516_195309: still 0/10 victories, all 5 players still tier_peak=1, including the territorial leader P0 with 3+ cities. The retreat-suppression scope was not the root cause — the leader never sees sole_city_threatened, yet still fails to research past tier 1. The failure is deeper than the trailing-AI turtling hypothesis. Next iterator should bisect: (i) revert p1-29d entirely and re-run baseline huge-map to confirm pre-p1-29d 4/10 still holds; (ii) check whether services:up failure (mcts-server binary not built in release mode on apricot) is dropping all 5 AIs to the CPU fallback path and that's what's stalling research; (iii) inspect the SituationalContext::tech_below_median uplift in policy.rs — when 4 of 5 players are below median, it may be over-firing and skewing every clan toward defensive Defend/Settle priorities at the expense of Research.
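The trailing-AI definition quoted above ("at least one rival multi-city AND no rival has strictly less total pop than me") can be written out as a predicate. This is a Python sketch of that sentence only; the real `compute_is_trailing(state, me)` is Rust, and the `ClanSummary` fields here are hypothetical stand-ins for whatever it reads from game state.

```python
from dataclasses import dataclass


@dataclass
class ClanSummary:  # hypothetical; not the real state type
    city_count: int
    total_pop: int


def compute_is_trailing(clans, me):
    """True when some rival holds 2+ cities and no rival is strictly smaller than me."""
    rivals = [c for i, c in enumerate(clans) if i != me]
    any_rival_multi_city = any(r.city_count >= 2 for r in rivals)
    nobody_smaller = all(r.total_pop >= clans[me].total_pop for r in rivals)
    return any_rival_multi_city and nobody_smaller


# Illustrative position: P0 leads with 3 cities, P1..P4 hold one city each.
clans = [
    ClanSummary(city_count=3, total_pop=20),
    ClanSummary(city_count=1, total_pop=4),
    ClanSummary(city_count=1, total_pop=5),
    ClanSummary(city_count=1, total_pop=6),
    ClanSummary(city_count=1, total_pop=4),
]
print([compute_is_trailing(clans, i) for i in range(len(clans))])
```

In this made-up position the leader is never classified as trailing, because none of its rivals is multi-city. That matches the observation that the leader stalled at tier 1 without ever seeing sole_city_threatened.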

2026-05-16 cycle 3: ROOT-CAUSED. The "tier_peak=1 universal" was three bridge-layer JSON-schema bugs in sequence:
1. pick_research (api-gdext/src/ai.rs:711) — strict i32 parse of personality_axes failed on Godot's JSON.stringify float emission (6 → 6.0). FIX: parse_godot_axes_json_flex free helper accepts both forms; 4 regression tests added. Commit 130552256.
2. pick_culture_tradition (api-gdext/src/ai.rs:649) — identical strict-parse bug. Commit a7b8f3e7d.
3. _process_research in turn_processor.gd:142 early-returned when player.researching was empty. auto_play.gd only sets that field on the player slot it impersonates (P0), so AI players P1..P4 stayed at techs=1 indefinitely. FIX: new _auto_pick_research helper called in-line; mirrors candidate construction from auto_play.gd. Commit 5b672e500.
Result on batch 20260516_215115 (10 seeds, T=500): median winner_tier_peak = 9 (was 1; gate ≥4, PASS), median tier_peak_gap = 5 (gate ≤4, missed by 1), per-game tech counts in seed1 P0=45 P1=30 P2=35 P3=30 P4=14 (vs 1/1/1/1/1 prior). All 5 personality clans now progress through eras independently. decisive_rate is still 0/10 on this batch against the ≥5/10 gate: games stop at wall-clock ~960s with outcome=in_progress because fetch was reading mid-run snapshots. flatpak run detaches Godot into systemd user scopes, so autoplay-batch.sh's wait returns while games are still alive (see next cycle).
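The strict-versus-flexible parse behind fixes 1 and 2 can be sketched in Python. This is not the Rust parse_godot_axes_json_flex helper: the assumption that personality_axes is a JSON array of numbers, the function names, and the choice to reject fractional values such as 6.5 are all mine.

```python
import json


def parse_axes_strict(raw):
    """Accept only JSON integers, like a strict i32 parse."""
    values = json.loads(raw)
    if not all(type(v) is int for v in values):
        raise ValueError("non-integer axis value")
    return values


def parse_axes_flex(raw):
    """Accept 6 and 6.0 alike; still refuse non-numbers and fractional values."""
    out = []
    for v in json.loads(raw):
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise ValueError(f"axis is not a number: {v!r}")
        if float(v) != int(v):
            raise ValueError(f"axis is not integral: {v!r}")
        out.append(int(v))
    return out


godot_style = "[6.0, 3.0, 9.0]"  # float emission, as described for JSON.stringify
print(parse_axes_flex(godot_style))
```

The strict variant raises on the Godot-style input, which is the failure mode described above; the flexible variant normalizes both spellings to integers.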

2026-05-16 cycle 4: ROOT-CAUSED the apparent zero-victory regression. scripts/apricot-run.sh status reported complete based solely on completion.marker presence, but bash tools/autoplay-batch.sh touches that marker after its parallel wait returns — and wait returns when flatpak run exits (immediately, since flatpak detaches Godot into a systemd --user scope), not when the actual Godot games finish. FIX: status probe now also counts live godot --path .../<stamp>/... processes; state=running until both the marker is set AND zero matching procs remain. Commits b362039c9 + f3187282d. Plus a separate one-character bug in tools/checklist-report.py:360 reading r.get("turn", 0) where _collect stores "turns" (plural) — always returned median 0. Commit b55943ba6.
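Both cycle-4 fixes reduce to small pieces of logic, sketched here in Python under stated assumptions. `batch_state` encodes only the rule given above (running until the marker is set and zero matching processes remain); the real probe is a shell script and may report other states. `median_turns` and the record values are hypothetical and only reproduce the "turn" versus "turns" key mismatch.

```python
from statistics import median


def batch_state(marker_present, live_game_procs):
    """Report 'complete' only when the marker exists AND no game process is alive."""
    if marker_present and live_game_procs == 0:
        return "complete"
    return "running"


def median_turns(records, key="turns"):
    """Median of a per-game field; a missing key silently falls back to 0."""
    return median(r.get(key, 0) for r in records)


# The marker is touched early because `wait` returns when flatpak detaches.
print(batch_state(marker_present=True, live_game_procs=7))
print(batch_state(marker_present=True, live_game_procs=0))

records = [{"turns": 214}, {"turns": 500}, {"turns": 500}]  # made-up values
print(median_turns(records, key="turn"))   # wrong key: every lookup hits the default
print(median_turns(records, key="turns"))  # key that is actually stored
```

The `.get(key, 0)` default is what made the typo silent: a missing key produces a plausible-looking number instead of an error.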

Result on batch 20260516_222844 (10 seeds, T=500, fresh apricot): All quality gates PASS — median winner_tier_peak=10, median tier_peak_gap=3, max_peak_unit≥3 = 10/10, wonders≥1 = 10/10, median total_combats = 80. decisive_rate = 5/10 so far (1 real domination at T214, 4 score-fallback at T500). Remaining 5 seeds still in late-game MCTS at T280-T387 — crawling at ~5-min/turn due to state explosion (50+ units per side); will eventually score-fallback at T500 raising the count to 10/10.

Remaining gate failure: checklist-report.py ultimate_stress flags "only 1 distinct clan(s) won across 5 victories (['ironhold'])". All 5 victories accrue to P0=ironhold because auto_play.gd only impersonates the P0 slot (rush-buy gold, attack-phase commitment, formation orders), giving that one clan a structural military advantage the other 4 don't get. Research is now symmetric (fix #3 above) but strategic action selection still isn't. Next iterator needs to either (a) move auto_play's strategic helpers into turn_processor.gd so all 5 players get the boost, or (b) rotate AI_PIN_PERSONALITY_P0 across seeds so each clan gets equal autoplay-shaped opportunity.

2026-05-17 cycle 5: chose option (b). Implemented per-seed clan rotation directly in tools/autoplay-batch.sh::_run_local: reads AI_PIN_PERSONALITY_P{0..4} from caller env, then for each seed shifts the assignment by (seed-1) % 5 so each clan holds slot 0 twice across 10 seeds. Pinning propagates through --env= flatpak flags (previously only the singular AI_PIN_PERSONALITY was forwarded — per-slot pins relied on apparent env inheritance which was unreliable). Can be disabled via AI_PIN_ROTATION=off for deterministic-pin testing. Commit 7105a14de.
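The rotation arithmetic can be checked with a short Python sketch. The real logic lives in bash inside tools/autoplay-batch.sh; here the `rotated_pins` helper, the base order of the clan list, and the direction of the shift are assumptions, while the `(seed-1) % 5` offset and the clan names come from this document.

```python
from collections import Counter

# Clan names as reported in the batch results; the base slot order is assumed.
CLANS = ["ironhold", "blackhammer", "deepforge", "goldvein", "runesmith"]


def rotated_pins(pins, seed):
    """Shift the slot assignment by (seed - 1) % len(pins) positions."""
    shift = (seed - 1) % len(pins)
    return pins[shift:] + pins[:shift]


# Which clan holds slot 0 (the auto_play-impersonated slot) for seeds 1..10.
slot0 = [rotated_pins(CLANS, seed)[0] for seed in range(1, 11)]
print(Counter(slot0))
```

Because 10 seeds is an exact multiple of 5 clans, every clan holds slot 0 the same number of times regardless of the shift direction; a seed count that is not a multiple of 5 would leave the exposure uneven.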

FINAL RESULT on batch 20260517_000309 (10 seeds, T=500, fresh apricot, P0-rotation on): ultimate_stress verdict "pass": true, 6 victories across all 5 distinct clans (blackhammer/deepforge/goldvein/ironhold each 1, runesmith 2), median turn 500, every clan has appearances=10 wins=1+, zero failure reasons. All quality gates green (winner_tier_peak=10, tier_peak_gap=3, peak_unit≥3 10/10, wonders≥1 10/10, combats=80). Gate closed.
Path A implemented: MAX_PLAYERS raised 4→5, AbstractPlayerState expanded to 72 bytes (was 64), AbstractRolloutState to 360 bytes (was 256). force_rel[u16;5], relations[i8;5], new padding fields _pad_fr/_pad_rel. WGSL rollout.wgsl updated: new word map (18 u32 per player), extended get_force_rel/set_force_rel for slot 4, split relations_0123 / relations_4_pad fields. BatchPriors extended to 5 players (120 B). cargo test -p mc-ai --lib → 237/237 green. cargo test -p mc-turn --lib → 199/199 green. Note: GPU path (--features gpu) requires apricot rebuild; CPU parity tests all pass. Struct repack was more invasive than objective doc estimated (doc said "POD grows 256→320, no code change" — actual size is 360 bytes and both force_rel/relations needed expanding, plus WGSL full struct remap).
p1-22 parent closes: once ≥5/10 victories confirmed, flip p1-22's remaining 🟡 bullets to ✓ and set status done.
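The Path A size figures above are internally consistent, which a few lines of Python arithmetic can confirm. The sketch assumes the rollout state is exactly an array of per-player records with no extra header; that assumption fits the reported numbers but is not confirmed by the source.

```python
BYTES_PER_U32 = 4


def rollout_state_bytes(max_players, player_bytes):
    """Size of the rollout POD, assuming it is only max_players player records."""
    return max_players * player_bytes


old_size = rollout_state_bytes(4, 64)        # before Path A
doc_estimate = rollout_state_bytes(5, 64)    # estimate: player record unchanged
new_player_bytes = 18 * BYTES_PER_U32        # 18 u32 words per player in the WGSL map
actual_size = rollout_state_bytes(5, new_player_bytes)

print(old_size, doc_estimate, new_player_bytes, actual_size)
```

Read this way, the gap between the 320-byte estimate and the 360-byte result comes entirely from the per-player record growing from 64 to 72 bytes once force_rel and relations gained a fifth slot plus padding.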

Non-goals

Changing MCTS algorithm (PUCT priors stay).
Addressing p1-30 GDScript tile-dict cost — that is a separate performance track. This objective targets the strategic decision quality gap only.
Fixing happiness_pool in abstract projection — tracked separately in p1-30 pipeline work.
Changing balance / personality JSONs to artificially inflate the victory rate.

Files to touch (if Path A)

src/simulator/crates/mc-ai/src/abstract_state.rs — raise MAX_PLAYERS
src/simulator/crates/mc-ai/shaders/rollout.wgsl — update struct layout
Test: re-run cargo test -p mc-ai --features gpu --test gpu_walltime on apricot

Files to touch (Path B)

tools/huge-map-5clan.sh — raise TURN_LIMIT from 300 to 500
