diff --git a/.project/objectives/p1-29d-p1-survival.md b/.project/objectives/p1-29d-p1-survival.md
index d941bef4..1bb42161 100644
--- a/.project/objectives/p1-29d-p1-survival.md
+++ b/.project/objectives/p1-29d-p1-survival.md
@@ -255,6 +255,20 @@ Result — P1 (slot 1) state at T100, all 10 seeds:
 
 **No balance/research/production code written.** The objective's blocker is now precisely characterized and is an AI-capability problem, not a balance gate. Forwarded to operator for redirection: (i) retarget p1-29d at mc-ai offensive competence (or fold into p1-29f/g), (ii) redefine the gate to be measured against a competent reference attacker (the juiced surface, accepted as the stand-in it is), or (iii) accept that fair scripted Game-1 duels do not converge by T100 and the "trailing AI" concept only applies vs a stronger opponent.
 
+### Refinement (2026-06-03) — it is NOT "the AI won't attack"; the war is INDECISIVE (contact+combat probe, omniscient)
+
+Adversarial check (does contact/combat even happen, or is 0/10 a no-contact map artifact?). Probed seeds 1/6/7 omniscient, tracking inter-player distance, unit losses, and city captures (evidence `.local/p1-29d-contact-evidence.txt`):
+
+| seed | capdist | min inter-player dist ever | P0 unit loss | P1 unit loss | P0 lost a city | P1 lost a city |
+|---|---|---|---|---|---|---|
+| 1 | 12.0 | 0.0 | yes | yes | no | no |
+| 6 | 10.6 | 0.0 | yes | yes | no | **yes** |
+| 7 | 9.4 | 0.0 | yes | yes | no | **yes** |
+
+So on the clean surface the two civs **DO make contact** (adjacent/same-tile, dist→0 every seed), **DO fight** (both lose units every seed), and cities **DO get captured** (P1 lost a city in s6/s7) — yet **neither is ever eliminated**: the loser refounds/recovers and both empires keep growing to 8–10+ cities by T90. The earlier "AI doesn't conduct offense" phrasing is therefore **wrong/too strong**; the accurate finding is: **the fair scripted mc-ai wages an *indecisive* war — it skirmishes and even captures, but cannot deliver a killing blow before the loser recovers.** The slot-0 juice (rush-buy + attack-phase commit + formation orders) supplied the *decisiveness / siege-conversion tempo* that turns a capture into an elimination; remove it and captures are traded, not finished. That is why the whole p1-29 family's P1-side balance levers "never moved the dial": no buff to P1 induces convergence when the *opponent* cannot close out a win.
+
+**Scope discipline:** measured against the `scripted:default` tactical controller via `suggest()` (dispatch.rs:984), **corroborated** by the apricot batch's slot-1 (plain-AI) behaviour — it never eliminated the juiced P0 and survived only as an ignored zombie. Not a blanket claim about the full MCTS+tactical pipeline (the batch AI slots also run MCTS; `mcts_*` stats), which this probe did not isolate. **Caveat:** the harness map is not fully gate-faithful — `player_api_main.gd:127 gen.generate(seed_v, map_size)` ignores `map_type` (so it is NOT pangaea) and uses spaced capitals; a map-faithful confirmation still wants the `AUTO_PLAY_ALL_AI` build on pangaea. But pangaea is *more* connected than the spaced-capital harness map, so it would if anything produce *more* contact, not less — the "indecisive war, no elimination" conclusion is unlikely to flip. **The lever is mc-ai siege/assault decisiveness, not balance.**
+
 ## Why this exists separately from p1-29c
 
 p1-29c's spec is "raise priority of Settle/Defend/Research when sole-city threatened." That work landed and is correct. The empirical failure mode is "P1 doesn't survive long enough to ACT on those priorities." That's a different code surface and a different design question — it deserves its own objective.
diff --git a/.project/objectives/p1-29f-learned-controller-bridge.md b/.project/objectives/p1-29f-learned-controller-bridge.md
index 616fc056..5b58eb84 100644
--- a/.project/objectives/p1-29f-learned-controller-bridge.md
+++ b/.project/objectives/p1-29f-learned-controller-bridge.md
@@ -2,11 +2,11 @@
 id: p1-29f
 title: "learned:* controller bridge — make the trained RL policy playable in-engine"
 priority: p1
-status: missing
+status: done
 scope: game1
 owner: simulator-infra
 tags: [ai, rl, controller, infra]
-updated_at: 2026-05-27
+updated_at: 2026-06-03
 relates_to: [p1-29e, p2-67]
 blocked_by: []
 ---
@@ -38,7 +38,7 @@ a registered `learned:*` `AiController`.
 
 ## Acceptance
 
-- [ ] **Runtime-loadable policy artifact.** Export `duel-v4-encfix-s7` from the
+- [x] **Runtime-loadable policy artifact.** Export `duel-v4-encfix-s7` from the
   SB3 `.zip` to a form the engine can evaluate without a Python runtime
   (recommended: ONNX of the policy network, loaded by `mc-ai` inference; or a
   documented sidecar process if in-proc inference is rejected). The encoder
@@ -47,19 +47,19 @@ a registered `learned:*` `AiController`.
   F1) must be reimplemented or shared so the in-engine observation matches
   training exactly — a `PlayerView` → obs-tensor mapping with a parity test
   against the Python encoder.
-- [ ] **`AiController` impl + registration.** A controller keyed
+- [x] **`AiController` impl + registration.** A controller keyed
   `learned:duel-v4-encfix-s7` implementing the `AiController` trait, registered
   via `register_controller` at engine init (alongside `scripted:default`), so it
   appears in `registered_ids()` / `GdGameState.registered_controller_ids()`.
-- [ ] **Per-slot selection from autoplay.** An env hook (e.g.
+- [x] **Per-slot selection from autoplay.** An env hook (e.g.
   `AI_CONTROLLER_P0` / `AI_CONTROLLER_P1`, or a generalised `AI_CONTROLLERS`
   list) that populates `game_settings.ai_controllers` so a batch can assign a
   learned controller to any slot. Default unchanged (`scripted:default`).
-- [ ] **Parity / determinism test.** Given a fixed `PlayerView`, the in-engine
+- [x] **Parity / determinism test.** Given a fixed `PlayerView`, the in-engine
   learned controller produces the same action distribution (argmax + top-k) as
   the Python policy on the same observation. Headless GUT-compatible or a Rust
   unit test against a recorded fixture.
-- [ ] **Smoke run.** One trained-vs-scripted autoplay game completes headless
+- [x] **Smoke run.** One trained-vs-scripted autoplay game completes headless
   with `learned:duel-v4-encfix-s7` in a slot and emits valid `turn_stats.jsonl`
   (proves the dispatch path works end-to-end — not a quality claim).
 
@@ -81,3 +81,49 @@ a registered `learned:*` `AiController`.
   programmatic, one-action-at-a-time player can drive a real game; that is a
   *different* external-process surface, but the action-application plumbing may
   be reusable.
+
+## Verification (landed 2026-06-03, branch `work/p1-29f`)
+
+All five acceptance bullets verified end-to-end. Evidence:
+
+1. **Runtime-loadable artifact + encoder parity.** ONNX at
+   `public/games/age-of-dwarves/data/ai/duel-v4-encfix-s7.onnx`, loaded by
+   `mc-player-api/src/learned/inference.rs` via pure-Rust `tract-onnx` (no
+   Python). Encoder reimpl in `learned/encoder.rs`. `cargo test -p
+   mc-player-api --test learned_parity` → `learned_encoder_parity` checks 60
+   fixtures, obs float-exact (<1e-4), action mask **bit-exact** vs the Python
+   encoder.
+2. **Registration in `registered_ids()`.** `register_learned_controllers()`
+   fires in `MagicCivPhysicsExtension::on_level_init(InitLevel::Scene)`. A
+   headless-Godot boot probe (`GdGameState.registered_controller_ids()`)
+   returns `learned:duel-v4-encfix-s7, scripted:default`.
+3. **Per-slot env selection.** Implemented via the **generalised-list** form
+   `CP_PLAYER_CONTROLLERS` (comma list, slot-ordered) → `set_player_controller`
+   in `player_api_main.gd`. Smoke boot assigned slot 1 → `learned:…`, slot 2 →
+   `scripted:default`; with no env set, slots default to `scripted:default`
+   (default unchanged). *Caveat:* this stamps `players[slot].controller_id`
+   directly rather than populating `game_settings.ai_controllers` (the
+   game-setup-UI field); the learned controller runs only in the
+   `mc_player_api` dispatch world, which is the path that actually executes the
+   policy — `auto_play.gd`'s `GdAiController` path can't, since the learned
+   controller's one-shot `decide_turn` is identity-only by design.
+4. **Parity / determinism.** `learned_policy_parity` → 60/60 decisive fixtures,
+   argmax + top-5 ordering match the Python policy exactly, logits within 1e-3;
+   `learned_decide_action_end_to_end_determinism` → 60 fixtures stable +
+   argmax-correct. Full crate suite: `cargo test -p mc-player-api -p mc-ai`
+   all green (265 lib + every integration bin, incl. `learned_parity` 3/3).
+5. **Trained-vs-scripted smoke.** 3-player headless game via the player-API
+   harness (slot 0 external pass-driver, slot 1 `learned:duel-v4-encfix-s7`,
+   slot 2 `scripted:default`), 30 turns, no crash. Learned slot **loaded its
+   policy** (no `[learned] policy load failed`) and **applied real actions on 9
+   of 30 turns**; per-turn stats emitted as JSONL. *Caveat:* the artifact is a
+   player-API-native per-turn JSONL, not `auto_play.gd`'s
+   `autoplay-result-schema.json` shape — learned controllers are architecturally
+   a player-API-world feature, so `auto_play.gd` (the canonical
+   `turn_stats.jsonl` emitter) cannot host one. The bullet's stated purpose —
+   "proves the dispatch path works end-to-end" — is met.
+
+*Landing note:* `scripts/player-api-server.sh` gained `MC_DATA_ROOT` /
+`MC_LEARNED_POLICY_PATH` flatpak `--env` passthrough — without it the in-sandbox
+`.so` cannot resolve the ONNX path and the learned slot silently passes every
+turn. Unblocks **p1-29g**.
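
Reviewer note on the parity criterion in Verification item 4 (argmax and top-5 ordering match, logits within 1e-3): the check can be sketched in Rust roughly as below. This is only an illustration of the criterion as worded in the objective, not the code behind `learned_policy_parity`. The names `top_k` and `parity_ok` are hypothetical, and the tie-breaking rule (lower index first) is an assumption.

```rust
/// Indices of the `k` largest logits, highest first.
/// Assumption: ties are broken by lower index (the sort is stable).
fn top_k(logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    // `total_cmp` gives a total order over f32, so masked (-inf) entries sort last.
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    idx.truncate(k);
    idx
}

/// Hypothetical parity predicate: both logit vectors have the same length,
/// the same top-k ordering (which includes the argmax as its first entry),
/// and every pair of logits differs by at most `tol`.
fn parity_ok(engine: &[f32], reference: &[f32], k: usize, tol: f32) -> bool {
    engine.len() == reference.len()
        && top_k(engine, k) == top_k(reference, k)
        && engine
            .iter()
            .zip(reference)
            .all(|(a, b)| (a - b).abs() <= tol)
}
```

Under these assumptions, a recorded fixture would pass when `parity_ok(&engine_logits, &python_logits, 5, 1e-3)` returns `true`. Comparing the full top-k ordering rather than only the argmax is the stricter choice, since it also catches drift that does not yet change the selected action.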