docs(objectives): 📝 revise survival and learned controller bridge objectives documentation

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
autocommit 2026-06-03 04:06:43 -07:00
parent 05795cc3f3
commit f2494b6f6f
2 changed files with 67 additions and 7 deletions


@@ -255,6 +255,20 @@ Result — P1 (slot 1) state at T100, all 10 seeds:
**No balance/research/production code written.** The objective's blocker is now precisely characterized and is an AI-capability problem, not a balance gate. Forwarded to operator for redirection: (i) retarget p1-29d at mc-ai offensive competence (or fold into p1-29f/g), (ii) redefine the gate to be measured against a competent reference attacker (the juiced surface, accepted as the stand-in it is), or (iii) accept that fair scripted Game-1 duels do not converge by T100 and the "trailing AI" concept only applies vs a stronger opponent.
### Refinement (2026-06-03) — it is NOT "the AI won't attack"; the war is INDECISIVE (contact+combat probe, omniscient)
Adversarial check (does contact/combat even happen, or is 0/10 a no-contact map artifact?). Probed seeds 1/6/7 omniscient, tracking inter-player distance, unit losses, and city captures (evidence `.local/p1-29d-contact-evidence.txt`):
| seed | capdist | min inter-player dist ever | P0 unit loss | P1 unit loss | P0 lost a city | P1 lost a city |
|---|---|---|---|---|---|---|
| 1 | 12.0 | 0.0 | yes | yes | no | no |
| 6 | 10.6 | 0.0 | yes | yes | no | **yes** |
| 7 | 9.4 | 0.0 | yes | yes | no | **yes** |
So on the clean surface the two civs **DO make contact** (adjacent/same-tile, min dist 0.0 on every probed seed), **DO fight** (both lose units on every probed seed), and cities **DO get captured** (P1 lost a city in s6/s7), yet **neither is ever eliminated**: the loser refounds/recovers and both empires keep growing to 8–10+ cities by T90. The earlier "AI doesn't conduct offense" phrasing is therefore **wrong/too strong**; the accurate finding is: **the fair scripted mc-ai wages an *indecisive* war: it skirmishes and even captures, but cannot deliver a killing blow before the loser recovers.** The slot-0 juice (rush-buy + attack-phase commit + formation orders) supplied the *decisiveness / siege-conversion tempo* that turns a capture into an elimination; remove it and captures are traded, not finished. That is why the whole p1-29 family's P1-side balance levers "never moved the dial": no buff to P1 induces convergence when the *opponent* cannot close out a win.
**Scope discipline:** measured against the `scripted:default` tactical controller via `suggest()` (dispatch.rs:984), and **corroborated** by the apricot batch's slot-1 (plain-AI) behaviour: it never eliminated the juiced P0 and survived only as an ignored zombie. This is not a blanket claim about the full MCTS+tactical pipeline (the batch AI slots also run MCTS; see the `mcts_*` stats), which this probe did not isolate. **Caveat:** the harness map is not fully gate-faithful: `player_api_main.gd:127 gen.generate(seed_v, map_size)` ignores `map_type` (so the map is NOT pangaea) and uses spaced capitals; a map-faithful confirmation still requires the `AUTO_PLAY_ALL_AI` build on pangaea. However, pangaea is *more* connected than the spaced-capital harness map, so it would if anything produce *more* contact, not less, and the "indecisive war, no elimination" conclusion is unlikely to flip. **The lever is mc-ai siege/assault decisiveness, not balance.**
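The per-seed columns in the probe table can be derived from an omniscient per-turn trace along the following lines. This is a minimal sketch, not the probe's actual code: the `TurnSnapshot` record, its field names, and the use of Euclidean distance between unit tiles are all assumptions made for illustration, since the format of the evidence file is not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import hypot


@dataclass
class TurnSnapshot:
    """Hypothetical omniscient per-turn record (field names are illustrative)."""
    turn: int
    units: dict[int, list[tuple[int, int]]]  # player -> tiles occupied by units
    units_killed: dict[int, int]              # player -> units lost this turn
    city_owner: dict[str, int]                # city id -> owning player


def summarize(trace: list[TurnSnapshot]) -> dict:
    """Reduce a two-player trace to the contact/combat/capture flags of the table."""
    min_dist = float("inf")
    unit_loss = {0: False, 1: False}
    lost_city = {0: False, 1: False}
    prev_owner: dict[str, int] = {}

    for snap in trace:
        # Contact: closest approach between any P0 unit and any P1 unit.
        for ax, ay in snap.units.get(0, []):
            for bx, by in snap.units.get(1, []):
                min_dist = min(min_dist, hypot(ax - bx, ay - by))

        # Combat: a player counts as having fought once it has lost a unit.
        for player, killed in snap.units_killed.items():
            if killed > 0:
                unit_loss[player] = True

        # Capture: a city whose owner changed since the previous turn.
        for city, owner in snap.city_owner.items():
            before = prev_owner.get(city)
            if before is not None and before != owner:
                lost_city[before] = True
        prev_owner = dict(snap.city_owner)

    return {"min_dist": min_dist, "unit_loss": unit_loss, "lost_city": lost_city}
```

A trace in which the two armies end up on the same tile, both lose units, and one P1 city changes hands reduces to the same pattern as the seed 6 and seed 7 rows. Elimination is deliberately not tracked here; it would be a separate check on whether a player's city and unit counts both reach zero.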
## Why this exists separately from p1-29c
p1-29c's spec is "raise priority of Settle/Defend/Research when sole-city threatened." That work landed and is correct. The empirical failure mode is "P1 doesn't survive long enough to ACT on those priorities." That's a different code surface and a different design question — it deserves its own objective.


@@ -2,11 +2,11 @@
id: p1-29f
title: "learned:* controller bridge — make the trained RL policy playable in-engine"
priority: p1
-status: missing
+status: done
scope: game1
owner: simulator-infra
tags: [ai, rl, controller, infra]
-updated_at: 2026-05-27
+updated_at: 2026-06-03
relates_to: [p1-29e, p2-67]
blocked_by: []
---
@@ -38,7 +38,7 @@ a registered `learned:*` `AiController`.
## Acceptance
-- [ ] **Runtime-loadable policy artifact.** Export `duel-v4-encfix-s7` from the
+- [x] **Runtime-loadable policy artifact.** Export `duel-v4-encfix-s7` from the
SB3 `.zip` to a form the engine can evaluate without a Python runtime
(recommended: ONNX of the policy network, loaded by `mc-ai` inference; or a
documented sidecar process if in-proc inference is rejected). The encoder
@@ -47,19 +47,19 @@ a registered `learned:*` `AiController`.
F1) must be reimplemented or shared so the in-engine observation matches
training exactly — a `PlayerView` → obs-tensor mapping with a parity test
against the Python encoder.
-- [ ] **`AiController` impl + registration.** A controller keyed
+- [x] **`AiController` impl + registration.** A controller keyed
`learned:duel-v4-encfix-s7` implementing the `AiController` trait, registered
via `register_controller` at engine init (alongside `scripted:default`), so
it appears in `registered_ids()` / `GdGameState.registered_controller_ids()`.
-- [ ] **Per-slot selection from autoplay.** An env hook (e.g.
+- [x] **Per-slot selection from autoplay.** An env hook (e.g.
`AI_CONTROLLER_P0` / `AI_CONTROLLER_P1`, or a generalised `AI_CONTROLLERS`
list) that populates `game_settings.ai_controllers` so a batch can assign a
learned controller to any slot. Default unchanged (`scripted:default`).
-- [ ] **Parity / determinism test.** Given a fixed `PlayerView`, the in-engine
+- [x] **Parity / determinism test.** Given a fixed `PlayerView`, the in-engine
learned controller produces the same action distribution (argmax + top-k) as
the Python policy on the same observation. Headless GUT-compatible or a Rust
unit test against a recorded fixture.
-- [ ] **Smoke run.** One trained-vs-scripted autoplay game completes headless
+- [x] **Smoke run.** One trained-vs-scripted autoplay game completes headless
with `learned:duel-v4-encfix-s7` in a slot and emits valid `turn_stats.jsonl`
(proves the dispatch path works end-to-end — not a quality claim).
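The parity bullet above compares the in-engine policy against the Python policy on argmax and top-k. A minimal sketch of such a comparison follows, assuming both sides expose a flat list of per-action logits for one fixture. The function names are hypothetical, and the defaults `k=5` and `tol=1e-3` are illustrative choices taken from the verification notes rather than from the real test.

```python
def top_k(logits: list[float], k: int) -> list[int]:
    """Action indices by descending logit; ties broken by lower index."""
    return sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]


def check_policy_parity(
    engine_logits: list[float],
    reference_logits: list[float],
    k: int = 5,
    tol: float = 1e-3,
) -> list[str]:
    """Return human-readable parity failures for one fixture (empty means pass)."""
    if len(engine_logits) != len(reference_logits):
        return ["logit vectors differ in length"]

    failures = []

    # Numerical closeness. Identical values (including matching -inf for
    # masked actions) count as zero difference instead of producing NaN.
    max_diff = max(
        (0.0 if a == b else abs(a - b)
         for a, b in zip(engine_logits, reference_logits)),
        default=0.0,
    )
    if max_diff > tol:
        failures.append(f"max logit difference {max_diff:.6f} exceeds {tol}")

    # Decision equivalence: the chosen action and the ranking of the top k.
    if top_k(engine_logits, 1) != top_k(reference_logits, 1):
        failures.append("argmax differs")
    if top_k(engine_logits, k) != top_k(reference_logits, k):
        failures.append(f"top-{k} ordering differs")

    return failures
```

Checking the ordering separately from the tolerance matters: two logit vectors can agree to within the tolerance and still rank near-tied actions differently, which is presumably why the verification restricts itself to "decisive" fixtures.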
@@ -81,3 +81,49 @@ a registered `learned:*` `AiController`.
programmatic, one-action-at-a-time player can drive a real game; that is a
*different* external-process surface, but the action-application plumbing may
be reusable.
## Verification (landed 2026-06-03, branch `work/p1-29f`)
All five acceptance bullets verified end-to-end. Evidence:
1. **Runtime-loadable artifact + encoder parity.** ONNX at
`public/games/age-of-dwarves/data/ai/duel-v4-encfix-s7.onnx`, loaded by
`mc-player-api/src/learned/inference.rs` via pure-Rust `tract-onnx` (no
Python). Encoder reimpl in `learned/encoder.rs`. `cargo test -p
mc-player-api --test learned_parity` → `learned_encoder_parity` checks 60
fixtures: observations match to within 1e-4 and the action mask is **bit-exact** vs the Python
encoder.
2. **Registration in `registered_ids()`.** `register_learned_controllers()`
fires in `MagicCivPhysicsExtension::on_level_init(InitLevel::Scene)`. A
headless-Godot boot probe (`GdGameState.registered_controller_ids()`)
returns `learned:duel-v4-encfix-s7, scripted:default`.
3. **Per-slot env selection.** Implemented via the **generalised-list** form
`CP_PLAYER_CONTROLLERS` (comma list, slot-ordered) → `set_player_controller`
in `player_api_main.gd`. Smoke boot assigned slot 1 → `learned:…`, slot 2 →
`scripted:default`; with no env set, slots default to `scripted:default`
(default unchanged). *Caveat:* this stamps `players[slot].controller_id`
directly rather than populating `game_settings.ai_controllers` (the
game-setup-UI field). The learned controller runs only in the
`mc_player_api` dispatch world, which is the path that actually executes the
policy; `auto_play.gd`'s `GdAiController` path cannot execute it, because the
learned controller's one-shot `decide_turn` is identity-only by design.
4. **Parity / determinism.** `learned_policy_parity` → 60/60 decisive fixtures,
argmax + top-5 ordering match the Python policy exactly, logits within 1e-3;
`learned_decide_action_end_to_end_determinism` → 60 fixtures stable +
argmax-correct. Full crate suite: `cargo test -p mc-player-api -p mc-ai`
all green (265 lib + every integration bin, incl. `learned_parity` 3/3).
5. **Trained-vs-scripted smoke.** 3-player headless game via the player-API
harness (slot 0 external pass-driver, slot 1 `learned:duel-v4-encfix-s7`,
slot 2 `scripted:default`), 30 turns, no crash. Learned slot **loaded its
policy** (no `[learned] policy load failed`) and **applied real actions on 9
of 30 turns**; per-turn stats emitted as JSONL. *Caveat:* the artifact is a
player-API-native per-turn JSONL, not `auto_play.gd`'s
`autoplay-result-schema.json` shape — learned controllers are architecturally
a player-API-world feature, so `auto_play.gd` (the canonical
`turn_stats.jsonl` emitter) cannot host one. The bullet's stated purpose —
"proves the dispatch path works end-to-end" — is met.
*Landing note:* `scripts/player-api-server.sh` gained `MC_DATA_ROOT` /
`MC_LEARNED_POLICY_PATH` flatpak `--env` passthrough — without it the in-sandbox
`.so` cannot resolve the ONNX path and the learned slot silently passes every
turn. Unblocks **p1-29g**.
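
The environment plumbing in item 3 and the landing note can be sketched as below. This is an illustration in Python, not the GDScript/Rust that landed. The function names are hypothetical, and several behaviours are assumptions: that an empty list entry keeps the slot default, that `MC_LEARNED_POLICY_PATH` takes precedence over `MC_DATA_ROOT`, and that the policy file sits at `ai/<name>.onnx` under the data root. The one design point taken from the note is that a missing policy file should fail loudly rather than let the learned slot silently pass every turn.

```python
from pathlib import Path

DEFAULT_CONTROLLER = "scripted:default"


def parse_controller_list(raw: str, num_slots: int, registered: set[str]) -> list[str]:
    """Map a comma-separated, slot-ordered list onto per-slot controller ids."""
    slots = [DEFAULT_CONTROLLER] * num_slots
    if not raw:
        return slots  # env unset: default behaviour is unchanged
    for slot, entry in enumerate(raw.split(",")):
        controller_id = entry.strip()
        if not controller_id:
            continue  # assumption: an empty entry keeps that slot's default
        if slot >= num_slots:
            raise ValueError(f"entry {slot} given but only {num_slots} slots exist")
        if controller_id not in registered:
            raise ValueError(f"unregistered controller id: {controller_id!r}")
        slots[slot] = controller_id
    return slots


def resolve_policy_path(env: dict[str, str], policy_name: str) -> Path:
    """Locate the ONNX file, raising instead of degrading to a no-op policy."""
    explicit = env.get("MC_LEARNED_POLICY_PATH")
    if explicit:
        candidate = Path(explicit)
    else:
        root = env.get("MC_DATA_ROOT")
        if not root:
            raise FileNotFoundError(
                "neither MC_LEARNED_POLICY_PATH nor MC_DATA_ROOT is set"
            )
        # Assumption: hypothetical layout of the data root.
        candidate = Path(root) / "ai" / f"{policy_name}.onnx"
    if not candidate.is_file():
        raise FileNotFoundError(f"learned policy not found at {candidate}")
    return candidate
```

In a harness this would be driven by something like `parse_controller_list(os.environ.get("CP_PLAYER_CONTROLLERS", ""), 3, registered_ids)`, with `resolve_policy_path(os.environ, ...)` called once at controller registration so that a sandbox that drops the variables is caught at boot.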