diff --git a/.project/objectives/p1-29f-learned-controller-bridge.md b/.project/objectives/p1-29f-learned-controller-bridge.md
new file mode 100644
index 00000000..616fc056
--- /dev/null
+++ b/.project/objectives/p1-29f-learned-controller-bridge.md
@@ -0,0 +1,83 @@
+---
+id: p1-29f
+title: "learned:* controller bridge — make the trained RL policy playable in-engine"
+priority: p1
+status: missing
+scope: game1
+owner: simulator-infra
+tags: [ai, rl, controller, infra]
+updated_at: 2026-05-27
+relates_to: [p1-29e, p2-67]
+blocked_by: []
+---
+
+## Context
+
+The pluggable controller seam exists and is exercised every batch
+(`mc_player_api::controllers` registry, `DEFAULT_CONTROLLER_ID =
+"scripted:default"`, `GdAiController` bridge, `mod_loader.gd` boot scan,
+`registered_controller_ids()` for the game-setup UI). But the **only**
+controllers ever registered into the engine dispatch path are
+`scripted:default` (in-box) and WASM/native mod controllers via
+`mc-mod-host`. The `learned:*` controller is documented as "Stage 6" in
+`controllers.rs` docstrings and **does not exist** as code.
+
+Meanwhile a strong trained policy already exists, offline only:
+`tooling/rl_self_play/models/duel-v4-encfix-s7/best_model.zip` — a
+Stable-Baselines3 MaskablePPO net, **+5.5 mean reward vs the scripted MCTS
+baseline** (per p1-29e). It runs in the Python RL loop and the divergence
+miner, never in a real game. Every autoplay batch and the shipping game can
+therefore only run the scripted AI. There is no way today to run
+trained-vs-scripted or trained-vs-trained.
+
+This is the seam p1-29e's knowledge-transfer approach worked *around*: rather
+than play the trained policy, p1-29e mined it and hand-ported one insight
+into the scripted controller. To actually *use* the trained AI — as a quality
+benchmark, a difficulty tier, or a shipping opponent — the policy must become
+a registered `learned:*` `AiController`.
+
+## Acceptance
+
+- [ ] **Runtime-loadable policy artifact.** Export `duel-v4-encfix-s7` from the
+  SB3 `.zip` to a form the engine can evaluate without a Python runtime
+  (recommended: ONNX of the policy network, loaded by `mc-ai` inference; or a
+  documented sidecar process if in-proc inference is rejected). The encoder
+  (`tooling/rl_self_play/encoders.py`: `end_turn` + per-unit micro + per-city
+  `queue_production` over a 16-item roster, **no research action** per p1-29e
+  F1) must be reimplemented or shared so the in-engine observation matches
+  training exactly — a `PlayerView` → obs-tensor mapping with a parity test
+  against the Python encoder.
+- [ ] **`AiController` impl + registration.** A controller keyed
+  `learned:duel-v4-encfix-s7` implementing the `AiController` trait, registered
+  via `register_controller` at engine init (alongside `scripted:default`), so
+  it appears in `registered_ids()` / `GdGameState.registered_controller_ids()`.
+- [ ] **Per-slot selection from autoplay.** An env hook (e.g.
+  `AI_CONTROLLER_P0` / `AI_CONTROLLER_P1`, or a generalised `AI_CONTROLLERS`
+  list) that populates `game_settings.ai_controllers` so a batch can assign a
+  learned controller to any slot. Default unchanged (`scripted:default`).
+- [ ] **Parity / determinism test.** Given a fixed `PlayerView`, the in-engine
+  learned controller produces the same action distribution (argmax + top-k) as
+  the Python policy on the same observation. Headless GUT-compatible or a Rust
+  unit test against a recorded fixture.
+- [ ] **Smoke run.** One trained-vs-scripted autoplay game completes headless
+  with `learned:duel-v4-encfix-s7` in a slot and emits valid `turn_stats.jsonl`
+  (proves the dispatch path works end-to-end — not a quality claim).
+
+## Non-goals
+
+- Quality re-verification of Game-1 gates against the trained AI — that is
+  **p1-29g** (depends on this).
+- Training new policies / changing the RL loop — owned by the RL track.
+- Shipping a learned opponent as the Game-1 default — Game 1 ships the 5
+  scripted clan personalities; a learned tier is a later product decision.
+
+## Notes
+
+- The trained policy has **no research action** (p1-29e F1), so a
+  `learned:*`-driven slot will pursue tech only as a side-effect of whatever
+  the engine auto-resolves for research — relevant when interpreting p1-29g
+  `tier_peak` results.
+- Builds on p2-67 (Claude-driven player API) which already proved a
+  programmatic, one-action-at-a-time player can drive a real game; that is a
+  *different* external-process surface, but the action-application plumbing may
+  be reusable.
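
Illustrative sketch for the parity / determinism acceptance item (not part of the objective file). One possible way to compare the in-engine output with the Python policy on a recorded fixture is to apply the same action mask to both logit vectors, then require an identical argmax, an identical top-k order, and probabilities that agree within a tolerance. The names `masked_softmax`, `top_k`, and `check_parity`, the `k=3` default, and the `1e-4` tolerance are all assumptions made for this example, and the logit values in the demo are invented rather than recorded from `duel-v4-encfix-s7`.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over legal actions only; masked-out actions get probability 0."""
    legal = [l for l, ok in zip(logits, mask) if ok]
    if not legal:
        raise ValueError("action mask leaves no legal action")
    peak = max(legal)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Indices of the k most probable actions; ties go to the lower index."""
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    return order[:k]


def check_parity(engine_logits, python_logits, mask, k=3, tol=1e-4):
    """Compare two logit vectors under one mask. Returns (ok, reason)."""
    if not (len(engine_logits) == len(python_logits) == len(mask)):
        return False, "shape mismatch"
    p_engine = masked_softmax(engine_logits, mask)
    p_python = masked_softmax(python_logits, mask)
    # Never rank more actions than are legal, so zero-probability
    # (masked) actions cannot enter the top-k comparison.
    k = min(k, sum(1 for ok in mask if ok))
    top_engine, top_python = top_k(p_engine, k), top_k(p_python, k)
    if top_engine[0] != top_python[0]:
        return False, "argmax differs"
    if top_engine != top_python:
        return False, "top-k order differs"
    worst = max(abs(a - b) for a, b in zip(p_engine, p_python))
    if worst > tol:
        return False, "probabilities differ"
    return True, "ok"


if __name__ == "__main__":
    # Hypothetical fixture: four actions, action 2 is illegal in this state.
    mask = [True, True, False, True]
    python_logits = [2.0, 0.5, 9.0, 1.0]
    engine_logits = [2.0 + 1e-6, 0.5, -3.0, 1.0]
    print(check_parity(engine_logits, python_logits, mask))
```

Near-tied logits can swap rank under floating-point noise between two runtimes, so a recorded fixture would probably need to avoid near-ties, or the ordering check would need to be relaxed for entries whose probabilities fall within the tolerance of each other.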