docs(objectives): 📝 Update learned controller bridge documentation with technical details, design, implementation, and use cases

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-05-27 23:07:16 -07:00 · 2026-05-27 23:07:16 -07:00 · 91346a7ae9
commit 91346a7ae9
parent 5b2cf24293
1 changed file with 83 additions and 0 deletions
--- a/.project/objectives/p1-29f-learned-controller-bridge.md
+++ b/.project/objectives/p1-29f-learned-controller-bridge.md
@@ -0,0 +1,83 @@
+---
+id: p1-29f
+title: "learned:* controller bridge — make the trained RL policy playable in-engine"
+priority: p1
+status: missing
+scope: game1
+owner: simulator-infra
+tags: [ai, rl, controller, infra]
+updated_at: 2026-05-27
+relates_to: [p1-29e, p2-67]
+blocked_by: []
+---
+
+## Context
+
+The pluggable controller seam exists and is exercised every batch
+(`mc_player_api::controllers` registry, `DEFAULT_CONTROLLER_ID =
+"scripted:default"`, `GdAiController` bridge, `mod_loader.gd` boot scan,
+`registered_controller_ids()` for the game-setup UI). But the **only**
+controllers ever registered into the engine dispatch path are
+`scripted:default` (in-box) and WASM/native mod controllers via
+`mc-mod-host`. The `learned:*` controller is documented as "Stage 6" in
+`controllers.rs` docstrings and **does not exist** as code.
+
+Meanwhile a strong trained policy already exists, offline only:
+`tooling/rl_self_play/models/duel-v4-encfix-s7/best_model.zip` — a
+Stable-Baselines3 MaskablePPO net, **+5.5 mean reward vs the scripted MCTS
+baseline** (per p1-29e). It runs in the Python RL loop and the divergence
+miner, never in a real game. Every autoplay batch and the shipping game can
+therefore only run the scripted AI. There is no way today to run
+trained-vs-scripted or trained-vs-trained.
+
+This is the seam p1-29e's knowledge-transfer approach worked *around*: rather
+than play the trained policy, p1-29e mined it and hand-ported one insight
+into the scripted controller. To actually *use* the trained AI — as a quality
+benchmark, a difficulty tier, or a shipping opponent — the policy must become
+a registered `learned:*` `AiController`.
+
+## Acceptance
+
+- [ ] **Runtime-loadable policy artifact.** Export `duel-v4-encfix-s7` from the
+  SB3 `.zip` to a form the engine can evaluate without a Python runtime
+  (recommended: ONNX of the policy network, loaded by `mc-ai` inference; or a
+  documented sidecar process if in-proc inference is rejected). The encoder
+  (`tooling/rl_self_play/encoders.py`: `end_turn` + per-unit micro + per-city
+  `queue_production` over a 16-item roster, **no research action** per p1-29e
+  F1) must be reimplemented or shared so the in-engine observation matches
+  training exactly — a `PlayerView` → obs-tensor mapping with a parity test
+  against the Python encoder.
+- [ ] **`AiController` impl + registration.** A controller keyed
+  `learned:duel-v4-encfix-s7` implementing the `AiController` trait, registered
+  via `register_controller` at engine init (alongside `scripted:default`), so
+  it appears in `registered_ids()` / `GdGameState.registered_controller_ids()`.
+- [ ] **Per-slot selection from autoplay.** An env hook (e.g.
+  `AI_CONTROLLER_P0` / `AI_CONTROLLER_P1`, or a generalised `AI_CONTROLLERS`
+  list) that populates `game_settings.ai_controllers` so a batch can assign a
+  learned controller to any slot. Default unchanged (`scripted:default`).
+- [ ] **Parity / determinism test.** Given a fixed `PlayerView`, the in-engine
+  learned controller produces the same action distribution (argmax + top-k) as
+  the Python policy on the same observation. Implemented as a headless
+  GUT-compatible test or a Rust unit test against a recorded fixture.
+- [ ] **Smoke run.** One trained-vs-scripted autoplay game completes headless
+  with `learned:duel-v4-encfix-s7` in a slot and emits valid `turn_stats.jsonl`
+  (proves the dispatch path works end-to-end — not a quality claim).
+
+## Non-goals
+
+- Quality re-verification of Game-1 gates against the trained AI — that is
+  **p1-29g** (depends on this).
+- Training new policies / changing the RL loop — owned by the RL track.
+- Shipping a learned opponent as the Game-1 default — Game 1 ships the 5
+  scripted clan personalities; a learned tier is a later product decision.
+
+## Notes
+
+- The trained policy has **no research action** (p1-29e F1), so a
+  `learned:*`-driven slot will pursue tech only as a side-effect of whatever
+  the engine auto-resolves for research — relevant when interpreting p1-29g
+  `tier_peak` results.
+- Builds on p2-67 (Claude-driven player API), which already proved that a
+  programmatic, one-action-at-a-time player can drive a real game. That is a
+  *different* external-process surface, but the action-application plumbing may
+  be reusable.
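
The parity criterion in the Acceptance list (same argmax and top-k as the Python policy on a fixed observation) could be checked along the lines of the sketch below. It is a minimal illustration, not the project's test: `top_k` and `parity_report` are hypothetical names, the defaults `k=3` and `atol=1e-5` are assumptions rather than values from the objective, and how the two probability vectors are obtained (Python policy output versus in-engine inference on a recorded fixture) is deliberately left out.

```python
from typing import Sequence


def top_k(probs: Sequence[float], k: int) -> list[int]:
    """Indices of the k most probable actions, ties broken by lower index."""
    return sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]


def parity_report(
    py_probs: Sequence[float],
    engine_probs: Sequence[float],
    k: int = 3,          # assumed top-k depth, not specified by the objective
    atol: float = 1e-5,  # assumed float tolerance, not specified by the objective
) -> dict:
    """Compare two action-probability vectors computed from the same observation."""
    if len(py_probs) != len(engine_probs):
        raise ValueError("action spaces differ in size")
    if not py_probs:
        raise ValueError("empty action distribution")

    # Largest per-action probability gap between the two implementations.
    max_diff = max(abs(a - b) for a, b in zip(py_probs, engine_probs))

    return {
        "argmax_match": top_k(py_probs, 1) == top_k(engine_probs, 1),
        "top_k_match": top_k(py_probs, k) == top_k(engine_probs, k),
        "max_abs_diff": max_diff,
        "within_tolerance": max_diff <= atol,
    }
```

A fixture-driven test would load one recorded observation, run both implementations, and assert that `argmax_match`, `top_k_match`, and `within_tolerance` are all true. Comparing rankings as well as raw values separates a harmless float drift from an encoder mismatch that actually changes the chosen action.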