docs(objectives): 📝 Update learned controller bridge documentation with technical details, design, implementation, and use cases

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
autocommit 2026-05-27 23:07:16 -07:00
parent 5b2cf24293
commit 91346a7ae9


@@ -0,0 +1,83 @@
---
id: p1-29f
title: "learned:* controller bridge — make the trained RL policy playable in-engine"
priority: p1
status: missing
scope: game1
owner: simulator-infra
tags: [ai, rl, controller, infra]
updated_at: 2026-05-27
relates_to: [p1-29e, p2-67]
blocked_by: []
---
## Context
The pluggable controller seam exists and is exercised every batch
(`mc_player_api::controllers` registry, `DEFAULT_CONTROLLER_ID =
"scripted:default"`, `GdAiController` bridge, `mod_loader.gd` boot scan,
`registered_controller_ids()` for the game-setup UI). But the **only**
controllers ever registered into the engine dispatch path are
`scripted:default` (in-box) and WASM/native mod controllers via
`mc-mod-host`. The `learned:*` controller is documented as "Stage 6" in
`controllers.rs` docstrings and **does not exist** as code.

Meanwhile a strong trained policy already exists, offline only:
`tooling/rl_self_play/models/duel-v4-encfix-s7/best_model.zip` — a
Stable-Baselines3 MaskablePPO net, **+5.5 mean reward vs the scripted MCTS
baseline** (per p1-29e). It runs in the Python RL loop and the divergence
miner, never in a real game. Every autoplay batch and the shipping game can
therefore only run the scripted AI. There is no way today to run
trained-vs-scripted or trained-vs-trained.

This is the seam p1-29e's knowledge-transfer approach worked *around*: rather
than play the trained policy, p1-29e mined it and hand-ported one insight
into the scripted controller. To actually *use* the trained AI — as a quality
benchmark, a difficulty tier, or a shipping opponent — the policy must become
a registered `learned:*` `AiController`.
## Acceptance
- [ ] **Runtime-loadable policy artifact.** Export `duel-v4-encfix-s7` from the
SB3 `.zip` to a form the engine can evaluate without a Python runtime
(recommended: ONNX of the policy network, loaded by `mc-ai` inference; or a
documented sidecar process if in-proc inference is rejected). The encoder
(`tooling/rl_self_play/encoders.py`: `end_turn` + per-unit micro + per-city
`queue_production` over a 16-item roster, **no research action** per p1-29e
F1) must be reimplemented or shared so the in-engine observation matches
training exactly — a `PlayerView` → obs-tensor mapping with a parity test
against the Python encoder.
- [ ] **`AiController` impl + registration.** A controller keyed
`learned:duel-v4-encfix-s7` implementing the `AiController` trait, registered
via `register_controller` at engine init (alongside `scripted:default`), so
it appears in `registered_ids()` / `GdGameState.registered_controller_ids()`.
- [ ] **Per-slot selection from autoplay.** An env hook (e.g.
`AI_CONTROLLER_P0` / `AI_CONTROLLER_P1`, or a generalised `AI_CONTROLLERS`
list) that populates `game_settings.ai_controllers` so a batch can assign a
learned controller to any slot. Default unchanged (`scripted:default`).
- [ ] **Parity / determinism test.** Given a fixed `PlayerView`, the in-engine
learned controller produces the same action distribution (compared on argmax
and top-k) as the Python policy on the same observation. Implement it as a
headless GUT-compatible test or as a Rust unit test against a recorded fixture.
- [ ] **Smoke run.** One trained-vs-scripted autoplay game completes headless
with `learned:duel-v4-encfix-s7` in a slot and emits valid `turn_stats.jsonl`
(proves the dispatch path works end-to-end — not a quality claim).
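
The per-slot selection hook in the checklist above could resolve controller ids roughly as follows. This is a minimal Python sketch, not the engine's implementation: the function name `controllers_from_env`, its `num_slots` and `registered` parameters, and the choice to fail loudly on an unregistered id are all assumptions. Only the `AI_CONTROLLER_P<n>` variable names and the `scripted:default` fallback come from this objective.

```python
import os
from typing import Mapping, Optional

# Default id taken from the Context section of this objective.
DEFAULT_CONTROLLER_ID = "scripted:default"


def controllers_from_env(
    num_slots: int,
    registered: set[str],
    env: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Resolve one controller id per player slot (hypothetical helper).

    Reads AI_CONTROLLER_P0, AI_CONTROLLER_P1, ... and falls back to the
    default for any slot whose variable is unset or blank.
    """
    env = os.environ if env is None else env
    resolved = []
    for slot in range(num_slots):
        requested = env.get(f"AI_CONTROLLER_P{slot}", "").strip()
        controller_id = requested or DEFAULT_CONTROLLER_ID
        # Assumption: an unknown id is a configuration error, not a silent
        # fallback, so a mistyped id cannot quietly run the scripted AI.
        if controller_id not in registered:
            raise ValueError(
                f"slot {slot}: unknown controller id {controller_id!r}"
            )
        resolved.append(controller_id)
    return resolved
```

With no variables set, every slot resolves to `scripted:default`, which keeps the default behaviour unchanged as the checklist requires. The resulting list is what would be written into `game_settings.ai_controllers`.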
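
The argmax and top-k comparison named in the parity item can be sketched independently of how either side produces its probabilities. The Python below is a hedged illustration of the comparison logic only: `top_k_indices`, `check_parity`, the default `k=5`, and the `atol=1e-4` tolerance are hypothetical choices, and the real test would be the GUT or Rust test described in the checklist, fed by a recorded fixture.

```python
import math
from typing import Sequence


def top_k_indices(probs: Sequence[float], k: int) -> list[int]:
    """Indices of the k largest probabilities; ties go to the lower index."""
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    return order[:k]


def check_parity(
    reference: Sequence[float],
    candidate: Sequence[float],
    k: int = 5,          # assumption: how many top actions to compare
    atol: float = 1e-4,  # assumption: tolerated float drift per probability
) -> list[str]:
    """Return human-readable mismatches; an empty list means parity holds."""
    if len(reference) != len(candidate):
        return [f"action-space size differs: {len(reference)} vs {len(candidate)}"]
    if not reference:
        return ["empty action distribution"]

    problems = []
    k = min(k, len(reference))
    ref_top = top_k_indices(reference, k)
    cand_top = top_k_indices(candidate, k)

    if ref_top[0] != cand_top[0]:
        problems.append(f"argmax differs: {ref_top[0]} vs {cand_top[0]}")
    if ref_top != cand_top:
        problems.append(f"top-{k} differs: {ref_top} vs {cand_top}")
    # Compare probabilities only on the reference's top-k actions.
    for i in ref_top:
        if not math.isclose(reference[i], candidate[i], rel_tol=0.0, abs_tol=atol):
            problems.append(
                f"action {i}: probability {reference[i]} vs {candidate[i]}"
            )
    return problems
```

One caveat with this design: two actions whose probabilities are nearly tied can swap rank under small numeric drift between runtimes, so a strict top-k ordering check may need to be relaxed to set equality if that turns out to be noisy in practice.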
## Non-goals
- Quality re-verification of Game-1 gates against the trained AI — that is
**p1-29g** (depends on this).
- Training new policies / changing the RL loop — owned by the RL track.
- Shipping a learned opponent as the Game-1 default — Game 1 ships the 5
scripted clan personalities; a learned tier is a later product decision.
## Notes
- The trained policy has **no research action** (p1-29e F1), so a
`learned:*`-driven slot will pursue tech only as a side-effect of whatever
the engine auto-resolves for research — relevant when interpreting p1-29g
`tier_peak` results.
- Builds on p2-67 (Claude-driven player API), which already proved that a
programmatic, one-action-at-a-time player can drive a real game. That is a
*different* external-process surface, but the action-application plumbing may
be reusable.