docs(objectives): 📝 Update learned controller bridge documentation with technical details, design, implementation, and use cases
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
parent 5b2cf24293
commit 91346a7ae9
1 changed file with 83 additions and 0 deletions

.project/objectives/p1-29f-learned-controller-bridge.md
@@ -0,0 +1,83 @@
---
id: p1-29f
title: "learned:* controller bridge — make the trained RL policy playable in-engine"
priority: p1
status: missing
scope: game1
owner: simulator-infra
tags: [ai, rl, controller, infra]
updated_at: 2026-05-27
relates_to: [p1-29e, p2-67]
blocked_by: []
---

## Context

The pluggable controller seam exists and is exercised every batch
(`mc_player_api::controllers` registry, `DEFAULT_CONTROLLER_ID =
"scripted:default"`, `GdAiController` bridge, `mod_loader.gd` boot scan,
`registered_controller_ids()` for the game-setup UI). But the **only**
controllers ever registered into the engine dispatch path are
`scripted:default` (in-box) and WASM/native mod controllers via
`mc-mod-host`. The `learned:*` controller is documented as "Stage 6" in
`controllers.rs` docstrings and **does not exist** as code.

Meanwhile a strong trained policy already exists, offline only:
`tooling/rl_self_play/models/duel-v4-encfix-s7/best_model.zip` — a
Stable-Baselines3 MaskablePPO net, **+5.5 mean reward vs the scripted MCTS
baseline** (per p1-29e). It runs in the Python RL loop and the divergence
miner, never in a real game. Every autoplay batch and the shipping game can
therefore only run the scripted AI. There is no way today to run
trained-vs-scripted or trained-vs-trained.

This is the seam p1-29e's knowledge-transfer approach worked *around*: rather
than play the trained policy, p1-29e mined it and hand-ported one insight
into the scripted controller. To actually *use* the trained AI — as a quality
benchmark, a difficulty tier, or a shipping opponent — the policy must become
a registered `learned:*` `AiController`.
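
For orientation, the registry seam described in this section can be modelled in a few lines. The sketch below is Python purely for illustration: the real registry is the Rust `mc_player_api::controllers` module, and every signature here (the `choose_action` method, the factory argument to `register_controller`, the `resolve` helper, the `LearnedStub` class) is an assumption, not the engine's actual API.

```python
from typing import Callable, Dict, List, Optional, Protocol

DEFAULT_CONTROLLER_ID = "scripted:default"  # id named in this objective


class AiController(Protocol):
    """Hypothetical stand-in for the Rust `AiController` trait."""

    def choose_action(self, player_view: dict) -> str: ...


class ScriptedDefault:
    """Placeholder for the in-box scripted controller."""

    def choose_action(self, player_view: dict) -> str:
        return "end_turn"


class LearnedStub:
    """Placeholder for a learned policy; a real one would run inference."""

    def choose_action(self, player_view: dict) -> str:
        return "end_turn"


# Maps a controller id to a factory that builds a fresh controller.
_REGISTRY: Dict[str, Callable[[], AiController]] = {}


def register_controller(controller_id: str, factory: Callable[[], AiController]) -> None:
    """Register a factory under an id; duplicate ids are rejected."""
    if controller_id in _REGISTRY:
        raise ValueError(f"controller already registered: {controller_id}")
    _REGISTRY[controller_id] = factory


def registered_ids() -> List[str]:
    """Ids a game-setup UI could offer, in sorted order."""
    return sorted(_REGISTRY)


def resolve(controller_id: Optional[str]) -> AiController:
    """Build the controller for a slot; an unset id falls back to the default."""
    wanted = controller_id or DEFAULT_CONTROLLER_ID
    if wanted not in _REGISTRY:
        raise KeyError(f"unknown controller id: {wanted}")
    return _REGISTRY[wanted]()


# Registration at "engine init": the learned id sits alongside the default.
register_controller(DEFAULT_CONTROLLER_ID, ScriptedDefault)
register_controller("learned:duel-v4-encfix-s7", LearnedStub)
```

In this model an unknown id raises instead of silently falling back, so a mistyped `learned:*` id in a batch config would surface immediately. Whether the engine should behave that way is a design choice this objective leaves open.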

## Acceptance

- [ ] **Runtime-loadable policy artifact.** Export `duel-v4-encfix-s7` from the
SB3 `.zip` to a form the engine can evaluate without a Python runtime
(recommended: ONNX of the policy network, loaded by `mc-ai` inference; or a
documented sidecar process if in-proc inference is rejected). The encoder
(`tooling/rl_self_play/encoders.py`: `end_turn` + per-unit micro + per-city
`queue_production` over a 16-item roster, **no research action** per p1-29e
F1) must be reimplemented or shared so the in-engine observation matches
training exactly — a `PlayerView` → obs-tensor mapping with a parity test
against the Python encoder.
- [ ] **`AiController` impl + registration.** A controller keyed
`learned:duel-v4-encfix-s7` implementing the `AiController` trait, registered
via `register_controller` at engine init (alongside `scripted:default`), so
it appears in `registered_ids()` / `GdGameState.registered_controller_ids()`.
- [ ] **Per-slot selection from autoplay.** An env hook (e.g.
`AI_CONTROLLER_P0` / `AI_CONTROLLER_P1`, or a generalised `AI_CONTROLLERS`
list) that populates `game_settings.ai_controllers` so a batch can assign a
learned controller to any slot. Default unchanged (`scripted:default`).
- [ ] **Parity / determinism test.** Given a fixed `PlayerView`, the in-engine
learned controller produces the same action distribution (argmax + top-k) as
the Python policy on the same observation. Implement it as a headless
GUT-compatible test or a Rust unit test against a recorded fixture.
- [ ] **Smoke run.** One trained-vs-scripted autoplay game completes headless
with `learned:duel-v4-encfix-s7` in a slot and emits valid `turn_stats.jsonl`
(proves the dispatch path works end-to-end — not a quality claim).
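
The per-slot selection item above could be prototyped along these lines. This is a minimal sketch, not the engine's hook: it assumes `game_settings.ai_controllers` is a list indexed by player slot, that `AI_CONTROLLERS` is comma-separated, and that a per-slot variable overrides the list entry. None of those details are fixed by this objective, and `controllers_from_env` is a hypothetical name.

```python
import os
from typing import List, Mapping, Optional

DEFAULT_CONTROLLER_ID = "scripted:default"


def controllers_from_env(
    num_slots: int, env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Resolve one controller id per player slot from environment variables.

    Precedence per slot (an assumption of this sketch):
    AI_CONTROLLER_P<slot>, then the matching AI_CONTROLLERS entry,
    then the default id.
    """
    env = os.environ if env is None else env
    # Generalised form: "id_for_p0,id_for_p1,..."; blank entries mean "unset".
    listed = [item.strip() for item in env.get("AI_CONTROLLERS", "").split(",")]
    resolved = []
    for slot in range(num_slots):
        per_slot = env.get(f"AI_CONTROLLER_P{slot}", "").strip()
        from_list = listed[slot] if slot < len(listed) else ""
        resolved.append(per_slot or from_list or DEFAULT_CONTROLLER_ID)
    return resolved
```

With no variables set, every slot resolves to `scripted:default`, which keeps the default behaviour unchanged as the acceptance item requires. Passing `env` explicitly keeps the function testable without touching the process environment.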
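
The "argmax + top-k" comparison in the parity item can be made concrete as follows. This is a sketch under stated assumptions: both sides expose raw logits over the same action indexing, a boolean mask marks legal actions (as in MaskablePPO), and the values of `k` and `atol` are placeholders rather than agreed thresholds. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def masked_top_k(logits: Sequence[float], mask: Sequence[bool], k: int) -> List[int]:
    """Indices of the k highest-logit legal actions, best first.

    Ties are broken by the lower action index so the ordering is deterministic.
    """
    legal = [i for i, ok in enumerate(mask) if ok]
    legal.sort(key=lambda i: (-logits[i], i))
    return legal[:k]


def check_parity(
    engine_logits: Sequence[float],
    reference_logits: Sequence[float],
    mask: Sequence[bool],
    k: int = 3,
    atol: float = 1e-4,
) -> bool:
    """True when legal-action logits agree within atol and the top-k order matches."""
    if not (len(engine_logits) == len(reference_logits) == len(mask)):
        raise ValueError("logit and mask lengths differ")
    # Masked-out actions are ignored: their logits never influence the choice.
    values_close = all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=atol)
        for a, b, ok in zip(engine_logits, reference_logits, mask)
        if ok
    )
    same_ranking = masked_top_k(engine_logits, mask, k) == masked_top_k(
        reference_logits, mask, k
    )
    return values_close and same_ranking
```

The first entry of `masked_top_k` is the argmax, so a single ranking comparison covers both halves of the requirement. Near-tied logits can legitimately swap rank between runtimes, so a recorded fixture should probably avoid observations where the top entries differ by less than `atol`.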
## Non-goals

- Quality re-verification of Game-1 gates against the trained AI — that is
**p1-29g** (depends on this).
- Training new policies / changing the RL loop — owned by the RL track.
- Shipping a learned opponent as the Game-1 default — Game 1 ships the 5
scripted clan personalities; a learned tier is a later product decision.

## Notes

- The trained policy has **no research action** (p1-29e F1), so a
`learned:*`-driven slot will pursue tech only as a side-effect of whatever
the engine auto-resolves for research — relevant when interpreting p1-29g
`tier_peak` results.
- Builds on p2-67 (Claude-driven player API), which already proved that a
programmatic, one-action-at-a-time player can drive a real game; that is a
*different* external-process surface, but the action-application plumbing may
be reusable.