**The engine is not the bottleneck.** `PlayerView` already exposes every
piece of state in the left column (`TileView` carries biome / substrate /
river / improvement / visible / explored; `CityView.buildable[]` carries
the full catalog; `ResearchView` carries the whole tech tree; per-opponent
`DiplomacyView` is on the wire). The encoder is the bottleneck.
This matrix drives the 5-stage roadmap in
[`ai-roadmap.md`](./ai-roadmap.md).
---
## AlphaZero-readiness audit (2026-05-18)
The codebase is already structured for an AlphaZero-grade learned AI; the
hooks exist but nothing is plugged into them.
| Hook | Location | Status |
|---|---|---|
| PUCT tree MCTS with action priors + value-head rollout | `mc-ai/src/mcts_tree.rs:62-249` | ✓ ready; both `TreeState::action_prior()` and `TreeState::rollout()` are overridable |
| Per-tile spatial state (CNN-ready) | `mc-player-api/src/view.rs:191-212` | ✓ all channels present in `TileView` |
| Controller registry trait | `mc-player-api/src/controllers.rs:58-150` | ✓ a future `AlphaZeroController` plugs in alongside scripted/learned |
| 28 evaluator weights as auxiliary loss targets | `mc-core/src/scoring_weights.rs:174` | ✓ N=28 scalar fields, ready-made supervision signal |
| Fog-of-war + visibility filter | `mc-observation/src/fog.rs` + `mc-vision::compute_vision()` | ✓ wired into projection; policy never sees data it shouldn't |
---
## How a learned policy actually works
**Not seed search.** The seed sets the RNG for weight initialization +
environment rollout order. Different seeds produce different local optima
of the **same** learning process; we run multiple seeds because PPO is
high-variance.
**Weight optimization via gradient descent.** Concretely:
1. **Policy network.** A small MLP (~2 hidden layers, ~64 units) maps
`observation (32 floats) → action distribution (322 logits)`. Weights
start random.
2. **Rollout.** Policy samples action `a` from current state `s`;
environment returns reward `r` and next state `s'`. Collect ~512 such
transitions in a buffer.
3. **Advantage.** A critic network predicts expected return per state.
Advantage `A(s, a) = actual_return − critic_prediction`. Positive
advantage = action was better than baseline; negative = worse.
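4. **PPO update.** Gradient-ascend the policy weights to make
positive-advantage actions more probable and negative-advantage ones less
probable. The objective is clipped so that a single update gains nothing
from moving an action's probability ratio (new over old) more than about
20% away from 1 (the "proximal" in PPO).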
5. **Repeat** for 250k–1M environment steps. Weights drift from random to
"actions that win games."
Three parallel seeds = three independent fits. We ship the best by
tournament win-rate; the others are discarded.
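
Steps 3 and 4 above can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code: the function names, `gamma`, and `clip_eps=0.2` are assumptions chosen to match the description, and production PPO libraries typically estimate advantages with GAE rather than the raw return-minus-critic difference shown here.

```python
import math

def discounted_returns(rewards, gamma=0.99):
    """Return-to-go for one episode: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(returns, critic_values):
    """A(s, a) = actual_return - critic_prediction, per transition."""
    return [g - v for g, v in zip(returns, critic_values)]

def ppo_clip_objective(new_logp, old_logp, adv, clip_eps=0.2):
    """Mean clipped surrogate objective (the quantity PPO maximizes)."""
    total = 0.0
    for lp_new, lp_old, a in zip(new_logp, old_logp, adv):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any gain from pushing the ratio
        # past the clip range in the direction the advantage favors.
        total += min(ratio * a, clipped_ratio * a)
    return total / len(adv)
```

With a positive advantage, raising an action's probability by 50% earns no more objective than raising it by 20%, which is what keeps each update "proximal".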
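**Action masking.** MaskablePPO applies a legal-action mask to the action
logits before sampling: illegal actions have their logits set to a large
negative value, so their probability is effectively zero and the policy can
never propose an illegal action. The mask comes from
`encode_legal_actions()` in `tooling/rl_self_play/encoders.py`.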
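
One common way to apply a legal-action mask is sketched below: illegal logits are replaced with negative infinity before the softmax, so they receive exactly zero probability. The helper names are hypothetical and the mask is a plain list of booleans; the actual output format of `encode_legal_actions()` is not shown in this document.

```python
import math
import random

NEG_INF = float("-inf")

def masked_probs(logits, legal_mask):
    """Softmax over logits with illegal actions forced to probability 0."""
    masked = [l if ok else NEG_INF for l, ok in zip(logits, legal_mask)]
    peak = max(masked)
    if peak == NEG_INF:
        raise ValueError("no legal action available")
    # Subtract the peak for numerical stability; exp(-inf) evaluates to 0.0.
    exps = [math.exp(l - peak) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, legal_mask, rng=random):
    """Sample an action index from the masked distribution."""
    probs = masked_probs(logits, legal_mask)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Masking in logit space matters: scaling an illegal logit to zero would still leave that action with nonzero probability after the softmax.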
---
## Controller families
### `scripted:*`
| Controller ID | Use |
|---|---|
| `scripted:default` | The general-purpose MCTS+heuristic AI; default for unknown ids. |
concat → action head + value head. ~5M params, WASM-shippable via `tract`.
Companion changes:
- **Dynamic action space**: load `CITY_QUEUE_ITEMS` from `public/games/age-of-dwarves/data/buildings.json` + `units.json` at training start. Removes the 16-item hardcoding.
- **Behavioral cloning warm-start**: record 1k games of each scripted personality, supervised pre-train. Cold-start to ~50% baseline policy in ~30 min.
- **Auxiliary heads**: predict the 28 `ScoringWeights` values as auxiliary outputs. Free supervision signal.
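
The auxiliary-head idea can be illustrated as one extra term in the training loss. This is a sketch under assumptions: the coefficients `value_coef` and `aux_coef` are placeholders rather than tuned values, and mean squared error is one reasonable choice for regressing the 28 `ScoringWeights` targets, not a confirmed design decision.

```python
N_AUX = 28  # number of ScoringWeights scalars used as auxiliary targets

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("prediction and target lengths differ")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def total_loss(policy_loss, value_loss, aux_pred, aux_target,
               value_coef=0.5, aux_coef=0.1):
    """Policy loss plus weighted value loss plus weighted auxiliary loss."""
    return (policy_loss
            + value_coef * value_loss
            + aux_coef * mse(aux_pred, aux_target))
```

Because the auxiliary term shares the network trunk with the policy and value heads, its gradients shape the shared features even when the game reward is sparse.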
### Stage 6.7 (v1.3) — AlphaZero search at inference
The single highest-leverage change. Engine hooks already exist (audit above).
- Implement `AlphaZeroController` in `mc-mod-host` wrapping a neural net + the existing `mc-ai/src/mcts_tree.rs` PUCT search.
- Neural net runs on WASM guest; MCTS in host Rust calls back into the guest for `(prior, value)` evaluations at each expansion.
- 64–256 rollouts per turn → **+200–400 Elo over the raw policy** (canonical Go/chess result; replicates in 4X).
- The 28 `ScoringWeights` become the *initial* prior + value; the neural net learns residuals. Even an undertrained net plays at scripted strength immediately.
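
For orientation, the PUCT selection rule and the residual-value idea can be sketched as follows. This is illustrative Python, not a mirror of `mc-ai/src/mcts_tree.rs`: the dictionary layout, `c_puct=1.5`, and the `tanh` squashing in `blended_value` are all assumptions made for the example.

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """AlphaZero-style PUCT: exploitation term plus prior-weighted exploration."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

def select_child(children, c_puct=1.5):
    """Return the index of the child with the highest PUCT score.

    Each child is a dict with keys "q", "prior", and "visits".
    """
    parent_visits = sum(c["visits"] for c in children)
    return max(
        range(len(children)),
        key=lambda i: puct_score(
            children[i]["q"], children[i]["prior"],
            parent_visits, children[i]["visits"], c_puct,
        ),
    )

def blended_value(heuristic_value, net_residual):
    """Heuristic baseline plus a learned residual, squashed into [-1, 1]."""
    return math.tanh(heuristic_value + net_residual)
```

With a residual of zero, `blended_value` reduces to the squashed heuristic, which is one way an undertrained net could fall back to scripted-strength evaluations.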
### Stage 6.8 (v1.3) — Multi-step movement & strategic actions
- Expand per-unit action vocabulary beyond the 12 single-hex moves/attacks:
  - `move_to(target_hex)`: A* path planned by the simulator, executed over multiple turns.
  - `rally(target_hex)`: set a city/production-building rally point.
  - `patrol(waypoints)`: repeat-cycle scouting.
  - `escort(unit_id)`: move with a friendly unit.
- Partial support already exists: the `TacticalUnit.patrol_order` field and the gdext `set_rally` request. The remaining work is plumbing that surfaces them in `legal_actions` + `encoders.py`.