magicciv/docs/ai-production.md

# AI Production Guide — Magic Civilization

How the game ships AI, how learned policies are trained, how to add a new
specialist, and how difficulty levels are constructed. This is the
designer/engineering reference. The community-facing modder contract is
`docs/modding/ai-controller.md`.

---

## TL;DR

- Two controller families, both selectable per slot:
  - **`scripted:*`** — MCTS + heuristic AI from the `mc-ai` crate.
    Transparent, hand-tunable, fast. Anchors the named clan personalities.
  - **`learned:*`** — neural policy trained with MaskablePPO. Strong,
    opaque. Anchors high-difficulty tiers and tournament play.
- Difficulty is **orthogonal to controller choice** — handicaps + policy
  temperature stacked on top of either family.
- Specialization (rush / turtle / tech / economy) is via **different reward
  functions on the same architecture**, each a separate `best_model.zip`
  shipped as its own controller.
- Strong-AI ceiling is raised by **AlphaZero search at inference** + the
  **12-FFA self-play league** (Stage 6.7 + 6.9, post-launch).

---

## Coverage matrix — what each AI actually knows

This is the load-bearing diagnostic. The current `learned:duel-v1b` ships
with a 32-float hand-rolled observation vector that throws away most of
the engine's state. The scripted AI reads everything via `TacticalState` +
28 `ScoringWeights`.

| Concept | Scripted (`mc-ai` + 28 weights) | Learned (`encoders.py`) |
|---|---|---|
| Terrain biome / substrate / yields per tile | ✓ `TacticalTile` (state.rs:76-90) | ✗ encoder discards all tile data (encoders.py:18) |
| Flora / fauna entities | ✗ (not on wire) | ✗ |
| Production buildings | ✓ full data-driven catalog | ✗ **hardcoded 16-item list** (`CITY_QUEUE_ITEMS`) |
| Research / tech tree | ✓ via gate prerequisites | ✗ only `science_per_turn` |
| Strategic resource gating | ✓ `strategic_resources` | ✗ never reads stockpile |
| Rally / patrol / scout actions | △ stored, not actively issued | ✗ not in action space |
| Diplomacy detail + personality | ✓ 6 strategic axes per opponent | ✗ only war/peace/borders counts |
| Tiles worked per city | ✓ `tiles_worked` | ✗ |
| **Multi-step pathfinding at decision time** | ✗ 1-action lookahead | ✗ 1-action lookahead |

**The engine is not the bottleneck.** `PlayerView` already exposes every
piece of state in the left column (`TileView` carries biome / substrate /
river / improvement / visible / explored; `CityView.buildable[]` carries
the full catalog; `ResearchView` carries the whole tech tree; per-opponent
`DiplomacyView` is on the wire). The encoder is the bottleneck.

This matrix drives the 5-stage roadmap in
[`ai-roadmap.md`](./ai-roadmap.md).

---

## AlphaZero-readiness audit (2026-05-18)

The codebase is already structured for an AlphaZero-grade learned AI; the
hooks exist but nothing is plugged into them.

| Hook | Location | Status |
|---|---|---|
| PUCT tree MCTS with action priors + value-head rollout | `mc-ai/src/mcts_tree.rs:62-249` | ✓ ready; both `TreeState::action_prior()` and `TreeState::rollout()` are overridable |
| Per-tile spatial state (CNN-ready) | `mc-player-api/src/view.rs:191-212` | ✓ all channels present in `TileView` |
| Controller registry trait | `mc-player-api/src/controllers.rs:58-150` | ✓ a future `AlphaZeroController` plugs in alongside scripted/learned |
| 28 evaluator weights as auxiliary loss targets | `mc-core/src/scoring_weights.rs:174` | ✓ N=28 scalar fields, ready-made supervision signal |
| Fog-of-war + visibility filter | `mc-observation/src/fog.rs` + `mc-vision::compute_vision()` | ✓ wired into projection; policy never sees data it shouldn't |

---

## How a learned policy actually works

**Not seed search.** The seed sets the RNG for weight initialization +
environment rollout order. Different seeds produce different local optima
of the **same** learning process; we run multiple seeds because PPO is
high-variance.

**Weight optimization via gradient descent.** Concretely:

1. **Policy network.** A small MLP (~2 hidden layers, ~64 units) maps
   `observation (32 floats) → action distribution (322 logits)`. Weights
   start random.
2. **Rollout.** Policy samples action `a` from current state `s`;
   environment returns reward `r` and next state `s'`. Collect ~512 such
   transitions in a buffer.
3. **Advantage.** A critic network predicts expected return per state.
   Advantage `A(s, a) = actual_return − critic_prediction`. Positive
   advantage = action was better than baseline; negative = worse.
4. **PPO update.** Gradient-ascend the policy weights to make
   positive-advantage actions more probable and negative ones less. The
   update is clipped so the new-to-old probability ratio stays within
   roughly ±20% per update (the "proximal" in PPO).
5. **Repeat** for 250k–1M environment steps. Weights drift from random to
   "actions that win games."

Three parallel seeds = three independent fits. We ship the best by
tournament win-rate; the others are discarded.
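Steps 3 and 4 can be sketched for a single transition as below. This is a minimal illustration, not the code in `train.py`: the function names are hypothetical, the `clip_range=0.2` default is assumed from the "20%" figure above, and a real trainer averages this objective over a whole batch.

```python
import math


def advantage(actual_return: float, critic_prediction: float) -> float:
    """Step 3: how much better the outcome was than the critic expected."""
    return actual_return - critic_prediction


def clipped_surrogate(logp_new: float, logp_old: float, adv: float,
                      clip_range: float = 0.2) -> float:
    """Step 4: PPO clipped objective for one transition (to be maximized)."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the min is pessimistic: there is no extra payoff for pushing
    # the ratio further than the clip range in the "good" direction.
    return min(ratio * adv, clipped * adv)
```

With a positive advantage and a ratio of 1.5, the objective is capped at `1.2 * adv`, so the gradient stops rewarding further movement in that direction.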

**Action masking.** MaskablePPO replaces the logits of illegal actions
with a large negative value before sampling, so their probability is
effectively zero and the policy can never propose an illegal action. The
mask comes from `encode_legal_actions()` in
`tooling/rl_self_play/encoders.py`.
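A common way to implement logit masking is sketched below in plain Python. The helper name and the `HUGE_NEG` constant are illustrative assumptions, not identifiers from `encoders.py` or sb3-contrib, and the sketch assumes at least one action is legal.

```python
import math

HUGE_NEG = -1e8  # stand-in for minus infinity (illustrative value)


def masked_probs(logits, legal_mask):
    """Softmax over logits after pushing illegal actions to HUGE_NEG."""
    masked = [l if ok else HUGE_NEG for l, ok in zip(logits, legal_mask)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


probs = masked_probs([2.0, 1.0, 0.5], [True, False, True])
# The illegal middle action ends up with probability 0.0; the remaining
# probability mass is renormalized over the two legal actions.
```

Note that zeroing a logit is not enough: a logit of 0 still gets positive probability after the softmax, which is why the illegal entries are pushed far negative instead.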

---

## Controller families

### `scripted:*`

| Controller ID | Use |
|---|---|
| `scripted:default` | The general-purpose MCTS+heuristic AI; default for unknown ids. |
| `scripted:warmonger` | Personality: war-weight 2.0, expansion-weight 1.5. |
| `scripted:builder` | Personality: economy-weight 2.0, war-weight 0.5. |
| `scripted:tinkersmith` | Personality: tech-weight 2.5, military-weight 1.0. |
| `scripted:peaceful` | Personality: war-weight 0.3, diplomacy-weight 2.0. |
| `scripted:opportunist` | Personality: dynamic re-weighting from situation. |

Personalities are pure data in `public/games/age-of-dwarves/data/ai_personalities.json`.
Adding a new one is a JSON edit; no Rust changes.

### `learned:*`

| Controller ID | Use |
|---|---|
| `learned:duel-v1b` | First in-box learned mod (Stage 6). Trained on duel maps vs scripted baseline; generalist. |
| `learned:rush` *(Stage 6.5)* | Reward-shaped for early military pressure. |
| `learned:turtle` *(6.5)* | Reward-shaped for defensive consolidation. |
| `learned:tech` *(6.5)* | Reward-shaped for research throughput. |
| `learned:economy` *(6.5)* | Reward-shaped for gold + city count. |
| `learned:league-genN` *(6.9)* | Self-play league generations. |

Each `learned:*` ships as its own `.wasm` mod (~400 KB after ONNX → tract
compile). Native `.so/.dll/.dylib` variants ship signed for users opting
into the faster path.

---

## Specialization via reward shaping

Same network architecture (`encoders.py` + 2-layer MLP). Different reward
function trained with the same `train.py` loop. Each variant produces a
separate `best_model.zip` registered as a distinct controller.

**Baseline reward** (current `magic_civ_env.py`):
```
+1.0   on win  (game_over event, winner == me)
-1.0   on loss (game_over event, winner != me)
+1e-2  per turn advance
+1e-3  per score_estimate delta
-5e-4  per step (anti-stalling)
```
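As a reading aid, the baseline terms can be written as one function. This is a sketch under assumptions: the argument names are hypothetical and the real `magic_civ_env.py` may structure the computation differently; only the coefficients come from the table above.

```python
def baseline_reward(*, won: bool = False, lost: bool = False,
                    turns_advanced: int = 0, score_delta: float = 0.0,
                    steps: int = 1) -> float:
    """Sum the baseline reward terms for one environment step."""
    reward = 0.0
    if won:
        reward += 1.0                # terminal win
    if lost:
        reward -= 1.0                # terminal loss
    reward += 1e-2 * turns_advanced  # per turn advance
    reward += 1e-3 * score_delta     # per score_estimate delta
    reward -= 5e-4 * steps           # anti-stalling penalty per step
    return reward
```

For example, a winning step that also advances one turn and gains 10 score yields `1.0 + 0.01 + 0.01 - 0.0005 = 1.0195`.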

**Specialist overlays** (added on top of baseline):

| Variant | Extra reward terms |
|---|---|
| `rush` | `+0.5` per enemy unit killed before turn 80; `-1e-2` per turn after turn 80 |
| `turtle` | `+0.05` per friendly unit fortified-on-defense-tile; `+0.1` per wall built |
| `tech` | `+5e-3` per `science_per_turn` delta; `+0.3` per tech unlocked |
| `economy` | `+1e-3` per gold-reserve delta; `+0.5` per city founded |

Tuning rule: extras must sum to less than the terminal `±1.0` across a
typical game, otherwise the policy learns the shaping signal instead of
winning. Validate per-variant: `evaluate.py` must show win-rate ≥ baseline
when the specialist is used, not just "the specialist's shaping signal is
higher."
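The tuning rule can be checked mechanically before a training run. The sketch below is illustrative only: the helper names and the per-game event counts are made-up assumptions, not measurements; only the `tech` coefficients come from the overlay table.

```python
# Overlay weights for the `tech` variant, taken from the table above.
TECH_OVERLAY = {"science_per_turn_delta": 5e-3, "tech_unlocked": 0.3}


def shaping_total(weights, expected_counts):
    """Total shaping reward for one game, given expected event counts."""
    return sum(w * expected_counts.get(name, 0) for name, w in weights.items())


def within_budget(total, terminal=1.0):
    """True when shaping stays below the terminal win/loss magnitude."""
    return abs(total) < terminal


# Hypothetical "typical game" counts, for illustration only.
modest = shaping_total(TECH_OVERLAY, {"science_per_turn_delta": 40, "tech_unlocked": 2})
heavy = shaping_total(TECH_OVERLAY, {"science_per_turn_delta": 40, "tech_unlocked": 10})
print(within_budget(modest), within_budget(heavy))
```

With these made-up counts the first game stays under the budget (about 0.8) and the second does not (about 3.2), which is the situation where the policy would start chasing tech unlocks instead of wins.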

Adding a new specialist:
1. Add an entry to `magic_civ_env.py::RewardOverlay` enum + the shaping
   logic in `step()`.
2. Run `train.py --reward-overlay <name> --total-steps 250000 --seed 7`.
3. Evaluate vs `scripted:default` and vs `learned:duel-v1b`.
4. If win-rate ≥ 0.55 against both, ship as `learned:<name>`.

---

## Difficulty system

Difficulty is **never** "a weaker neural net." Two orthogonal levers:

### 1. Resource handicaps

Per-difficulty multipliers in
`public/games/age-of-dwarves/data/difficulty.json` (schema TBD):
```json
{
  "id": "settler",
  "human_resource_mul": 1.0,
  "ai_resource_mul": 0.7,
  "ai_unit_xp_bonus": 0
}
```

Applied at city-yield + unit-creation time in `mc-economy`.
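The effect of a multiplier can be sketched as follows. The real logic lives in the Rust `mc-economy` crate and the schema is still TBD, so the function below is a hypothetical Python illustration that only reuses the field names from the JSON example.

```python
import json

# Same record as the JSON example above.
SETTLER = json.loads(
    '{"id": "settler", "human_resource_mul": 1.0,'
    ' "ai_resource_mul": 0.7, "ai_unit_xp_bonus": 0}'
)


def scaled_yield(base_yield: float, difficulty: dict, is_ai: bool) -> float:
    """Apply the per-difficulty resource multiplier to a city yield."""
    key = "ai_resource_mul" if is_ai else "human_resource_mul"
    return base_yield * difficulty[key]
```

On Settler, a city that would produce 10 of a resource produces about 7 for an AI player and the full 10 for a human.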

### 2. Policy temperature

For `learned:*` controllers, a `temperature: f32` field on the controller
config divides the logits before sampling:
```
softmax(logits / T)
```
- `T = 1.0` — base policy.
- `T > 1.0` — softer distribution, more random, easier.
- `T < 1.0` — sharper, near-greedy, harder.
- `T → 0` — argmax (deterministic).
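The four cases above can be reproduced with a few lines of Python. This is a sketch of the math only; the function name is hypothetical and the shipped implementation is planned for the Rust/WASM side.

```python
import math


def temperature_probs(logits, temperature):
    """softmax(logits / T), with T <= 0 treated as the argmax limit."""
    if temperature <= 0:
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
soft = temperature_probs(logits, 1.5)   # flatter than the base policy
base = temperature_probs(logits, 1.0)   # base policy
sharp = temperature_probs(logits, 0.3)  # close to greedy
```

The probability of the top action rises monotonically as `T` falls, which is what makes temperature usable as a difficulty dial.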

Implementation: ~10 LOC in `WasmAiController::decide_turn` (apply scaling
before the wasm guest samples, OR pass T through as a guest parameter and
let the guest apply it). Stage 6.5 work.

### Recommended Game 1 ladder

| Difficulty | Controller | T | Handicap |
|---|---|---|---|
| Settler | `scripted:peaceful` | n/a | AI ×0.7 |
| Chieftain | `scripted:default` | n/a | none |
| Warlord | `scripted:*` rotating | n/a | none |
| King | `learned:league-best` | 1.5 | none |
| Champion | `learned:league-best` | 0.3 | AI ×1.3 |

---

## Training infrastructure

### Hosts

- **Edit host (mac):** authoring; never trains.
- **Run host (apricot.lan):** 64-core Threadripper, 94 GB RAM, 2×3090.
  All training runs here.
- **Plum:** screenshot capture only; no training.

### Layout

```
tooling/rl_self_play/
├── train.py              # PPO loop, sb3-contrib MaskablePPO
├── evaluate.py           # Hard win-rate measurement
├── magic_civ_env.py      # Gymnasium wrapper + reward shaping
├── encoders.py           # PlayerView ↔ obs/action tensors
├── harness_client.py     # JSON-Lines subprocess to Godot headless
├── models/<run-name>/    # best_model.zip per training run
└── runs/<run-name>/      # tensorboard event files
```

`tooling/rl_self_play/models/` and `runs/` are gitignored (bulky; not
artifacts of the source repo).

### Single-game training (duel)

```bash
ssh apricot.lan
cd ~/Code/@projects/@magic-civilization
python -m tooling.rl_self_play.train \
  --run-name duel-v1b \
  --total-steps 250000 \
  --num-envs 16 \
  --seed 7 \
  --device cuda:1
```

`--num-envs N` runs N parallel headless Godot subprocesses;
Stable-Baselines3's `SubprocVecEnv` lock-steps them. Scaling is sub-linear
because env-step is I/O-bound on JSON-Lines, not GPU-bound (the policy net
is tiny). Past 16 envs per training run, returns diminish.

### Parallel seed runs

Three independent seeds in parallel claim ~3 × 16 = 48 worker subprocesses
on apricot. Memory headroom: each Godot headless ~600 MB, so ~30 GB total
— fits inside 94 GB with margin for the OS + other services.
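The headroom estimate is simple arithmetic and can be re-run when the seed count or env count changes. The helper name below is hypothetical; the 600 MB per-process figure is the one quoted above.

```python
def worker_memory_gb(seeds: int, envs_per_seed: int, mb_per_env: float = 600.0):
    """Return (worker count, estimated total memory in GB)."""
    workers = seeds * envs_per_seed
    return workers, workers * mb_per_env / 1000.0


workers, total_gb = worker_memory_gb(seeds=3, envs_per_seed=16)
print(workers, total_gb)
```

For three seeds at 16 envs each this gives 48 workers and about 28.8 GB, in line with the ~30 GB estimate.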

### 12-FFA self-play league (Stage 6.9)

```bash
python -m tooling.rl_self_play.train \
  --run-name league-gen1 \
  --map-type 12ffa-huge \
  --opponent-pool models/league/gen0/best_model.zip \
  --total-steps 1000000 \
  --num-envs 4 \
  --seed 7 \
  --device cuda:1
```

12-slot games are ~10× a duel in wall-clock per env, but the GPU is not
the bottleneck (the policy is a ~50k-param MLP). Verified on apricot
2026-05-18: 8 concurrent 12-FFA envs ≈ 5 GB RAM, ~12 cores, both GPUs at
< 5% utilization. 1M steps ≈ 3.5h per league generation.

---

## Save format & forward compatibility

Every save records the `controller_id` AND `controller_hash` per slot
(SaveEnvelope v2). Loading a save with a controller the current install
doesn't have yields a friendly error from
`save_manager.gd::_validate_controllers_after_load`, not a crash mid-game.

**Mod authors:** never reuse a `controller_id` across incompatible weight
versions. Bump the version suffix (`learned:duel-v1c`, not `learned:duel-v1b`)
or saves from the old binary will mis-attribute to the new one.
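The load-time check described above can be sketched as follows. The real validation is GDScript in `save_manager.gd`; this Python version is a hypothetical illustration, and the `slot` key and the shape of the `installed` mapping are assumptions rather than the SaveEnvelope v2 schema.

```python
def validate_controllers(save_slots, installed):
    """Return a list of problems; an empty list means the save can load.

    save_slots: dicts with `slot`, `controller_id`, `controller_hash`.
    installed:  mapping of controller_id -> hash of the installed build.
    """
    problems = []
    for slot in save_slots:
        cid, chash = slot["controller_id"], slot["controller_hash"]
        if cid not in installed:
            problems.append(f"slot {slot['slot']}: controller '{cid}' is not installed")
        elif installed[cid] != chash:
            problems.append(f"slot {slot['slot']}: controller '{cid}' has different weights")
    return problems
```

The hash comparison is what catches a reused `controller_id` with new weights, which is why the id must be bumped whenever the weights change incompatibly.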

---

## Ship-then-improve

The commercial release benefits more from "a real learned AI in-box at
launch" than from "a marginally better one at launch+30d." Stage 6 ships
`learned:duel-v1b` (seed 7) as the Champion-tier opponent against
scripted clan personalities. Stages 6.5–6.9 build the encoder rewrite,
recurrent policy, AlphaZero search, multi-step actions, and self-play
league as a post-launch content series. Each one slots into the existing
controller-registry infrastructure without engine changes.

See [`ai-roadmap.md`](./ai-roadmap.md) for the patch-by-patch narrative.

---

## 5-stage post-launch architecture roadmap

Engineering-side reference. Designer-facing narrative in
[`ai-roadmap.md`](./ai-roadmap.md). Plan file:
`~/.claude/plans/in-the-game-civilization-elegant-popcorn.md`.

### Stage 6.5 (v1.1) — Encoder rewrite + dynamic action space

Replace the 32-float hand-rolled observation with a multi-modal encoder:

- **Spatial block**: 60×60×K float tensor; channels {own_unit, enemy_unit, own_city, enemy_city, biome_id, substrate_id, river, improvement_id, fog, explored, resource_present, ...}. K ≈ 16.
- **Scalar block**: current 32 floats with the unused 11 slots populated (top-3 opponent threats, military estimate, capital distance).
- **Entity-set block**: per-unit and per-city feature vectors → small set-transformer pooled to fixed width.

Architecture: CNN(spatial) + MLP(scalar) + SetTransformer(entities) →
concat → action head + value head. ~5M params, WASM-shippable via `tract`.

Companion changes:
- **Dynamic action space**: load `CITY_QUEUE_ITEMS` from `public/games/age-of-dwarves/data/buildings.json` + `units.json` at training start. Removes the 16-item hardcoding.
- **Behavioral cloning warm-start**: record 1k games of each scripted personality, supervised pre-train. Cold-start to ~50% baseline policy in ~30 min.
- **Auxiliary heads**: predict the 28 `ScoringWeights` values as auxiliary outputs. Free supervision signal.

### Stage 6.6 (v1.2) — Recurrent policy + per-opponent memory

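- Switch to a recurrent, maskable PPO. sb3-contrib ships `RecurrentPPO` and `MaskablePPO` as separate classes, so a combined `RecurrentMaskablePPO` would need a custom or out-of-tree implementation. LSTM head (~128 hidden) between encoder and action head. Hidden state = session memory across turns.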
- Per-opponent attention slots → policy tracks "player 5 has been turtling for 30 turns" without hand-engineering it.
- tract supports LSTM ops; WASM binary ~2× current.

### Stage 6.7 (v1.3) — AlphaZero search at inference

The single highest-leverage change. Engine hooks already exist (audit above).

- Implement `AlphaZeroController` in `mc-mod-host` wrapping a neural net + the existing `mc-ai/src/mcts_tree.rs` PUCT search.
- Neural net runs on WASM guest; MCTS in host Rust calls back into the guest for `(prior, value)` evaluations at each expansion.
- 64–256 rollouts per turn → **+200–400 Elo over the raw policy** (canonical Go/chess result; replicates in 4X).
- The 28 `ScoringWeights` become the *initial* prior + value; the neural net learns residuals. Even an undertrained net plays at scripted strength immediately.
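For orientation, the selection rule that PUCT search typically uses is sketched below. This is the textbook AlphaZero form, not a transcription of `mcts_tree.rs`; the dictionary layout and the `c_puct=1.5` default are illustrative assumptions.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Mean value plus an exploration bonus scaled by the network prior."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)


def select_child(children, c_puct=1.5):
    """Pick the child with the highest PUCT score."""
    parent_visits = sum(c["visits"] for c in children)
    return max(
        children,
        key=lambda c: puct_score(c["q"], c["prior"], parent_visits, c["visits"], c_puct),
    )


children = [
    {"name": "a", "q": 0.5, "prior": 0.2, "visits": 10},
    {"name": "b", "q": 0.0, "prior": 0.8, "visits": 0},
]
first_pick = select_child(children)["name"]
```

Here the unvisited, high-prior child `b` is explored first; once its visit count grows without its value improving, the bonus decays and the search returns to `a`. The prior is where a policy head (or, initially, the `ScoringWeights`) steers the search.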

### Stage 6.8 (v1.3) — Multi-step movement & strategic actions

- Expand per-unit action vocabulary beyond the 12 single-hex moves/attacks:
  - `move_to(target_hex)` — A* path planned by the simulator, executed multi-turn.
  - `rally(target_hex)` — set city/production-building rally point.
  - `patrol(waypoints)` — repeat-cycle scouting.
  - `escort(unit_id)` — move with a friendly unit.
- These already partially exist: the `TacticalUnit.patrol_order` field and the gdext `set_rally` request. The remaining work is plumbing that surfaces them in `legal_actions` + `encoders.py`.
- Action space grows 322 → ~800; masking handles per-step legality.

### Stage 6.9 (v1.4) — 12-FFA self-play league + specialist roster

See "Specialization via reward shaping" and "Difficulty system" sections
above for the roster and ladder. League pipeline:

1. Freeze whatever 6.5–6.8 produces as `learned:league-gen0`.
2. Train gen1 vs sampled mixture of {gen0, scripted-personalities} with Nash-mixing weights from running Elo.
3. Freeze gen1; train gen2 vs {gen0, gen1, scripted}. Repeat.
4. Gen ≥ 5 → strong generalist. Round-robin tournament picks champion.
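Step 2 depends on a running Elo per pool member. The sketch below shows the standard Elo expected-score and update formulas plus a simple sampling rule. The sampling rule is a plain Elo-weighted stand-in, not a Nash solver, and every name and the `k=32` factor are assumptions rather than details of the planned pipeline.

```python
import random


def expected_score(elo_a: float, elo_b: float) -> float:
    """Standard Elo expectation that A scores against B."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


def update_elo(elo_a: float, elo_b: float, score_a: float, k: float = 32.0) -> float:
    """New rating for A after scoring `score_a` (1 win, 0 loss) against B."""
    return elo_a + k * (score_a - expected_score(elo_a, elo_b))


def sample_opponent(pool: dict, learner_elo: float, rng: random.Random) -> str:
    """Sample an opponent, favouring those expected to beat the learner."""
    names = list(pool)
    weights = [expected_score(pool[n], learner_elo) for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


pool = {"gen0": 1500.0, "scripted:default": 1400.0, "scripted:warmonger": 1450.0}
opponent = sample_opponent(pool, learner_elo=1500.0, rng=random.Random(7))
```

Weighting by the opponent's expected score keeps training time on matchups the learner has not yet solved, while still sampling weaker pool members occasionally so earlier skills are not forgotten.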

Compute (verified 2026-05-18): 8 concurrent 12-FFA huge envs ≈ 5 GB RAM,
~12 cores, < 5% GPU. 1M steps ≈ 3.5h per generation, so gen0 → gen5 (five
training generations) takes roughly 17.5h of sequential training.

---

## Cross-references

- Modder contract: `docs/modding/ai-controller.md`
- ABI decisions: `docs/modding/abi-decisions.md`
- Plan file: `~/.claude/plans/in-the-game-civilization-elegant-popcorn.md`
- AiController trait: `src/simulator/crates/mc-player-api/src/controllers.rs`
- Reward shape: `tooling/rl_self_play/magic_civ_env.py`
- Observation encoder: `tooling/rl_self_play/encoders.py`