# AI Production Guide — Magic Civilization

How the game ships AI, how learned policies are trained, how to add a new specialist, and how difficulty levels are constructed. This is the designer/engineering reference. The community-facing modder contract is `docs/modding/ai-controller.md`.

---

## TL;DR

- Two controller families, both selectable per slot:
  - **`scripted:*`** — MCTS + heuristic AI from the `mc-ai` crate. Transparent, hand-tunable, fast. Anchors the named clan personalities.
  - **`learned:*`** — neural policy trained with MaskablePPO. Strong, opaque. Anchors high-difficulty tiers and tournament play.
- Difficulty is **orthogonal to controller choice** — handicaps + policy temperature stacked on top of either family.
- Specialization (rush / turtle / tech / economy) is via **different reward functions on the same architecture**, each a separate `best_model.zip` shipped as its own controller.
- Strong-AI ceiling is raised by **AlphaZero search at inference** + the **12-FFA self-play league** (Stage 6.7 + 6.9, post-launch).

---

## Coverage matrix — what each AI actually knows

This is the load-bearing diagnostic. The current `learned:duel-v1b` ships with a 32-float hand-rolled observation vector that throws away most of the engine's state. The scripted AI reads everything via `TacticalState` + 28 `ScoringWeights`.

| Concept | Scripted (`mc-ai` + 28 weights) | Learned (`encoders.py`) |
|---|---|---|
| Terrain biome / substrate / yields per tile | ✓ `TacticalTile` (state.rs:76-90) | ✗ encoder discards all tile data (encoders.py:18) |
| Flora / fauna entities | ✗ (not on wire) | ✗ |
| Production buildings | ✓ full data-driven catalog | ✗ **hardcoded 16-item list** (`CITY_QUEUE_ITEMS`) |
| Research / tech tree | ✓ via gate prerequisites | ✗ only `science_per_turn` |
| Strategic resource gating | ✓ `strategic_resources` | ✗ never reads stockpile |
| Rally / patrol / scout actions | △ stored, not actively issued | ✗ not in action space |
| Diplomacy detail + personality | ✓ 6 strategic axes per opponent | ✗ only war/peace/borders counts |
| Tiles worked per city | ✓ `tiles_worked` | ✗ |
| **Multi-step pathfinding at decision time** | ✗ 1-action lookahead | ✗ 1-action lookahead |

**The engine is not the bottleneck.** `PlayerView` already exposes every piece of state in the left column (`TileView` carries biome / substrate / river / improvement / visible / explored; `CityView.buildable[]` carries the full catalog; `ResearchView` carries the whole tech tree; per-opponent `DiplomacyView` is on the wire). The encoder is the bottleneck.

This matrix drives the 5-stage roadmap in [`ai-roadmap.md`](./ai-roadmap.md).

---

## AlphaZero-readiness audit (2026-05-18)

The codebase is already structured for an AlphaZero-grade learned AI; the hooks exist but nothing is plugged into them.

| Hook | Location | Status |
|---|---|---|
| PUCT tree MCTS with action priors + value-head rollout | `mc-ai/src/mcts_tree.rs:62-249` | ✓ ready; both `TreeState::action_prior()` and `TreeState::rollout()` are overridable |
| Per-tile spatial state (CNN-ready) | `mc-player-api/src/view.rs:191-212` | ✓ all channels present in `TileView` |
| Controller registry trait | `mc-player-api/src/controllers.rs:58-150` | ✓ a future `AlphaZeroController` plugs in alongside scripted/learned |
| 28 evaluator weights as auxiliary loss targets | `mc-core/src/scoring_weights.rs:174` | ✓ N=28 scalar fields, ready-made supervision signal |
| Fog-of-war + visibility filter | `mc-observation/src/fog.rs` + `mc-vision::compute_vision()` | ✓ wired into projection; policy never sees data it shouldn't |

---

## How a learned policy actually works

**Not seed search.** The seed sets the RNG for weight initialization + environment rollout order. Different seeds produce different local optima of the **same** learning process; we run multiple seeds because PPO is high-variance.

**Weight optimization via gradient descent.** Concretely:

1. **Policy network.** A small MLP (~2 hidden layers, ~64 units) maps `observation (32 floats) → action distribution (322 logits)`. Weights start random.
2. **Rollout.** Policy samples action `a` from current state `s`; environment returns reward `r` and next state `s'`. Collect ~512 such transitions in a buffer.
3. **Advantage.** A critic network predicts expected return per state. Advantage `A(s, a) = actual_return − critic_prediction`. Positive advantage = action was better than baseline; negative = worse.
4. **PPO update.** Gradient-ascend the policy weights to make positive-advantage actions more probable and negative ones less, with the new/old probability ratio clipped to within 20% of 1 so a single update gains nothing from moving further (the "proximal" in PPO).
5. **Repeat** for 250k–1M environment steps. Weights drift from random to "actions that win games."

Three parallel seeds = three independent fits.
We ship the best by tournament win-rate; the others are discarded.

**Action masking.** MaskablePPO sets the logits of illegal actions to a large negative value before sampling, so the policy can never propose an illegal action. The mask comes from `encode_legal_actions()` in `tooling/rl_self_play/encoders.py`.

---

## Controller families

### `scripted:*`

| Controller ID | Use |
|---|---|
| `scripted:default` | The general-purpose MCTS+heuristic AI; default for unknown ids. |
| `scripted:warmonger` | Personality: war-weight 2.0, expansion-weight 1.5. |
| `scripted:builder` | Personality: economy-weight 2.0, war-weight 0.5. |
| `scripted:tinkersmith` | Personality: tech-weight 2.5, military-weight 1.0. |
| `scripted:peaceful` | Personality: war-weight 0.3, diplomacy-weight 2.0. |
| `scripted:opportunist` | Personality: dynamic re-weighting from situation. |

Personalities are pure data in `public/games/age-of-dwarves/data/ai_personalities.json`. Adding a new one is a JSON edit; no Rust changes.

### `learned:*`

| Controller ID | Use |
|---|---|
| `learned:duel-v1b` | First in-box learned mod (Stage 6). Trained on duel maps vs scripted baseline; generalist. |
| `learned:rush` *(Stage 6.5)* | Reward-shaped for early military pressure. |
| `learned:turtle` *(6.5)* | Reward-shaped for defensive consolidation. |
| `learned:tech` *(6.5)* | Reward-shaped for research throughput. |
| `learned:economy` *(6.5)* | Reward-shaped for gold + city count. |
| `learned:league-genN` *(6.9)* | Self-play league generations. |

Each `learned:*` ships as its own `.wasm` mod (~400 KB after ONNX → tract compile). Native `.so/.dll/.dylib` variants ship signed for users opting into the faster path.

---

## Specialization via reward shaping

Same network architecture (`encoders.py` + 2-layer MLP). Different reward function trained with the same `train.py` loop. Each variant produces a separate `best_model.zip` registered as a distinct controller.

**Baseline reward** (current `magic_civ_env.py`):

```
+1.0   on win  (game_over event, winner == me)
-1.0   on loss (game_over event, winner != me)
+1e-2  per turn advance
+1e-3  per score_estimate delta
-5e-4  per step (anti-stalling)
```

**Specialist overlays** (added on top of baseline):

| Variant | Extra reward terms |
|---|---|
| `rush` | `+0.5` per enemy unit killed before turn 80; `-1e-2` per turn after turn 80 |
| `turtle` | `+0.05` per friendly unit fortified-on-defense-tile; `+0.1` per wall built |
| `tech` | `+5e-3` per `science_per_turn` delta; `+0.3` per tech unlocked |
| `economy` | `+1e-3` per gold-reserve delta; `+0.5` per city founded |

Tuning rule: extras must sum to less than the terminal `±1.0` across a typical game, otherwise the policy learns the shaping signal instead of winning. Validate per-variant: `evaluate.py` must show win-rate ≥ baseline when the specialist is used, not just "the specialist's shaping signal is higher."

Adding a new specialist:

1. Add an entry to `magic_civ_env.py::RewardOverlay` enum + the shaping logic in `step()`.
2. Run `train.py --reward-overlay --total-steps 250000 --seed 7`.
3. Evaluate vs `scripted:default` and vs `learned:duel-v1b`.
4. If win-rate ≥ 0.55 against both, ship as `learned:`.

---

## Difficulty system

Difficulty is **never** "a weaker neural net." Two orthogonal levers:

### 1. Resource handicaps

Per-difficulty multipliers in `public/games/age-of-dwarves/data/difficulty.json` (schema TBD):

```json
{
  "id": "settler",
  "human_resource_mul": 1.0,
  "ai_resource_mul": 0.7,
  "ai_unit_xp_bonus": 0
}
```

Applied at city-yield + unit-creation time in `mc-economy`.

### 2. Policy temperature

For `learned:*` controllers, a `temperature: f32` field on the controller config divides the logits before sampling:

```
softmax(logits / T)
```

- `T = 1.0` — base policy.
- `T > 1.0` — softer distribution, more random, easier.
- `T < 1.0` — sharper, near-greedy, harder.
- `T → 0` — argmax (deterministic).

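
The temperature rule above can be sketched in a few lines of plain Python. This is an illustration only, not the `WasmAiController` code: the function names `temperature_probs` and `sample_action` are hypothetical, and masking is done here by assigning illegal actions a logit of negative infinity, which has the same effect as a large negative constant. It assumes at least one action is legal.

```python
import math
import random


def temperature_probs(logits, legal_mask, temperature):
    """Return softmax(logits / T) over legal actions; illegal actions get 0."""
    legal = [i for i, ok in enumerate(legal_mask) if ok]
    if temperature <= 0:
        # T -> 0 limit: all probability mass on the best legal action.
        best = max(legal, key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    # Scale legal logits by 1/T; push illegal ones to -inf so exp() yields 0.
    scaled = [
        logit / temperature if ok else -math.inf
        for logit, ok in zip(logits, legal_mask)
    ]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, legal_mask, temperature, rng=random):
    """Draw one action index from the tempered, masked distribution."""
    probs = temperature_probs(logits, legal_mask, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Example: action 3 has the highest logit but is illegal this step.
logits = [2.0, 1.0, 0.0, 5.0]
mask = [True, True, True, False]
easy = temperature_probs(logits, mask, 1.5)   # flatter, more random
hard = temperature_probs(logits, mask, 0.3)   # sharper, near-greedy
greedy = temperature_probs(logits, mask, 0.0)  # argmax over legal actions
```

With the ladder values used later in this guide, `T = 1.5` spreads probability across the legal actions while `T = 0.3` concentrates it on the top legal choice; the illegal action never receives any mass at any temperature.
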
Implementation: ~10 LOC in `WasmAiController::decide_turn` (apply scaling before the wasm guest samples, OR pass T through as a guest parameter and let the guest apply it). Stage 6.5 work.

### Recommended Game 1 ladder

| Difficulty | Controller | T | Handicap |
|---|---|---|---|
| Settler | `scripted:peaceful` | n/a | AI ×0.7 |
| Chieftain | `scripted:default` | n/a | none |
| Warlord | `scripted:*` rotating | n/a | none |
| King | `learned:league-best` | 1.5 | none |
| Champion | `learned:league-best` | 0.3 | AI ×1.3 |

---

## Training infrastructure

### Hosts

- **Edit host (mac):** authoring; never trains.
- **Run host (apricot.lan):** 64-core Threadripper, 94 GB RAM, 2×3090. All training runs here.
- **Plum:** screenshot capture only; no training.

### Layout

```
tooling/rl_self_play/
├── train.py             # PPO loop, sb3-contrib MaskablePPO
├── evaluate.py          # Hard win-rate measurement
├── magic_civ_env.py     # Gymnasium wrapper + reward shaping
├── encoders.py          # PlayerView ↔ obs/action tensors
├── harness_client.py    # JSON-Lines subprocess to Godot headless
├── models//             # best_model.zip per training run
└── runs//               # tensorboard event files
```

`tooling/rl_self_play/models/` and `runs/` are gitignored (bulky; not artifacts of the source repo).

### Single-game training (duel)

```bash
ssh apricot.lan
cd ~/Code/@projects/@magic-civilization
python -m tooling.rl_self_play.train \
    --run-name duel-v1b \
    --total-steps 250000 \
    --num-envs 16 \
    --seed 7 \
    --device cuda:1
```

`--num-envs N` runs N parallel headless Godot subprocesses; Stable-Baselines3's `SubprocVecEnv` lock-steps them. Scaling is sub-linear because env-step is I/O-bound on JSON-Lines, not GPU-bound (the policy net is tiny). Past 16 envs per training run, returns diminish.

### Parallel seed runs

Three independent seeds in parallel claim ~3 × 16 = 48 worker subprocesses on apricot. Memory headroom: each Godot headless ~600 MB, so ~30 GB total — fits inside 94 GB with margin for the OS + other services.

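
The headroom estimate above is simple enough to check before launching a different seed or env count. A minimal sketch, assuming the approximate 600 MB per headless Godot process quoted above and decimal gigabytes; the helper name `godot_memory_budget` is hypothetical and not part of `tooling/rl_self_play`.

```python
def godot_memory_budget(seeds, envs_per_run, mb_per_env=600, host_ram_gb=94):
    """Estimate worker count, total RAM, and remaining headroom in GB."""
    workers = seeds * envs_per_run
    total_gb = workers * mb_per_env / 1000  # decimal GB, rough estimate
    return workers, total_gb, host_ram_gb - total_gb


# Three seeds at 16 envs each, as in the paragraph above.
workers, total_gb, headroom_gb = godot_memory_budget(seeds=3, envs_per_run=16)
print(f"{workers} workers, ~{total_gb:.1f} GB used, ~{headroom_gb:.1f} GB free")
```

The exact product is a little under the rounded 30 GB figure, so the text's estimate is on the conservative side.
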
### 12-FFA self-play league (Stage 6.9)

```bash
python -m tooling.rl_self_play.train \
    --run-name league-gen1 \
    --map-type 12ffa-huge \
    --opponent-pool models/league/gen0/best_model.zip \
    --total-steps 1000000 \
    --num-envs 4 \
    --seed 7 \
    --device cuda:1
```

12-slot games are ~10× a duel in wall-clock per env, BUT GPU is not the bottleneck (the policy is a ~50k-param MLP). Verified on apricot 2026-05-18: 8 concurrent 12-FFA envs ≈ 5 GB RAM, ~12 cores, both GPUs at < 5% utilization. 1M steps ≈ 3.5h per league generation.

---

## Save format & forward compatibility

Every save records the `controller_id` AND `controller_hash` per slot (SaveEnvelope v2). Loading a save with a controller the current install doesn't have yields a friendly error from `save_manager.gd::_validate_controllers_after_load`, not a crash mid-game.

**Mod authors:** never reuse a `controller_id` across incompatible weight versions. Bump the version suffix (`learned:duel-v1c`, not `learned:duel-v1b`) or saves from the old binary will mis-attribute to the new one.

---

## Ship-then-improve

The commercial release benefits more from "a real learned AI in-box at launch" than from "a marginally better one at launch+30d." Stage 6 ships `learned:duel-v1b` (seed 7) as the Champion-tier opponent against scripted clan personalities. Stages 6.5–6.9 build the encoder rewrite, recurrent policy, AlphaZero search, multi-step actions, and self-play league as a post-launch content series — each slots into the existing controller-registry infrastructure without engine changes.

See [`ai-roadmap.md`](./ai-roadmap.md) for the patch-by-patch narrative.

---

## 5-stage post-launch architecture roadmap

Engineering-side reference. Designer-facing narrative in [`ai-roadmap.md`](./ai-roadmap.md). Plan file: `~/.claude/plans/in-the-game-civilization-elegant-popcorn.md`.

### Stage 6.5 (v1.1) — Encoder rewrite + dynamic action space

Replace the 32-float hand-rolled observation with a multi-modal encoder:

- **Spatial block**: 60×60×K float tensor; channels {own_unit, enemy_unit, own_city, enemy_city, biome_id, substrate_id, river, improvement_id, fog, explored, resource_present, ...}. K ≈ 16.
- **Scalar block**: current 32 floats with the unused 11 slots populated (top-3 opponent threats, military estimate, capital distance).
- **Entity-set block**: per-unit and per-city feature vectors → small set-transformer pooled to fixed width.

Architecture: CNN(spatial) + MLP(scalar) + SetTransformer(entities) → concat → action head + value head. ~5M params, WASM-shippable via `tract`.

Companion changes:

- **Dynamic action space**: load `CITY_QUEUE_ITEMS` from `public/games/age-of-dwarves/data/buildings.json` + `units.json` at training start. Removes the 16-item hardcoding.
- **Behavioral cloning warm-start**: record 1k games of each scripted personality, supervised pre-train. Cold-start to ~50% baseline policy in ~30 min.
- **Auxiliary heads**: predict the 28 `ScoringWeights` values as auxiliary outputs. Free supervision signal.

### Stage 6.6 (v1.2) — Recurrent policy + per-opponent memory

- Switch to `sb3-contrib RecurrentMaskablePPO`. LSTM head (~128 hidden) between encoder and action head. Hidden state = session memory across turns.
- Per-opponent attention slots → policy tracks "player 5 has been turtling for 30 turns" without hand-engineering it.
- tract supports LSTM ops; WASM binary ~2× current.

### Stage 6.7 (v1.3) — AlphaZero search at inference

The single highest-leverage change. Engine hooks already exist (audit above).

- Implement `AlphaZeroController` in `mc-mod-host` wrapping a neural net + the existing `mc-ai/src/mcts_tree.rs` PUCT search.
- Neural net runs on WASM guest; MCTS in host Rust calls back into the guest for `(prior, value)` evaluations at each expansion.
- 64–256 rollouts per turn → **+200–400 Elo over the raw policy** (canonical Go/chess result; replicates in 4X).
- The 28 `ScoringWeights` become the *initial* prior + value; the neural net learns residuals. Even an undertrained net plays at scripted strength immediately.

### Stage 6.8 (v1.3) — Multi-step movement & strategic actions

- Expand per-unit action vocabulary beyond the 12 single-hex moves/attacks:
  - `move_to(target_hex)` — A* path planned by the simulator, executed multi-turn.
  - `rally(target_hex)` — set city/production-building rally point.
  - `patrol(waypoints)` — repeat-cycle scouting.
  - `escort(unit_id)` — move with a friendly unit.
- Already partially exist: `TacticalUnit.patrol_order` field; gdext `set_rally` request. Plumbing surfaces them in `legal_actions` + `encoders.py`.
- Action space grows 322 → ~800; masking handles per-step legality.

### Stage 6.9 (v1.4) — 12-FFA self-play league + specialist roster

See "Specialization via reward shaping" and "Difficulty system" sections above for the roster and ladder. League pipeline:

1. Freeze whatever 6.5–6.8 produces as `learned:league-gen0`.
2. Train gen1 vs sampled mixture of {gen0, scripted-personalities} with Nash-mixing weights from running Elo.
3. Freeze gen1; train gen2 vs {gen0, gen1, scripted}. Repeat.
4. Gen ≥ 5 → strong generalist. Round-robin tournament picks champion.

Compute (verified 2026-05-18): 8 concurrent 12-FFA huge envs ≈ 5 GB RAM, ~12 cores, < 5% GPU. 1M steps ≈ 3.5h per generation. Gen0 → gen5 iterates in a workday.

---

## Cross-references

- Modder contract: `docs/modding/ai-controller.md`
- ABI decisions: `docs/modding/abi-decisions.md`
- Plan file: `~/.claude/plans/in-the-game-civilization-elegant-popcorn.md`
- AiController trait: `src/simulator/crates/mc-player-api/src/controllers.rs`
- Reward shape: `tooling/rl_self_play/magic_civ_env.py`
- Observation encoder: `tooling/rl_self_play/encoders.py`