AI Production Guide — Magic Civilization
How the game ships AI, how learned policies are trained, how to add a new
specialist, and how difficulty levels are constructed. This is the
designer/engineering reference. The community-facing modder contract is
docs/modding/ai-controller.md.
TL;DR
- Two controller families, both selectable per slot:
  - scripted:* is the MCTS + heuristic AI from the mc-ai crate. Transparent, hand-tunable, fast. Anchors the named clan personalities.
  - learned:* is a neural policy trained with MaskablePPO. Strong, opaque. Anchors high-difficulty tiers and tournament play.
- Difficulty is orthogonal to controller choice — handicaps + policy temperature stacked on top of either family.
- Specialization (rush / turtle / tech / economy) is via different reward
functions on the same architecture, each a separate best_model.zip
shipped as its own controller.
- Strong-AI ceiling is raised by AlphaZero search at inference + the 12-FFA self-play league (Stage 6.7 + 6.9, post-launch).
Coverage matrix — what each AI actually knows
This is the load-bearing diagnostic. The current learned:duel-v1b ships
with a 32-float hand-rolled observation vector that throws away most of
the engine's state. The scripted AI reads everything via TacticalState +
28 ScoringWeights.
| Concept | Scripted (mc-ai + 28 weights) | Learned (encoders.py) |
|---|---|---|
| Terrain biome / substrate / yields per tile | ✓ TacticalTile (state.rs:76-90) | ✗ encoder discards all tile data (encoders.py:18) |
| Flora / fauna entities | ✗ (not on wire) | ✗ |
| Production buildings | ✓ full data-driven catalog | ✗ hardcoded 16-item list (CITY_QUEUE_ITEMS) |
| Research / tech tree | ✓ via gate prerequisites | ✗ only science_per_turn |
| Strategic resource gating | ✓ strategic_resources | ✗ never reads stockpile |
| Rally / patrol / scout actions | △ stored, not actively issued | ✗ not in action space |
| Diplomacy detail + personality | ✓ 6 strategic axes per opponent | ✗ only war/peace/borders counts |
| Tiles worked per city | ✓ tiles_worked | ✗ |
| Multi-step pathfinding at decision time | ✗ 1-action lookahead | ✗ 1-action lookahead |
The engine is not the bottleneck. PlayerView already exposes every
piece of state in the left column (TileView carries biome / substrate /
river / improvement / visible / explored; CityView.buildable[] carries
the full catalog; ResearchView carries the whole tech tree; per-opponent
DiplomacyView is on the wire). The encoder is the bottleneck.
This matrix drives the 5-stage roadmap in
ai-roadmap.md.
AlphaZero-readiness audit (2026-05-18)
The codebase is already structured for an AlphaZero-grade learned AI; the hooks exist but nothing is plugged into them.
| Hook | Location | Status |
|---|---|---|
| PUCT tree MCTS with action priors + value-head rollout | mc-ai/src/mcts_tree.rs:62-249 | ✓ ready; both TreeState::action_prior() and TreeState::rollout() are overridable |
| Per-tile spatial state (CNN-ready) | mc-player-api/src/view.rs:191-212 | ✓ all channels present in TileView |
| Controller registry trait | mc-player-api/src/controllers.rs:58-150 | ✓ a future AlphaZeroController plugs in alongside scripted/learned |
| 28 evaluator weights as auxiliary loss targets | mc-core/src/scoring_weights.rs:174 | ✓ N=28 scalar fields, ready-made supervision signal |
| Fog-of-war + visibility filter | mc-observation/src/fog.rs + mc-vision::compute_vision() | ✓ wired into projection; policy never sees data it shouldn't |
How a learned policy actually works
Not seed search. The seed sets the RNG for weight initialization + environment rollout order. Different seeds produce different local optima of the same learning process; we run multiple seeds because PPO is high-variance.
Weight optimization via gradient descent. Concretely:
- Policy network. A small MLP (~2 hidden layers, ~64 units) maps
observation (32 floats) → action distribution (322 logits). Weights start random.
- Rollout. The policy samples action a from the current state s; the environment returns reward r and next state s'. Collect ~512 such transitions in a buffer.
- Advantage. A critic network predicts expected return per state.
Advantage A(s, a) = actual_return − critic_prediction. Positive advantage = action was better than baseline; negative = worse.
- PPO update. Gradient-ascend the policy weights to make positive-advantage actions more probable and negative ones less. The ratio of new to old action probability is clipped to [0.8, 1.2], so a single update gains nothing from moving a probability by more than about 20% (the "proximal" in PPO).
- Repeat for 250k–1M environment steps. Weights drift from random to "actions that win games."
Three parallel seeds = three independent fits. We ship the best by tournament win-rate; the others are discarded.
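The advantage and clipped-update steps above can be sketched in plain Python. This illustrates the textbook PPO clipped surrogate for a single transition; it is not the sb3-contrib implementation. The function names are hypothetical, and the 0.2 clip range is an assumption chosen to match the 20% figure above.

```python
def advantage(actual_return: float, critic_prediction: float) -> float:
    """A(s, a) = actual_return - critic_prediction."""
    return actual_return - critic_prediction


def clipped_surrogate(new_prob: float, old_prob: float, adv: float,
                      clip_range: float = 0.2) -> float:
    """Textbook PPO clipped objective for one (state, action) sample."""
    ratio = new_prob / old_prob
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Taking the minimum removes any gain from pushing the ratio outside
    # [1 - clip_range, 1 + clip_range] in the direction the advantage favours.
    return min(ratio * adv, clipped_ratio * adv)


adv = advantage(actual_return=1.0, critic_prediction=0.4)

# Probability moved 0.30 -> 0.33 (ratio 1.1): inside the clip window.
small_step = clipped_surrogate(new_prob=0.33, old_prob=0.30, adv=adv)

# Probability moved 0.30 -> 0.60 (ratio 2.0): objective is capped at 1.2 * adv.
big_step = clipped_surrogate(new_prob=0.60, old_prob=0.30, adv=adv)
```

Doubling the action probability earns no more objective than a 20% increase would, which is what keeps each update close to the previous policy.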
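Action masking. MaskablePPO applies a legal-action mask to the logits
before sampling: logits of illegal actions are replaced with a very large
negative value, so their probability after softmax is effectively zero and
the policy can never propose an illegal action. The mask comes from
encode_legal_actions() in tooling/rl_self_play/encoders.py.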
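A minimal sketch of the masking step, assuming the common approach of replacing illegal logits with negative infinity before the softmax. The helper name masked_softmax is hypothetical; the real mask layout is whatever encode_legal_actions() produces.

```python
import math


def masked_softmax(logits, legal_mask):
    """Softmax over logits with illegal actions forced to probability 0."""
    # Replace (not multiply) illegal logits: a logit of 0 would still get
    # non-zero probability, whereas -inf exponentiates to exactly 0.
    masked = [x if ok else -math.inf for x, ok in zip(logits, legal_mask)]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("at least one action must be legal")
    exps = [math.exp(x - peak) for x in masked]  # subtract peak for stability
    total = sum(exps)
    return [e / total for e in exps]


# Four actions; the second and fourth are illegal this step.
probs = masked_softmax([2.0, 1.0, 0.5, 3.0], [True, False, True, False])
```

Note that the illegal action with the highest raw logit (3.0) ends up with zero probability, and the remaining mass is renormalised over the legal actions only.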
Controller families
scripted:*
| Controller ID | Use |
|---|---|
| scripted:default | The general-purpose MCTS+heuristic AI; default for unknown ids. |
| scripted:warmonger | Personality: war-weight 2.0, expansion-weight 1.5. |
| scripted:builder | Personality: economy-weight 2.0, war-weight 0.5. |
| scripted:tinkersmith | Personality: tech-weight 2.5, military-weight 1.0. |
| scripted:peaceful | Personality: war-weight 0.3, diplomacy-weight 2.0. |
| scripted:opportunist | Personality: dynamic re-weighting from situation. |
Personalities are pure data in public/games/age-of-dwarves/data/ai_personalities.json.
Adding a new one is a JSON edit; no Rust changes.
learned:*
| Controller ID | Use |
|---|---|
| learned:duel-v1b | First in-box learned mod (Stage 6). Trained on duel maps vs scripted baseline; generalist. |
| learned:rush (Stage 6.5) | Reward-shaped for early military pressure. |
| learned:turtle (Stage 6.5) | Reward-shaped for defensive consolidation. |
| learned:tech (Stage 6.5) | Reward-shaped for research throughput. |
| learned:economy (Stage 6.5) | Reward-shaped for gold + city count. |
| learned:league-genN (Stage 6.9) | Self-play league generations. |
Each learned:* ships as its own .wasm mod (~400 KB after ONNX → tract
compile). Native .so/.dll/.dylib variants ship signed for users opting
into the faster path.
Specialization via reward shaping
Same network architecture (encoders.py + 2-layer MLP). Different reward
function trained with the same train.py loop. Each variant produces a
separate best_model.zip registered as a distinct controller.
Baseline reward (current magic_civ_env.py):
+1.0 on win (game_over event, winner == me)
-1.0 on loss (game_over event, winner != me)
+1e-2 per turn advance
+1e-3 per score_estimate delta
-5e-4 per step (anti-stalling)
Specialist overlays (added on top of baseline):
| Variant | Extra reward terms |
|---|---|
| rush | +0.5 per enemy unit killed before turn 80; -1e-2 per turn after turn 80 |
| turtle | +0.05 per friendly unit fortified-on-defense-tile; +0.1 per wall built |
| tech | +5e-3 per science_per_turn delta; +0.3 per tech unlocked |
| economy | +1e-3 per gold-reserve delta; +0.5 per city founded |
Tuning rule: extras must sum to less than the terminal ±1.0 across a
typical game, otherwise the policy learns the shaping signal instead of
winning. Validate per-variant: evaluate.py must show win-rate ≥ baseline
when the specialist is used, not just "the specialist's shaping signal is
higher."
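One way to sanity-check the tuning rule before a training run is to multiply each overlay term by how often it is expected to fire in a typical game. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the event counts are made-up placeholders rather than measured values, and summing absolute magnitudes is a deliberately conservative reading of "sum to less than ±1.0".

```python
def shaping_total(terms):
    """Sum |reward per event| * expected events per game over all terms."""
    return sum(abs(reward) * expected for reward, expected in terms.values())


def within_budget(terms, terminal_reward=1.0):
    """True when the overlay cannot outweigh the terminal win/loss signal."""
    return shaping_total(terms) < terminal_reward


# Economy overlay from the table above; the counts are placeholder guesses.
economy_one_city = {
    "gold_reserve_delta": (1e-3, 300),  # assumed 300 gold of net growth
    "city_founded": (0.5, 1),           # assumed 1 extra city
}
economy_four_cities = {
    "gold_reserve_delta": (1e-3, 300),
    "city_founded": (0.5, 4),           # assumed 4 extra cities
}

ok_small = within_budget(economy_one_city)     # 0.3 + 0.5 = 0.8
ok_wide = within_budget(economy_four_cities)   # 0.3 + 2.0 = 2.3
```

A check like this only bounds the shaping signal; the evaluate.py win-rate comparison is still the deciding test.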
Adding a new specialist:
- Add an entry to the magic_civ_env.py::RewardOverlay enum + the shaping logic in step().
- Run train.py --reward-overlay <name> --total-steps 250000 --seed 7.
- Evaluate vs scripted:default and vs learned:duel-v1b.
- If win-rate ≥ 0.55 against both, ship as learned:<name>.
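The first step might look roughly like the sketch below. Only the RewardOverlay name and the reward coefficients come from this guide; the enum members, the overlay_reward helper, and the state field names (science_per_turn, techs_unlocked, gold_reserve, city_count) are hypothetical stand-ins for whatever magic_civ_env.py actually tracks.

```python
from enum import Enum


class RewardOverlay(Enum):
    NONE = "none"
    TECH = "tech"
    ECONOMY = "economy"


def overlay_reward(overlay, prev, curr):
    """Extra shaping reward for one step, added on top of the baseline."""
    if overlay is RewardOverlay.TECH:
        return (5e-3 * (curr["science_per_turn"] - prev["science_per_turn"])
                + 0.3 * (curr["techs_unlocked"] - prev["techs_unlocked"]))
    if overlay is RewardOverlay.ECONOMY:
        return (1e-3 * (curr["gold_reserve"] - prev["gold_reserve"])
                + 0.5 * (curr["city_count"] - prev["city_count"]))
    return 0.0


# Hypothetical snapshots of the same player on consecutive steps.
prev = {"science_per_turn": 10, "techs_unlocked": 3, "gold_reserve": 100, "city_count": 2}
curr = {"science_per_turn": 14, "techs_unlocked": 4, "gold_reserve": 150, "city_count": 2}

tech_bonus = overlay_reward(RewardOverlay.TECH, prev, curr)
economy_bonus = overlay_reward(RewardOverlay.ECONOMY, prev, curr)
```

Keeping the overlay a pure function of two snapshots makes it easy to unit-test a new specialist's shaping before spending a training run on it.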
Difficulty system
Difficulty is never "a weaker neural net." Two orthogonal levers:
1. Resource handicaps
Per-difficulty multipliers in
public/games/age-of-dwarves/data/difficulty.json (schema TBD):
{
"id": "settler",
"human_resource_mul": 1.0,
"ai_resource_mul": 0.7,
"ai_unit_xp_bonus": 0
}
Applied at city-yield + unit-creation time in mc-economy.
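As an illustration of how such a multiplier could be applied, here is a small Python sketch. The real application lives in the Rust mc-economy crate and the schema is still TBD, so the scaled_yield helper and its round-to-nearest behaviour are assumptions; only the field names and values are taken from the example entry above.

```python
import json

# The example entry from above, inlined so the sketch is self-contained.
DIFFICULTY_JSON = """
{"id": "settler", "human_resource_mul": 1.0,
 "ai_resource_mul": 0.7, "ai_unit_xp_bonus": 0}
"""


def scaled_yield(base_yield: int, is_ai: bool, difficulty: dict) -> int:
    """Apply the per-difficulty resource multiplier to one city yield."""
    key = "ai_resource_mul" if is_ai else "human_resource_mul"
    # Rounding to nearest is an assumption; the engine may truncate instead.
    return round(base_yield * difficulty[key])


settler = json.loads(DIFFICULTY_JSON)
ai_food = scaled_yield(10, is_ai=True, difficulty=settler)
human_food = scaled_yield(10, is_ai=False, difficulty=settler)
```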
2. Policy temperature
For learned:* controllers, a temperature: f32 field on the controller
config divides the logits before sampling:
softmax(logits / T)
- T = 1.0: base policy.
- T > 1.0: softer distribution, more random, easier.
- T < 1.0: sharper, near-greedy, harder.
- T → 0: argmax (deterministic).
Implementation: ~10 LOC in WasmAiController::decide_turn (apply scaling
before the wasm guest samples, OR pass T through as a guest parameter and
let the guest apply it). Stage 6.5 work.
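The softmax(logits / T) scaling is easy to see on a toy example. This is a stdlib-only sketch, not the WasmAiController code; the function name is hypothetical and treating T ≤ 0 as a hard argmax is an assumption about how the T → 0 limit would be handled.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return action probabilities for softmax(logits / T)."""
    if temperature <= 0:
        # T -> 0 limit: all probability mass on the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
base = softmax_with_temperature(logits, 1.0)    # base policy
easy = softmax_with_temperature(logits, 1.5)    # flatter, more random
hard = softmax_with_temperature(logits, 0.3)    # sharper, near-greedy
greedy = softmax_with_temperature(logits, 0.0)  # deterministic argmax
```

The probability of the top action grows monotonically as T shrinks, which is why the same weights can serve both an easier and a harder tier.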
Recommended Game 1 ladder
| Difficulty | Controller | T | Handicap |
|---|---|---|---|
| Settler | scripted:peaceful | n/a | AI ×0.7 |
| Chieftain | scripted:default | n/a | none |
| Warlord | scripted:* rotating | n/a | none |
| King | learned:league-best | 1.5 | none |
| Champion | learned:league-best | 0.3 | AI ×1.3 |
Training infrastructure
Hosts
- Edit host (mac): authoring; never trains.
- Run host (apricot.lan): 64-core Threadripper, 94 GB RAM, 2×3090. All training runs here.
- Plum: screenshot capture only; no training.
Layout
tooling/rl_self_play/
├── train.py # PPO loop, sb3-contrib MaskablePPO
├── evaluate.py # Hard win-rate measurement
├── magic_civ_env.py # Gymnasium wrapper + reward shaping
├── encoders.py # PlayerView ↔ obs/action tensors
├── harness_client.py # JSON-Lines subprocess to Godot headless
├── models/<run-name>/ # best_model.zip per training run
└── runs/<run-name>/ # tensorboard event files
tooling/rl_self_play/models/ and runs/ are gitignored (bulky; not
artifacts of the source repo).
Single-game training (duel)
ssh apricot.lan
cd ~/Code/@projects/@magic-civilization
python -m tooling.rl_self_play.train \
--run-name duel-v1b \
--total-steps 250000 \
--num-envs 16 \
--seed 7 \
--device cuda:1
--num-envs N runs N parallel headless Godot subprocesses; sb3-contrib
SubprocVecEnv lock-steps them. Scaling is sub-linear because env-step is
I/O-bound on JSON-Lines, not GPU-bound (the policy net is tiny). Past 16
envs per training run, returns diminish.
Parallel seed runs
Three independent seeds in parallel claim ~3 × 16 = 48 worker subprocesses on apricot. Memory headroom: each Godot headless ~600 MB, so ~30 GB total — fits inside 94 GB with margin for the OS + other services.
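The headroom estimate can be reproduced with a few lines of arithmetic. All inputs are the figures quoted above; the constant names are just labels for this sketch.

```python
SEEDS = 3
ENVS_PER_SEED = 16
GODOT_HEADLESS_MB = 600   # approximate per-process footprint quoted above
HOST_RAM_GB = 94

workers = SEEDS * ENVS_PER_SEED                # 48 Godot subprocesses
total_gb = workers * GODOT_HEADLESS_MB / 1000  # 28.8 GB, i.e. roughly 30 GB
headroom_gb = HOST_RAM_GB - total_gb           # left for the OS and other services
```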
12-FFA self-play league (Stage 6.9)
python -m tooling.rl_self_play.train \
--run-name league-gen1 \
--map-type 12ffa-huge \
--opponent-pool models/league/gen0/best_model.zip \
--total-steps 1000000 \
--num-envs 4 \
--seed 7 \
--device cuda:1
12-slot games are ~10× a duel in wall-clock per env, BUT GPU is not the bottleneck (the policy is a ~50k-param MLP). Verified on apricot 2026-05-18: 8 concurrent 12-FFA envs ≈ 5 GB RAM, ~12 cores, both GPUs at < 5% utilization. 1M steps ≈ 3.5h per league generation.
Save format & forward compatibility
Every save records the controller_id AND controller_hash per slot
(SaveEnvelope v2). Loading a save with a controller the current install
doesn't have yields a friendly error from
save_manager.gd::_validate_controllers_after_load, not a crash mid-game.
Mod authors: never reuse a controller_id across incompatible weight
versions. Bump the version suffix (learned:duel-v1c, not learned:duel-v1b)
or saves from the old binary will mis-attribute to the new one.
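The check can be pictured as a lookup of each slot's (controller_id, controller_hash) pair against what the current install provides. The real logic is GDScript in save_manager.gd; the Python below is only a sketch, and the function name, data shapes, and hash strings are hypothetical placeholders.

```python
def validate_controllers(save_slots, installed):
    """Return human-readable problems; an empty list means the save can load.

    save_slots: {slot_index: (controller_id, controller_hash)}
    installed:  {controller_id: controller_hash}
    """
    problems = []
    for slot, (controller_id, controller_hash) in sorted(save_slots.items()):
        if controller_id not in installed:
            problems.append(f"slot {slot}: controller '{controller_id}' is not installed")
        elif installed[controller_id] != controller_hash:
            # Same id, different weights: the case a reused controller_id causes.
            problems.append(f"slot {slot}: controller '{controller_id}' does not match the saved hash")
    return problems


installed = {"scripted:default": "hash-a", "learned:duel-v1b": "hash-b"}
save_slots = {
    0: ("scripted:default", "hash-a"),   # matches
    1: ("learned:duel-v1b", "hash-c"),   # id reused with different weights
    2: ("learned:rush", "hash-d"),       # not installed at all
}
problems = validate_controllers(save_slots, installed)
```

Recording the hash alongside the id is what lets the loader tell "missing controller" apart from "same name, different weights".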
Ship-then-improve
The commercial release benefits more from "a real learned AI in-box at
launch" than from "a marginally better one at launch+30d." Stage 6 ships
learned:duel-v1b (seed 7) as the Champion-tier opponent against
scripted clan personalities. Stages 6.5–6.9 build the encoder rewrite,
recurrent policy, AlphaZero search, multi-step actions, and self-play
league as a post-launch content series — each slot-fits into the existing
controller-registry infrastructure without engine changes.
See ai-roadmap.md for the patch-by-patch narrative.
5-stage post-launch architecture roadmap
Engineering-side reference. Designer-facing narrative in
ai-roadmap.md. Plan file:
~/.claude/plans/in-the-game-civilization-elegant-popcorn.md.
Stage 6.5 (v1.1) — Encoder rewrite + dynamic action space
Replace the 32-float hand-rolled observation with a multi-modal encoder:
- Spatial block: 60×60×K float tensor; channels {own_unit, enemy_unit, own_city, enemy_city, biome_id, substrate_id, river, improvement_id, fog, explored, resource_present, ...}. K ≈ 16.
- Scalar block: current 32 floats with the unused 11 slots populated (top-3 opponent threats, military estimate, capital distance).
- Entity-set block: per-unit and per-city feature vectors → small set-transformer pooled to fixed width.
Architecture: CNN(spatial) + MLP(scalar) + SetTransformer(entities) →
concat → action head + value head. ~5M params, WASM-shippable via tract.
Companion changes:
- Dynamic action space: load CITY_QUEUE_ITEMS from public/games/age-of-dwarves/data/buildings.json + units.json at training start. Removes the 16-item hardcoding.
- Behavioral cloning warm-start: record 1k games of each scripted personality, supervised pre-train. Cold-start to ~50% baseline policy in ~30 min.
- Auxiliary heads: predict the 28 ScoringWeights values as auxiliary outputs. Free supervision signal.
Stage 6.6 (v1.2) — Recurrent policy + per-opponent memory
- Switch to sb3-contrib RecurrentMaskablePPO. LSTM head (~128 hidden) between encoder and action head. Hidden state = session memory across turns.
- Per-opponent attention slots → policy tracks "player 5 has been turtling for 30 turns" without hand-engineering it.
- tract supports LSTM ops; WASM binary ~2× current.
Stage 6.7 (v1.3) — AlphaZero search at inference
The single highest-leverage change. Engine hooks already exist (audit above).
- Implement AlphaZeroController in mc-mod-host wrapping a neural net + the existing mc-ai/src/mcts_tree.rs PUCT search.
- Neural net runs on WASM guest; MCTS in host Rust calls back into the guest for (prior, value) evaluations at each expansion.
- 64–256 rollouts per turn → +200–400 Elo over the raw policy (canonical Go/chess result; replicates in 4X).
- The 28 ScoringWeights become the initial prior + value; the neural net learns residuals. Even an undertrained net plays at scripted strength immediately.
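For readers unfamiliar with PUCT, the selection rule that combines the network prior with search statistics can be sketched as follows. This is the textbook AlphaZero form written in Python for illustration; it is not a transcription of mcts_tree.rs, and the function names, the dict layout, and the c_puct value of 1.5 are assumptions.

```python
import math


def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.5):
    """Mean value so far plus a prior-weighted exploration bonus."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + exploration


def select_child(children, c_puct=1.5):
    """Pick the child with the highest PUCT score at one tree node."""
    parent_visits = sum(c["visits"] for c in children)
    return max(
        children,
        key=lambda c: puct_score(c["q"], c["prior"], parent_visits, c["visits"], c_puct),
    )


# Hypothetical node with three candidate actions.
children = [
    {"action": "move", "q": 0.10, "prior": 0.65, "visits": 10},
    {"action": "attack", "q": 0.30, "prior": 0.30, "visits": 5},
    {"action": "fortify", "q": 0.00, "prior": 0.05, "visits": 0},
]
chosen = select_child(children)
```

The prior steers early visits and the bonus decays as a child accumulates visits, so a heuristic prior is gradually overridden by what the search actually observes.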
Stage 6.8 (v1.3) — Multi-step movement & strategic actions
- Expand per-unit action vocabulary beyond the 12 single-hex moves/attacks:
  - move_to(target_hex): A* path planned by the simulator, executed multi-turn.
  - rally(target_hex): set city/production-building rally point.
  - patrol(waypoints): repeat-cycle scouting.
  - escort(unit_id): move with a friendly unit.
- Already partially exist: TacticalUnit.patrol_order field; gdext set_rally request. Plumbing surfaces them in legal_actions + encoders.py.
- Action space grows 322 → ~800; masking handles per-step legality.
Stage 6.9 (v1.4) — 12-FFA self-play league + specialist roster
See "Specialization via reward shaping" and "Difficulty system" sections above for the roster and ladder. League pipeline:
- Freeze whatever 6.5–6.8 produces as learned:league-gen0.
- Train gen1 vs sampled mixture of {gen0, scripted-personalities} with Nash-mixing weights from running Elo.
- Freeze gen1; train gen2 vs {gen0, gen1, scripted}. Repeat.
- Gen ≥ 5 → strong generalist. Round-robin tournament picks champion.
Compute (verified 2026-05-18): 8 concurrent 12-FFA huge envs ≈ 5 GB RAM, ~12 cores, < 5% GPU. 1M steps ≈ 3.5h per generation. Gen0 → gen5 iterates in a workday.
Cross-references
- Modder contract: docs/modding/ai-controller.md
- ABI decisions: docs/modding/abi-decisions.md
- Plan file: ~/.claude/plans/in-the-game-civilization-elegant-popcorn.md
- AiController trait: src/simulator/crates/mc-player-api/src/controllers.rs
- Reward shape: tooling/rl_self_play/magic_civ_env.py
- Observation encoder: tooling/rl_self_play/encoders.py