AI Production Guide — Magic Civilization

How the game ships AI, how learned policies are trained, how to add a new specialist, and how difficulty levels are constructed. This is the designer/engineering reference. The community-facing modder contract is docs/modding/ai-controller.md.


TL;DR

  • Two controller families, both selectable per slot:
    • scripted:* — MCTS + heuristic AI from the mc-ai crate. Transparent, hand-tunable, fast. Anchors the named clan personalities.
    • learned:* — neural policy trained with MaskablePPO. Strong, opaque. Anchors high-difficulty tiers and tournament play.
  • Difficulty is orthogonal to controller choice — handicaps + policy temperature stacked on top of either family.
  • Specialization (rush / turtle / tech / economy) is via different reward functions on the same architecture, each a separate best_model.zip shipped as its own controller.
  • Strong-AI ceiling is raised by AlphaZero search at inference + the 12-FFA self-play league (Stage 6.7 + 6.9, post-launch).

Coverage matrix — what each AI actually knows

This is the load-bearing diagnostic. The current learned:duel-v1b ships with a 32-float hand-rolled observation vector that throws away most of the engine's state. The scripted AI reads everything via TacticalState + 28 ScoringWeights.

| Concept | Scripted (mc-ai + 28 weights) | Learned (encoders.py) |
|---|---|---|
| Terrain biome / substrate / yields per tile | TacticalTile (state.rs:76-90) | ✗ encoder discards all tile data (encoders.py:18) |
| Flora / fauna entities | | ✗ (not on wire) |
| Production buildings | ✓ full data-driven catalog | hardcoded 16-item list (CITY_QUEUE_ITEMS) |
| Research / tech tree | ✓ via gate prerequisites | ✗ only science_per_turn |
| Strategic resource gating | strategic_resources | ✗ never reads stockpile |
| Rally / patrol / scout actions | △ stored, not actively issued | ✗ not in action space |
| Diplomacy detail + personality | ✓ 6 strategic axes per opponent | ✗ only war/peace/borders counts |
| Tiles worked per city | tiles_worked | |
| Multi-step pathfinding at decision time | ✗ 1-action lookahead | ✗ 1-action lookahead |

The engine is not the bottleneck. PlayerView already exposes every piece of state in the left column (TileView carries biome / substrate / river / improvement / visible / explored; CityView.buildable[] carries the full catalog; ResearchView carries the whole tech tree; per-opponent DiplomacyView is on the wire). The encoder is the bottleneck.

This matrix drives the 5-stage roadmap in ai-roadmap.md.


AlphaZero-readiness audit (2026-05-18)

The codebase is already structured for an AlphaZero-grade learned AI; the hooks exist but nothing is plugged into them.

| Hook | Location | Status |
|---|---|---|
| PUCT tree MCTS with action priors + value-head rollout | mc-ai/src/mcts_tree.rs:62-249 | ✓ ready; both TreeState::action_prior() and TreeState::rollout() are overridable |
| Per-tile spatial state (CNN-ready) | mc-player-api/src/view.rs:191-212 | ✓ all channels present in TileView |
| Controller registry trait | mc-player-api/src/controllers.rs:58-150 | ✓ a future AlphaZeroController plugs in alongside scripted/learned |
| 28 evaluator weights as auxiliary loss targets | mc-core/src/scoring_weights.rs:174 | ✓ N=28 scalar fields, ready-made supervision signal |
| Fog-of-war + visibility filter | mc-observation/src/fog.rs + mc-vision::compute_vision() | ✓ wired into projection; policy never sees data it shouldn't |

How a learned policy actually works

Not seed search. The seed sets the RNG for weight initialization + environment rollout order. Different seeds produce different local optima of the same learning process; we run multiple seeds because PPO is high-variance.

Weight optimization via gradient descent. Concretely:

  1. Policy network. A small MLP (~2 hidden layers, ~64 units) maps observation (32 floats) → action distribution (322 logits). Weights start random.
  2. Rollout. Policy samples action a from current state s; environment returns reward r and next state s'. Collect ~512 such transitions in a buffer.
  3. Advantage. A critic network predicts expected return per state. Advantage A(s, a) = actual_return − critic_prediction. Positive advantage = action was better than baseline; negative = worse.
  4. PPO update. Gradient-ascend the policy weights to make positive-advantage actions more probable and negative-advantage ones less, clipped so that a single update cannot change an action's probability ratio by more than about 20% (the "proximal" in PPO).
  5. Repeat for 250k–1M environment steps. Weights drift from random to "actions that win games."
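Steps 3 and 4 above can be sketched in a few lines of plain Python. This is an illustration only, not the code in train.py: sb3-contrib computes advantages with GAE over batched tensors, whereas this sketch uses plain returns-to-go, and the function names and the 3-step episode are made up for the example.

```python
import math


def returns_to_go(rewards, gamma=0.99):
    """Discounted return from each step to the end of one finished episode."""
    out, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        out.append(running)
    return out[::-1]


def advantages(returns, critic_values):
    """A(s, a) = actual_return - critic_prediction, per step."""
    return [g - v for g, v in zip(returns, critic_values)]


def clipped_surrogate(new_logp, old_logp, advantage, clip=0.2):
    """PPO clipped objective for a single sample (to be maximized)."""
    ratio = math.exp(new_logp - old_logp)
    clipped = max(1.0 - clip, min(1.0 + clip, ratio))
    # Taking the min makes the objective pessimistic, so pushing the ratio
    # outside [1 - clip, 1 + clip] earns no extra credit.
    return min(ratio * advantage, clipped * advantage)


# Hypothetical 3-step episode that ends in a win (+1.0 terminal reward),
# with made-up critic predictions for each state.
episode_returns = returns_to_go([0.0, 0.0, 1.0], gamma=1.0)
episode_adv = advantages(episode_returns, [0.5, 0.6, 0.7])
```

With a positive advantage, doubling an action's probability is credited as if the ratio were only 1.2; with a negative advantage the unclipped (worse) value is kept, which is what stops a single update from over-committing.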

Three parallel seeds = three independent fits. We ship the best by tournament win-rate; the others are discarded.

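Action masking. MaskablePPO applies a legal-action mask to the logits before sampling: the logits of illegal actions are replaced with a very large negative value, so their probability is effectively zero and the policy can never propose an illegal action. (Multiplying logits by a 0/1 mask would not work, since a logit of 0 still carries probability mass.) The mask comes from encode_legal_actions() in tooling/rl_self_play/encoders.py.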
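A minimal sketch of logit masking, assuming a flat list of logits and a boolean mask of the same length. The function name is hypothetical; the real mask is produced by encode_legal_actions() and applied inside sb3-contrib.

```python
import math


def masked_softmax(logits, legal_mask):
    """Softmax over legal actions only; illegal actions get probability 0."""
    if len(logits) != len(legal_mask):
        raise ValueError("logits and mask must have the same length")
    # Illegal actions are sent to -inf rather than multiplied by 0.
    masked = [x if ok else -math.inf for x, ok in zip(logits, legal_mask)]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("at least one action must be legal")
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Three actions, the middle one illegal this step.
probs = masked_softmax([2.0, 0.0, 1.0], [True, False, True])
```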


Controller families

scripted:*

| Controller ID | Use |
|---|---|
| scripted:default | The general-purpose MCTS+heuristic AI; default for unknown ids. |
| scripted:warmonger | Personality: war-weight 2.0, expansion-weight 1.5. |
| scripted:builder | Personality: economy-weight 2.0, war-weight 0.5. |
| scripted:tinkersmith | Personality: tech-weight 2.5, military-weight 1.0. |
| scripted:peaceful | Personality: war-weight 0.3, diplomacy-weight 2.0. |
| scripted:opportunist | Personality: dynamic re-weighting from situation. |

Personalities are pure data in public/games/age-of-dwarves/data/ai_personalities.json. Adding a new one is a JSON edit; no Rust changes.

learned:*

| Controller ID | Use |
|---|---|
| learned:duel-v1b | First in-box learned mod (Stage 6). Trained on duel maps vs scripted baseline; generalist. |
| learned:rush | (Stage 6.5) Reward-shaped for early military pressure. |
| learned:turtle | (6.5) Reward-shaped for defensive consolidation. |
| learned:tech | (6.5) Reward-shaped for research throughput. |
| learned:economy | (6.5) Reward-shaped for gold + city count. |
| learned:league-genN | (6.5) Self-play league generations. |

Each learned:* ships as its own .wasm mod (~400 KB after ONNX → tract compile). Native .so/.dll/.dylib variants ship signed for users opting into the faster path.


Specialization via reward shaping

Same network architecture (encoders.py + 2-layer MLP). Different reward function trained with the same train.py loop. Each variant produces a separate best_model.zip registered as a distinct controller.

Baseline reward (current magic_civ_env.py):

+1.0   on win  (game_over event, winner == me)
-1.0   on loss (game_over event, winner != me)
+1e-2  per turn advance
+1e-3  per score_estimate delta
-5e-4  per step (anti-stalling)

Specialist overlays (added on top of baseline):

| Variant | Extra reward terms |
|---|---|
| rush | +0.5 per enemy unit killed before turn 80; -1e-2 per turn after turn 80 |
| turtle | +0.05 per friendly unit fortified-on-defense-tile; +0.1 per wall built |
| tech | +5e-3 per science_per_turn delta; +0.3 per tech unlocked |
| economy | +1e-3 per gold-reserve delta; +0.5 per city founded |

Tuning rule: extras must sum to less than the terminal ±1.0 across a typical game, otherwise the policy learns the shaping signal instead of winning. Validate per-variant: evaluate.py must show win-rate ≥ baseline when the specialist is used, not just "the specialist's shaping signal is higher."
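The tuning rule can be checked mechanically before a long training run. The sketch below copies the overlay weights from the table above, but the event names, the helper functions, and the per-game event counts are hypothetical; real counts would come from logged episodes.

```python
# Overlay weights from the specialist table; event names are illustrative only.
OVERLAYS = {
    "rush": {"enemy_unit_killed_before_t80": 0.5, "turn_after_t80": -1e-2},
    "turtle": {"unit_fortified_on_defense_tile": 0.05, "wall_built": 0.1},
    "tech": {"science_per_turn_delta": 5e-3, "tech_unlocked": 0.3},
    "economy": {"gold_reserve_delta": 1e-3, "city_founded": 0.5},
}
TERMINAL_REWARD = 1.0


def shaping_total(overlay, event_counts):
    """Sum of overlay reward over one game, given per-event totals."""
    weights = OVERLAYS[overlay]
    # An unknown event name raises KeyError, which catches typos early.
    return sum(weights[name] * count for name, count in event_counts.items())


def within_budget(overlay, event_counts):
    """True if the shaping signal stays below the terminal win/loss reward."""
    return abs(shaping_total(overlay, event_counts)) < TERMINAL_REWARD


# Made-up per-game totals for the tech overlay.
modest_game = {"science_per_turn_delta": 40, "tech_unlocked": 1}
heavy_game = {"science_per_turn_delta": 40, "tech_unlocked": 4}
```

Under these invented counts the modest game stays inside the budget (about 0.5) while the heavy one does not (about 1.4), which is the situation where the policy is at risk of optimizing the shaping signal instead of the win.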

Adding a new specialist:

  1. Add an entry to magic_civ_env.py::RewardOverlay enum + the shaping logic in step().
  2. Run train.py --reward-overlay <name> --total-steps 250000 --seed 7.
  3. Evaluate vs scripted:default and vs learned:duel-v1b.
  4. If win-rate ≥ 0.55 against both, ship as learned:<name>.

Difficulty system

Difficulty is never "a weaker neural net." Two orthogonal levers:

1. Resource handicaps

Per-difficulty multipliers in public/games/age-of-dwarves/data/difficulty.json (schema TBD):

{
  "id": "settler",
  "human_resource_mul": 1.0,
  "ai_resource_mul": 0.7,
  "ai_unit_xp_bonus": 0
}

Applied at city-yield + unit-creation time in mc-economy.
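As an illustration of how such a record might be consumed, here is a small Python sketch. The real application happens in the Rust mc-economy crate and the schema is still TBD, so the helper name and the absence of any rounding are assumptions.

```python
import json

# Same fields as the example record above.
DIFFICULTY_JSON = """
{
  "id": "settler",
  "human_resource_mul": 1.0,
  "ai_resource_mul": 0.7,
  "ai_unit_xp_bonus": 0
}
"""


def apply_handicap(base_yield, is_ai, difficulty):
    """Scale a city yield by the multiplier for the owning player type."""
    key = "ai_resource_mul" if is_ai else "human_resource_mul"
    return base_yield * difficulty[key]


settler = json.loads(DIFFICULTY_JSON)
ai_yield = apply_handicap(10.0, is_ai=True, difficulty=settler)
human_yield = apply_handicap(10.0, is_ai=False, difficulty=settler)
```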

2. Policy temperature

For learned:* controllers, a temperature: f32 field on the controller config divides the logits before sampling:

softmax(logits / T)
  • T = 1.0 — base policy.
  • T > 1.0 — softer distribution, more random, easier.
  • T < 1.0 — sharper, near-greedy, harder.
  • T → 0 — argmax (deterministic).

Implementation: ~10 LOC in WasmAiController::decide_turn (apply scaling before the wasm guest samples, OR pass T through as a guest parameter and let the guest apply it). Stage 6.5 work.
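The temperature scaling itself is small enough to sketch. This Python version is for illustration only (the shipped path would be Rust or the wasm guest), and the function name and sample logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """softmax(logits / T); T <= 0 is treated as the greedy argmax limit."""
    if temperature <= 0.0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
base = softmax_with_temperature(logits, 1.0)
softer = softmax_with_temperature(logits, 1.5)   # more random, easier
sharper = softmax_with_temperature(logits, 0.3)  # near-greedy, harder
greedy = softmax_with_temperature(logits, 0.0)   # argmax
```

The probability of the top-ranked action rises monotonically as T falls, which is why a single trained policy can serve several difficulty tiers.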

| Difficulty | Controller | T | Handicap |
|---|---|---|---|
| Settler | scripted:peaceful | n/a | AI ×0.7 |
| Chieftain | scripted:default | n/a | none |
| Warlord | scripted:* rotating | n/a | none |
| King | learned:league-best | 1.5 | none |
| Champion | learned:league-best | 0.3 | AI ×1.3 |

Training infrastructure

Hosts

  • Edit host (mac): authoring; never trains.
  • Run host (apricot.lan): 64-core Threadripper, 94 GB RAM, 2×3090. All training runs here.
  • Plum: screenshot capture only; no training.

Layout

tooling/rl_self_play/
├── train.py              # PPO loop, sb3-contrib MaskablePPO
├── evaluate.py           # Hard win-rate measurement
├── magic_civ_env.py      # Gymnasium wrapper + reward shaping
├── encoders.py           # PlayerView ↔ obs/action tensors
├── harness_client.py     # JSON-Lines subprocess to Godot headless
├── models/<run-name>/    # best_model.zip per training run
└── runs/<run-name>/      # tensorboard event files

tooling/rl_self_play/models/ and runs/ are gitignored (bulky; not artifacts of the source repo).

Single-game training (duel)

ssh apricot.lan
cd ~/Code/@projects/@magic-civilization
python -m tooling.rl_self_play.train \
  --run-name duel-v1b \
  --total-steps 250000 \
  --num-envs 16 \
  --seed 7 \
  --device cuda:1

--num-envs N runs N parallel headless Godot subprocesses; Stable-Baselines3's SubprocVecEnv lock-steps them. Scaling is sub-linear because env-step is I/O-bound on JSON-Lines, not GPU-bound (the policy net is tiny). Past 16 envs per training run, returns diminish.

Parallel seed runs

Three independent seeds in parallel claim ~3 × 16 = 48 worker subprocesses on apricot. Memory headroom: each Godot headless ~600 MB, so ~30 GB total — fits inside 94 GB with margin for the OS + other services.
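The headroom estimate is plain arithmetic; the snippet below reproduces it from the figures quoted above (the ~600 MB per-process figure is the document's own estimate, not a fresh measurement).

```python
SEEDS = 3
ENVS_PER_SEED = 16
GODOT_HEADLESS_MB = 600   # approximate resident size per headless Godot
HOST_RAM_GB = 94

workers = SEEDS * ENVS_PER_SEED                   # 48 subprocesses
godot_gb = workers * GODOT_HEADLESS_MB / 1000     # 28.8 GB, i.e. "~30 GB"
headroom_gb = HOST_RAM_GB - godot_gb              # left for OS + services
```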

12-FFA self-play league (Stage 6.5)

python -m tooling.rl_self_play.train \
  --run-name league-gen1 \
  --map-type 12ffa-huge \
  --opponent-pool models/league/gen0/best_model.zip \
  --total-steps 1000000 \
  --num-envs 4 \
  --seed 7 \
  --device cuda:1

12-slot games are ~10× a duel in wall-clock per env, BUT GPU is not the bottleneck (the policy is a ~50k-param MLP). Verified on apricot 2026-05-18: 8 concurrent 12-FFA envs ≈ 5 GB RAM, ~12 cores, both GPUs at < 5% utilization. 1M steps ≈ 3.5h per league generation.


Save format & forward compatibility

Every save records the controller_id AND controller_hash per slot (SaveEnvelope v2). Loading a save with a controller the current install doesn't have yields a friendly error from save_manager.gd::_validate_controllers_after_load, not a crash mid-game.

Mod authors: never reuse a controller_id across incompatible weight versions. Bump the version suffix (learned:duel-v1c, not learned:duel-v1b) or saves from the old binary will mis-attribute to the new one.
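The load-time check can be pictured as follows. This is a Python sketch of the idea only: the real validation lives in save_manager.gd, and the record type, function name, and message wording here are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotController:
    """Per-slot controller record as stored in a save (hypothetical shape)."""
    slot: int
    controller_id: str
    controller_hash: str


def validate_controllers(saved_slots, installed):
    """Return human-readable problems; an empty list means the save can load.

    `installed` maps controller_id -> hash for the current install.
    """
    problems = []
    for rec in saved_slots:
        if rec.controller_id not in installed:
            problems.append(
                f"slot {rec.slot}: controller '{rec.controller_id}' is not installed")
        elif installed[rec.controller_id] != rec.controller_hash:
            problems.append(
                f"slot {rec.slot}: controller '{rec.controller_id}' has different weights")
    return problems


installed = {"scripted:default": "aaa", "learned:duel-v1b": "bbb"}
save = [
    SlotController(0, "scripted:default", "aaa"),
    SlotController(1, "learned:duel-v1b", "old"),   # same id, different weights
    SlotController(2, "learned:duel-v1c", "ccc"),   # not installed
]
issues = validate_controllers(save, installed)
```

The hash-mismatch branch is the case the version-suffix rule is meant to prevent: reusing an id for new weights turns a clear "not installed" error into a silent mis-attribution unless the hash is checked.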


Ship-then-improve

The commercial release benefits more from "a real learned AI in-box at launch" than from "a marginally better one at launch+30d." Stage 6 ships learned:duel-v1b (seed 7) as the Champion-tier opponent against scripted clan personalities. Stages 6.5–6.9 build the encoder rewrite, recurrent policy, AlphaZero search, multi-step actions, and self-play league as a post-launch content series, each of which slots into the existing controller-registry infrastructure without engine changes.

See ai-roadmap.md for the patch-by-patch narrative.


5-stage post-launch architecture roadmap

Engineering-side reference. Designer-facing narrative in ai-roadmap.md. Plan file: ~/.claude/plans/in-the-game-civilization-elegant-popcorn.md.

Stage 6.5 (v1.1) — Encoder rewrite + dynamic action space

Replace the 32-float hand-rolled observation with a multi-modal encoder:

  • Spatial block: 60×60×K float tensor; channels {own_unit, enemy_unit, own_city, enemy_city, biome_id, substrate_id, river, improvement_id, fog, explored, resource_present, ...}. K ≈ 16.
  • Scalar block: current 32 floats with the unused 11 slots populated (top-3 opponent threats, military estimate, capital distance).
  • Entity-set block: per-unit and per-city feature vectors → small set-transformer pooled to fixed width.

Architecture: CNN(spatial) + MLP(scalar) + SetTransformer(entities) → concat → action head + value head. ~5M params, WASM-shippable via tract.

Companion changes:

  • Dynamic action space: load CITY_QUEUE_ITEMS from public/games/age-of-dwarves/data/buildings.json + units.json at training start. Removes the 16-item hardcoding.
  • Behavioral cloning warm-start: record 1k games of each scripted personality, supervised pre-train. Cold-start to ~50% baseline policy in ~30 min.
  • Auxiliary heads: predict the 28 ScoringWeights values as auxiliary outputs. Free supervision signal.

Stage 6.6 (v1.2) — Recurrent policy + per-opponent memory

  • Switch to sb3-contrib RecurrentMaskablePPO. LSTM head (~128 hidden) between encoder and action head. Hidden state = session memory across turns.
  • Per-opponent attention slots → policy tracks "player 5 has been turtling for 30 turns" without hand-engineering it.
  • tract supports LSTM ops; WASM binary ~2× current.

Stage 6.7 (v1.3) — AlphaZero search at inference

The single highest-leverage change. Engine hooks already exist (audit above).

  • Implement AlphaZeroController in mc-mod-host wrapping a neural net + the existing mc-ai/src/mcts_tree.rs PUCT search.
  • Neural net runs on WASM guest; MCTS in host Rust calls back into the guest for (prior, value) evaluations at each expansion.
  • 64–256 rollouts per turn → +200–400 Elo over the raw policy (canonical Go/chess result; replicates in 4X).
  • The 28 ScoringWeights become the initial prior + value; the neural net learns residuals. Even an undertrained net plays at scripted strength immediately.
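For reference, the PUCT selection rule in its usual AlphaZero form looks like the sketch below. It is written in Python for illustration; the actual search is the Rust code in mc-ai/src/mcts_tree.rs, and the c_puct value, the dict layout, and the function names are assumptions rather than that crate's API.

```python
import math


def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.5):
    """Exploitation term Q plus a prior-weighted exploration bonus."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + exploration


def select_child(children, c_puct=1.5):
    """Index of the child with the highest PUCT score.

    Each child is a dict with 'q' (mean value), 'prior' (policy probability),
    and 'visits' (visit count).
    """
    parent_visits = max(1, sum(c["visits"] for c in children))
    scores = [
        puct_score(c["q"], c["prior"], parent_visits, c["visits"], c_puct)
        for c in children
    ]
    return max(range(len(children)), key=lambda i: scores[i])


children = [
    {"q": 0.1, "prior": 0.2, "visits": 10},  # explored, mildly good
    {"q": 0.0, "prior": 0.8, "visits": 0},   # unexplored, strongly suggested
]
chosen = select_child(children)
```

The prior is where a heuristic such as the ScoringWeights evaluator or a policy head would plug in: a high prior pulls search toward a move until enough visits accumulate for Q to dominate.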

Stage 6.8 (v1.3) — Multi-step movement & strategic actions

  • Expand per-unit action vocabulary beyond the 12 single-hex moves/attacks:
    • move_to(target_hex) — A* path planned by the simulator, executed multi-turn.
    • rally(target_hex) — set city/production-building rally point.
    • patrol(waypoints) — repeat-cycle scouting.
    • escort(unit_id) — move with a friendly unit.
  • Some of this already partially exists: the TacticalUnit.patrol_order field and the gdext set_rally request. The remaining work is plumbing that surfaces them in legal_actions + encoders.py.
  • Action space grows 322 → ~800; masking handles per-step legality.

Stage 6.9 (v1.4) — 12-FFA self-play league + specialist roster

See "Specialization via reward shaping" and "Difficulty system" sections above for the roster and ladder. League pipeline:

  1. Freeze whatever 6.5–6.8 produces as learned:league-gen0.
  2. Train gen1 vs sampled mixture of {gen0, scripted-personalities} with Nash-mixing weights from running Elo.
  3. Freeze gen1; train gen2 vs {gen0, gen1, scripted}. Repeat.
  4. Gen ≥ 5 → strong generalist. Round-robin tournament picks champion.
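One way to turn running Elo into an opponent-sampling mixture is sketched below. It uses the standard Elo expected-score and update formulas plus a softmax over ratings as a simple stand-in for the Nash-mixing weights; the stand-in, the K-factor, the temperature, and the pool ratings are all assumptions, not the planned implementation.

```python
import math
import random


def elo_expected(rating_a, rating_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating, expected, actual, k=32.0):
    """New rating after one game; `actual` is 1 for a win, 0 for a loss."""
    return rating + k * (actual - expected)


def pool_weights(elos, temperature=200.0):
    """Softmax over ratings: stronger opponents are sampled more often."""
    peak = max(elos.values())
    exps = {name: math.exp((r - peak) / temperature) for name, r in elos.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def sample_opponent(elos, rng):
    """Draw one opponent id from the pool according to pool_weights."""
    weights = pool_weights(elos)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Made-up ratings for a gen1 training pool.
pool = {"learned:league-gen0": 1500.0, "scripted:default": 1400.0,
        "scripted:warmonger": 1350.0}
weights = pool_weights(pool)
opponent = sample_opponent(pool, random.Random(7))
```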

Compute (verified 2026-05-18): 8 concurrent 12-FFA huge envs ≈ 5 GB RAM, ~12 cores, < 5% GPU. 1M steps ≈ 3.5h per generation. Gen0 → gen5 iterates in a workday.


Cross-references

  • Modder contract: docs/modding/ai-controller.md
  • ABI decisions: docs/modding/abi-decisions.md
  • Plan file: ~/.claude/plans/in-the-game-civilization-elegant-popcorn.md
  • AiController trait: src/simulator/crates/mc-player-api/src/controllers.rs
  • Reward shape: tooling/rl_self_play/magic_civ_env.py
  • Observation encoder: tooling/rl_self_play/encoders.py