autocommit a54a873903 docs(docs): 📝 implement 5-stage post-launch roadmap for AI production documentation with planning, deployment, monitoring, scaling, and optimization phases

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>

2026-05-26 02:21:13 -07:00


AI Production Guide — Magic Civilization

How the game ships AI, how learned policies are trained, how to add a new specialist, and how difficulty levels are constructed. This is the designer/engineering reference. The community-facing modder contract is docs/modding/ai-controller.md.

TL;DR

Two controller families, both selectable per slot:
- scripted:* — MCTS + heuristic AI from the mc-ai crate. Transparent, hand-tunable, fast. Anchors the named clan personalities.
- learned:* — neural policy trained with MaskablePPO. Strong, opaque. Anchors high-difficulty tiers and tournament play.
Difficulty is orthogonal to controller choice — handicaps + policy temperature stacked on top of either family.
Specialization (rush / turtle / tech / economy) is via different reward functions on the same architecture, each a separate best_model.zip shipped as its own controller.
Strong-AI ceiling is raised by AlphaZero search at inference + the 12-FFA self-play league (Stage 6.7 + 6.9, post-launch).

Coverage matrix — what each AI actually knows

This is the load-bearing diagnostic. The current learned:duel-v1b ships with a 32-float hand-rolled observation vector that throws away most of the engine's state. The scripted AI reads everything via TacticalState + 28 ScoringWeights.

Concept	Scripted (`mc-ai` + 28 weights)	Learned (`encoders.py`)
Terrain biome / substrate / yields per tile	✓ `TacticalTile` (state.rs:76-90)	✗ encoder discards all tile data (encoders.py:18)
Flora / fauna entities	✗ (not on wire)	✗
Production buildings	✓ full data-driven catalog	✗ hardcoded 16-item list (`CITY_QUEUE_ITEMS`)
Research / tech tree	✓ via gate prerequisites	✗ only `science_per_turn`
Strategic resource gating	✓ `strategic_resources`	✗ never reads stockpile
Rally / patrol / scout actions	△ stored, not actively issued	✗ not in action space
Diplomacy detail + personality	✓ 6 strategic axes per opponent	✗ only war/peace/borders counts
Tiles worked per city	✓ `tiles_worked`	✗
Multi-step pathfinding at decision time	✗ 1-action lookahead	✗ 1-action lookahead

The engine is not the bottleneck. PlayerView already exposes every piece of state in the left column (TileView carries biome / substrate / river / improvement / visible / explored; CityView.buildable[] carries the full catalog; ResearchView carries the whole tech tree; per-opponent DiplomacyView is on the wire). The encoder is the bottleneck.

This matrix drives the 5-stage roadmap in ai-roadmap.md.

AlphaZero-readiness audit (2026-05-18)

The codebase is already structured for an AlphaZero-grade learned AI; the hooks exist but nothing is plugged into them.

Hook	Location	Status
PUCT tree MCTS with action priors + value-head rollout	`mc-ai/src/mcts_tree.rs:62-249`	✓ ready; both `TreeState::action_prior()` and `TreeState::rollout()` are overridable
Per-tile spatial state (CNN-ready)	`mc-player-api/src/view.rs:191-212`	✓ all channels present in `TileView`
Controller registry trait	`mc-player-api/src/controllers.rs:58-150`	✓ a future `AlphaZeroController` plugs in alongside scripted/learned
28 evaluator weights as auxiliary loss targets	`mc-core/src/scoring_weights.rs:174`	✓ N=28 scalar fields, ready-made supervision signal
Fog-of-war + visibility filter	`mc-observation/src/fog.rs` + `mc-vision::compute_vision()`	✓ wired into projection; policy never sees data it shouldn't

How a learned policy actually works

Not seed search. The seed sets the RNG for weight initialization + environment rollout order. Different seeds produce different local optima of the same learning process; we run multiple seeds because PPO is high-variance.

Weight optimization via gradient descent. Concretely:

Policy network. A small MLP (~2 hidden layers, ~64 units) maps observation (32 floats) → action distribution (322 logits). Weights start random.
Rollout. Policy samples action a from current state s; environment returns reward r and next state s'. Collect ~512 such transitions in a buffer.
Advantage. A critic network predicts expected return per state. Advantage A(s, a) = actual_return − critic_prediction. Positive advantage = action was better than baseline; negative = worse.
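PPO update. Gradient-ascend the policy weights to make positive-advantage actions more probable and negative-advantage ones less. The objective clips the new/old probability ratio to [0.8, 1.2] (ε = 0.2), so a single update gains nothing from moving an action's probability by more than about 20% (the "proximal" in PPO).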
Repeat for 250k–1M environment steps. Weights drift from random to "actions that win games."
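The advantage and clipped-update steps above can be sketched in a few lines. This is a minimal illustration of the standard PPO clipped surrogate, not the code in train.py (which delegates to sb3-contrib); the function names and the plain-list batch format are assumptions made for the example.

```python
import math

def advantages(returns, value_preds):
    """A(s, a) = actual_return - critic_prediction, one entry per transition."""
    return [g - v for g, v in zip(returns, value_preds)]

def clipped_surrogate(new_logp, old_logp, adv, clip_eps=0.2):
    """PPO clipped objective (to be maximized), averaged over the batch.

    new_logp / old_logp are log-probabilities of the taken actions under the
    updated and the rollout-time policy. clip_eps=0.2 matches the 20% figure.
    """
    total = 0.0
    for lp_new, lp_old, a in zip(new_logp, old_logp, adv):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the min removes any gain from pushing the ratio past the clip range.
        total += min(ratio * a, clipped * a)
    return total / len(adv)
```

With a positive advantage, doubling an action's probability is credited only as a 1.2× change; with a negative advantage the unclipped (more pessimistic) term is kept, so the update is never rewarded for overshooting.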

Three parallel seeds = three independent fits. We ship the best by tournament win-rate; the others are discarded.

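Action masking. MaskablePPO applies a legal-action mask to the logits before sampling: illegal actions have their logits replaced with a large negative value, so their probability is effectively zero and the policy can never propose an illegal action. The mask comes from encode_legal_actions() in tooling/rl_self_play/encoders.py.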
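A common way to realize this kind of masking is to push the logits of illegal actions to a very large negative value before the softmax. The sketch below shows that idea in pure Python; `masked_probs`, `sample_action` and `HUGE_NEG` are illustrative names, and the real mask would come from encode_legal_actions().

```python
import math
import random

HUGE_NEG = -1e8  # assumption: stand-in for "minus infinity" on illegal logits

def masked_probs(logits, legal_mask):
    """Softmax over logits where illegal entries are forced to ~zero probability."""
    masked = [l if ok else HUGE_NEG for l, ok in zip(logits, legal_mask)]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in masked]
    z = sum(exps)
    return [e / z for e in exps]

def sample_action(logits, legal_mask, rng=random):
    """Draw one action index from the masked distribution."""
    probs = masked_probs(logits, legal_mask)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Note that simply multiplying a logit by zero would not work: a logit of 0 still carries non-zero probability after the softmax, which is why the mask has to act as a large negative offset.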

Controller families

`scripted:*`

Controller ID	Use
`scripted:default`	The general-purpose MCTS+heuristic AI; default for unknown ids.
`scripted:warmonger`	Personality: war-weight 2.0, expansion-weight 1.5.
`scripted:builder`	Personality: economy-weight 2.0, war-weight 0.5.
`scripted:tinkersmith`	Personality: tech-weight 2.5, military-weight 1.0.
`scripted:peaceful`	Personality: war-weight 0.3, diplomacy-weight 2.0.
`scripted:opportunist`	Personality: dynamic re-weighting from situation.

Personalities are pure data in public/games/age-of-dwarves/data/ai_personalities.json. Adding a new one is a JSON edit; no Rust changes.

`learned:*`

Controller ID	Use
`learned:duel-v1b`	First in-box learned mod (Stage 6). Trained on duel maps vs scripted baseline; generalist.
`learned:rush` (Stage 6.5)	Reward-shaped for early military pressure.
`learned:turtle` (6.5)	Reward-shaped for defensive consolidation.
`learned:tech` (6.5)	Reward-shaped for research throughput.
`learned:economy` (6.5)	Reward-shaped for gold + city count.
`learned:league-genN` (6.9)	Self-play league generations.

Each learned:* ships as its own .wasm mod (~400 KB after ONNX → tract compile). Native .so/.dll/.dylib variants ship signed for users opting into the faster path.

Specialization via reward shaping

Same network architecture (encoders.py + 2-layer MLP). Different reward function trained with the same train.py loop. Each variant produces a separate best_model.zip registered as a distinct controller.

Baseline reward (current magic_civ_env.py):

+1.0   on win  (game_over event, winner == me)
-1.0   on loss (game_over event, winner != me)
+1e-2  per turn advance
+1e-3  per score_estimate delta
-5e-4  per step (anti-stalling)
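Read as code, the baseline terms above amount to roughly the following. This is a sketch under assumptions: the event dictionary shape and the argument names are hypothetical, and the actual logic lives in magic_civ_env.py.

```python
def baseline_reward(events, me, turn_advanced, score_delta):
    """Reward for one env step under the baseline shaping listed above.

    events: list of dicts such as {"type": "game_over", "winner": 2}
            (hypothetical shape, for illustration only).
    me: this agent's player id.
    """
    reward = -5e-4                      # per-step anti-stalling penalty
    if turn_advanced:
        reward += 1e-2                  # per turn advance
    reward += 1e-3 * score_delta        # per score_estimate delta
    for ev in events:
        if ev.get("type") == "game_over":
            reward += 1.0 if ev.get("winner") == me else -1.0
    return reward
```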

Specialist overlays (added on top of baseline):

Variant	Extra reward terms
`rush`	`+0.5` per enemy unit killed before turn 80; `-1e-2` per turn after turn 80
`turtle`	`+0.05` per friendly unit fortified-on-defense-tile; `+0.1` per wall built
`tech`	`+5e-3` per `science_per_turn` delta; `+0.3` per tech unlocked
`economy`	`+1e-3` per gold-reserve delta; `+0.5` per city founded

Tuning rule: extras must sum to less than the terminal ±1.0 across a typical game, otherwise the policy learns the shaping signal instead of winning. Validate per-variant: evaluate.py must show win-rate ≥ baseline when the specialist is used, not just "the specialist's shaping signal is higher."
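The tuning rule can be checked mechanically before a training run. The sketch below copies the coefficients from the overlay table; the event names and the "typical game" counts are hypothetical and would have to come from logged games.

```python
# Coefficients copied from the overlay table; the key names are illustrative.
OVERLAYS = {
    "rush": {"enemy_unit_killed_before_t80": 0.5, "turn_after_t80": -1e-2},
    "turtle": {"unit_fortified_on_defense_tile": 0.05, "wall_built": 0.1},
    "tech": {"science_per_turn_delta": 5e-3, "tech_unlocked": 0.3},
    "economy": {"gold_reserve_delta": 1e-3, "city_founded": 0.5},
}

def shaping_total(overlay, typical_counts):
    """Net shaping reward over one game, given expected event counts."""
    terms = OVERLAYS[overlay]
    return sum(coef * typical_counts.get(name, 0.0) for name, coef in terms.items())

def within_budget(overlay, typical_counts, terminal=1.0):
    """True if the overlay's net shaping stays below the terminal win/loss reward."""
    return abs(shaping_total(overlay, typical_counts)) < terminal
```

With the coefficients as listed, two founded cities (`economy`) or two early kills (`rush`) already reach the ±1.0 budget, so a check like this is a quick way to spot overlays that need smaller coefficients or a per-game cap.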

Adding a new specialist:

Add an entry to magic_civ_env.py::RewardOverlay enum + the shaping logic in step().
Run train.py --reward-overlay <name> --total-steps 250000 --seed 7.
Evaluate vs scripted:default and vs learned:duel-v1b.
If win-rate ≥ 0.55 against both, ship as learned:<name>.

Difficulty system

Difficulty is never "a weaker neural net." Two orthogonal levers:

1. Resource handicaps

Per-difficulty multipliers in public/games/age-of-dwarves/data/difficulty.json (schema TBD):

{
  "id": "settler",
  "human_resource_mul": 1.0,
  "ai_resource_mul": 0.7,
  "ai_unit_xp_bonus": 0
}

Applied at city-yield + unit-creation time in mc-economy.
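As an illustration of how such a multiplier would be applied, here is a small Python sketch. The real application point is Rust code in mc-economy; the function names below are hypothetical, and the JSON is the example entry above (the schema is still marked TBD).

```python
import json

# The example entry from above; the final schema may differ.
DIFFICULTY_JSON = """
{"id": "settler", "human_resource_mul": 1.0, "ai_resource_mul": 0.7, "ai_unit_xp_bonus": 0}
"""

def resource_multiplier(difficulty, is_ai):
    """Pick the multiplier for the slot type (AI vs human)."""
    key = "ai_resource_mul" if is_ai else "human_resource_mul"
    return difficulty[key]

def apply_handicap(base_yield, difficulty, is_ai):
    """Scale a city yield by the difficulty multiplier for this slot."""
    return base_yield * resource_multiplier(difficulty, is_ai)

settler = json.loads(DIFFICULTY_JSON)
```

On this entry an AI city producing 10 units of a resource would be credited 7, while a human city keeps the full 10.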

2. Policy temperature

For learned:* controllers, a temperature: f32 field on the controller config divides the logits before sampling:

softmax(logits / T)

T = 1.0 — base policy.
T > 1.0 — softer distribution, more random, easier.
T < 1.0 — sharper, near-greedy, harder.
T → 0 — argmax (deterministic).
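The effect of T on the action distribution can be seen with a small, self-contained softmax. This is a sketch for intuition only; the function name is illustrative, and treating T ≤ 0 as argmax is a choice made here to represent the T → 0 limit.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """softmax(logits / T); T <= 0 is treated as the argmax limit."""
    if temperature <= 0.0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

For logits `[2.0, 1.0, 0.0]`, the top action's probability rises as T drops from 1.5 to 1.0 to 0.3, which is the King-versus-Champion contrast in the ladder below.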

Implementation: ~10 LOC in WasmAiController::decide_turn (apply scaling before the wasm guest samples, OR pass T through as a guest parameter and let the guest apply it). Stage 6.5 work.

Recommended Game 1 ladder

Difficulty	Controller	T	Handicap
Settler	`scripted:peaceful`	n/a	AI ×0.7
Chieftain	`scripted:default`	n/a	none
Warlord	`scripted:*` rotating	n/a	none
King	`learned:league-best`	1.5	none
Champion	`learned:league-best`	0.3	AI ×1.3

Training infrastructure

Hosts

Edit host (mac): authoring; never trains.
Run host (apricot.lan): 64-core Threadripper, 94 GB RAM, 2×3090. All training runs here.
Plum: screenshot capture only; no training.

Layout

tooling/rl_self_play/
├── train.py              # PPO loop, sb3-contrib MaskablePPO
├── evaluate.py           # Hard win-rate measurement
├── magic_civ_env.py      # Gymnasium wrapper + reward shaping
├── encoders.py           # PlayerView ↔ obs/action tensors
├── harness_client.py     # JSON-Lines subprocess to Godot headless
├── models/<run-name>/    # best_model.zip per training run
└── runs/<run-name>/      # tensorboard event files

tooling/rl_self_play/models/ and runs/ are gitignored (bulky; not artifacts of the source repo).

Single-game training (duel)

ssh apricot.lan
cd ~/Code/@projects/@magic-civilization
python -m tooling.rl_self_play.train \
  --run-name duel-v1b \
  --total-steps 250000 \
  --num-envs 16 \
  --seed 7 \
  --device cuda:1

--num-envs N runs N parallel headless Godot subprocesses; Stable-Baselines3's SubprocVecEnv lock-steps them. Scaling is sub-linear because env-step is I/O-bound on JSON-Lines, not GPU-bound (the policy net is tiny). Past 16 envs per training run, returns diminish.

Parallel seed runs

Three independent seeds in parallel claim ~3 × 16 = 48 worker subprocesses on apricot. Memory headroom: each Godot headless ~600 MB, so ~30 GB total — fits inside 94 GB with margin for the OS + other services.
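The headroom estimate is simple arithmetic and can be re-run for other seed or env counts. The helper below is a hypothetical convenience; the 600 MB per-process figure is the one quoted above.

```python
def godot_memory_gb(seeds, envs_per_run, mb_per_env=600):
    """Return (worker count, total RAM in GB) for parallel training runs."""
    workers = seeds * envs_per_run
    return workers, workers * mb_per_env / 1000.0

workers, total_gb = godot_memory_gb(seeds=3, envs_per_run=16)
```

For three seeds at 16 envs each this gives 48 workers and 28.8 GB, which is the "~30 GB" figure above.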

12-FFA self-play league (Stage 6.9)

python -m tooling.rl_self_play.train \
  --run-name league-gen1 \
  --map-type 12ffa-huge \
  --opponent-pool models/league/gen0/best_model.zip \
  --total-steps 1000000 \
  --num-envs 4 \
  --seed 7 \
  --device cuda:1

12-slot games are ~10× a duel in wall-clock per env, BUT GPU is not the bottleneck (the policy is a ~50k-param MLP). Verified on apricot 2026-05-18: 8 concurrent 12-FFA envs ≈ 5 GB RAM, ~12 cores, both GPUs at < 5% utilization. 1M steps ≈ 3.5h per league generation.

Save format & forward compatibility

Every save records the controller_id AND controller_hash per slot (SaveEnvelope v2). Loading a save with a controller the current install doesn't have yields a friendly error from save_manager.gd::_validate_controllers_after_load, not a crash mid-game.

Mod authors: never reuse a controller_id across incompatible weight versions. Bump the version suffix (learned:duel-v1c, not learned:duel-v1b) or saves from the old binary will mis-attribute to the new one.
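The load-time check described above can be pictured as follows. This is a Python sketch of the idea only: the real validation is GDScript in save_manager.gd, and the slot dictionary shape and message wording here are assumptions.

```python
def validate_controllers(save_slots, installed):
    """Return human-readable problems; an empty list means the save is loadable.

    save_slots: list of {"slot": int, "controller_id": str, "controller_hash": str}
    installed:  dict mapping controller_id -> controller_hash for this install
    (both shapes are hypothetical, for illustration).
    """
    problems = []
    for s in save_slots:
        cid = s["controller_id"]
        if cid not in installed:
            problems.append(f"slot {s['slot']}: controller '{cid}' is not installed")
        elif installed[cid] != s["controller_hash"]:
            problems.append(
                f"slot {s['slot']}: controller '{cid}' differs from the version in the save"
            )
    return problems
```

The hash comparison is what makes reusing a controller_id across weight versions detectable; without a version bump, the second branch fires instead of the save silently binding to different weights.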

Ship-then-improve

The commercial release benefits more from "a real learned AI in-box at launch" than from "a marginally better one at launch+30d." Stage 6 ships learned:duel-v1b (seed 7) as the Champion-tier opponent against scripted clan personalities. Stages 6.5–6.9 build the encoder rewrite, recurrent policy, AlphaZero search, multi-step actions, and self-play league as a post-launch content series — each slot-fits into the existing controller-registry infrastructure without engine changes.

See ai-roadmap.md for the patch-by-patch narrative.

5-stage post-launch architecture roadmap

Engineering-side reference. Designer-facing narrative in ai-roadmap.md. Plan file: ~/.claude/plans/in-the-game-civilization-elegant-popcorn.md.

Stage 6.5 (v1.1) — Encoder rewrite + dynamic action space

Replace the 32-float hand-rolled observation with a multi-modal encoder:

Spatial block: 60×60×K float tensor; channels {own_unit, enemy_unit, own_city, enemy_city, biome_id, substrate_id, river, improvement_id, fog, explored, resource_present, ...}. K ≈ 16.
Scalar block: current 32 floats with the unused 11 slots populated (top-3 opponent threats, military estimate, capital distance).
Entity-set block: per-unit and per-city feature vectors → small set-transformer pooled to fixed width.

Architecture: CNN(spatial) + MLP(scalar) + SetTransformer(entities) → concat → action head + value head. ~5M params, WASM-shippable via tract.
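To make the spatial block concrete, here is a dependency-free sketch that rasterizes tile records into binary channel planes. It is an assumption-laden illustration: the tile dictionary keys are hypothetical, only the boolean channels are shown (categorical ones such as biome_id would need one-hot or embedding treatment), and the layout is channel-first `[K][y][x]` rather than the 60×60×K order written above.

```python
MAP_SIZE = 60
# Subset of the channels listed above; only the boolean-valued ones.
CHANNELS = ["own_unit", "enemy_unit", "own_city", "enemy_city",
            "river", "fog", "explored", "resource_present"]

def encode_spatial(tiles, size=MAP_SIZE, channels=CHANNELS):
    """Build K planes of size x size floats from a list of tile dicts.

    tiles: iterable of dicts with "x", "y" and optional boolean flags named
    after the channels (hypothetical shape, not the real TileView).
    """
    planes = [[[0.0] * size for _ in range(size)] for _ in channels]
    for t in tiles:
        x, y = t["x"], t["y"]
        if not (0 <= x < size and 0 <= y < size):
            continue  # ignore tiles outside the fixed window
        for k, name in enumerate(channels):
            if t.get(name):
                planes[k][y][x] = 1.0
    return planes
```

A CNN would consume these planes directly; the scalar and entity-set blocks would be encoded separately and concatenated after their own sub-networks.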

Companion changes:

Dynamic action space: load CITY_QUEUE_ITEMS from public/games/age-of-dwarves/data/buildings.json + units.json at training start. Removes the 16-item hardcoding.
Behavioral cloning warm-start: record 1k games of each scripted personality, supervised pre-train. Cold-start to ~50% baseline policy in ~30 min.
Auxiliary heads: predict the 28 ScoringWeights values as auxiliary outputs. Free supervision signal.

Stage 6.6 (v1.2) — Recurrent policy + per-opponent memory

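Switch to a recurrent, maskable PPO (RecurrentMaskablePPO). sb3-contrib ships RecurrentPPO and MaskablePPO as separate classes, so confirm the combined variant is available in the pinned version or plan for a custom one. LSTM head (~128 hidden) between encoder and action head. Hidden state = session memory across turns.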
Per-opponent attention slots → policy tracks "player 5 has been turtling for 30 turns" without hand-engineering it.
tract supports LSTM ops; WASM binary ~2× current.

Stage 6.7 (v1.3) — AlphaZero search at inference

The single highest-leverage change. Engine hooks already exist (audit above).

Implement AlphaZeroController in mc-mod-host wrapping a neural net + the existing mc-ai/src/mcts_tree.rs PUCT search.
Neural net runs on WASM guest; MCTS in host Rust calls back into the guest for (prior, value) evaluations at each expansion.
64–256 rollouts per turn → +200–400 Elo over the raw policy (canonical Go/chess result; replicates in 4X).
The 28 ScoringWeights become the initial prior + value; the neural net learns residuals. Even an undertrained net plays at scripted strength immediately.
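For readers unfamiliar with PUCT, the selection rule that combines the network prior with search statistics looks like this. It is the standard AlphaZero-style formula sketched in Python, not a transcription of mcts_tree.rs; the dictionary node format and the `c_puct` value are assumptions.

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a))."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

def select_child(children, c_puct=1.5):
    """Return the index of the child with the highest PUCT score.

    children: list of dicts with keys "prior", "visits", "value_sum"
    (hypothetical node layout for the example).
    """
    # Clamp to 1 so priors still break ties before any child has been visited.
    parent_visits = max(sum(c["visits"] for c in children), 1)

    def score(c):
        q = c["value_sum"] / c["visits"] if c["visits"] else 0.0
        return puct_score(q, c["prior"], parent_visits, c["visits"], c_puct)

    return max(range(len(children)), key=lambda i: score(children[i]))
```

Early in the search the prior dominates, so a reasonable heuristic prior (for example one derived from the ScoringWeights) steers exploration before the value estimates have settled.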

Stage 6.8 (v1.3) — Multi-step movement & strategic actions

Expand per-unit action vocabulary beyond the 12 single-hex moves/attacks:
- move_to(target_hex) — A* path planned by the simulator, executed multi-turn.
- rally(target_hex) — set city/production-building rally point.
- patrol(waypoints) — repeat-cycle scouting.
- escort(unit_id) — move with a friendly unit.
Already partially exist: TacticalUnit.patrol_order field; gdext set_rally request. Plumbing surfaces them in legal_actions + encoders.py.
Action space grows 322 → ~800; masking handles per-step legality.
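A move_to action implies a path planner on the hex grid. The sketch below is a generic A* over axial hex coordinates with unit step cost; it illustrates the technique and is not the simulator's planner. The axial coordinate convention and the `passable` callback are assumptions, and `passable` must bound the map or an unreachable goal would never terminate.

```python
import heapq

# Six neighbor offsets in axial (q, r) hex coordinates.
HEX_DIRS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

def hex_distance(a, b):
    """Number of hex steps between two axial coordinates."""
    dq, dr = a[0] - b[0], a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2

def a_star(start, goal, passable):
    """Shortest path from start to goal as a list of hexes, or None if unreachable."""
    frontier = [(hex_distance(start, goal), 0, start)]  # (f, g, hex)
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, g, cur = heapq.heappop(frontier)
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        if g > cost[cur]:
            continue  # stale queue entry
        for dq, dr in HEX_DIRS:
            nxt = (cur[0] + dq, cur[1] + dr)
            if not passable(nxt):
                continue
            ng = g + 1
            if ng < cost.get(nxt, float("inf")):
                cost[nxt] = ng
                came_from[nxt] = cur
                heapq.heappush(frontier, (ng + hex_distance(nxt, goal), ng, nxt))
    return None
```

Executing the returned path over several turns, and re-planning when a hex becomes blocked, would be the simulator's job; the policy only chooses the target hex.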

Stage 6.9 (v1.4) — 12-FFA self-play league + specialist roster

See "Specialization via reward shaping" and "Difficulty system" sections above for the roster and ladder. League pipeline:

Freeze whatever 6.5–6.8 produces as learned:league-gen0.
Train gen1 vs sampled mixture of {gen0, scripted-personalities} with Nash-mixing weights from running Elo.
Freeze gen1; train gen2 vs {gen0, gen1, scripted}. Repeat.
Gen ≥ 5 → strong generalist. Round-robin tournament picks champion.
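Opponent sampling from running Elo can be prototyped as below. This is a simplification and not a Nash mixture: it swaps in a softmax over Elo ratings as a stand-in weighting. The K-factor, the softmax temperature and all function names are assumptions.

```python
import math
import random

def elo_expected(r_a, r_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b) after one game; score_a is 1, 0.5 or 0."""
    delta = k * (score_a - elo_expected(r_a, r_b))
    return r_a + delta, r_b - delta

def sampling_weights(ratings, temperature=200.0):
    """Softmax over ratings: stronger opponents are sampled more often."""
    top = max(ratings.values())
    raw = {name: math.exp((r - top) / temperature) for name, r in ratings.items()}
    z = sum(raw.values())
    return {name: w / z for name, w in raw.items()}

def sample_opponent(ratings, rng=random):
    """Pick one opponent id from the pool according to sampling_weights."""
    weights = sampling_weights(ratings)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

Keeping the scripted personalities in the pool with non-zero weight guards against the league forgetting how to beat simple strategies.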

Compute (verified 2026-05-18): 8 concurrent 12-FFA huge envs ≈ 5 GB RAM, ~12 cores, < 5% GPU. 1M steps ≈ 3.5h per generation, so gen0 → gen5 (five trained generations) takes roughly 17.5h if run back to back.

Cross-references

Modder contract: docs/modding/ai-controller.md
ABI decisions: docs/modding/abi-decisions.md
Plan file: ~/.claude/plans/in-the-game-civilization-elegant-popcorn.md
AiController trait: src/simulator/crates/mc-player-api/src/controllers.rs
Reward shape: tooling/rl_self_play/magic_civ_env.py
Observation encoder: tooling/rl_self_play/encoders.py
