diff --git a/.project/designs/richer-encoder-spec.md b/.project/designs/richer-encoder-spec.md
new file mode 100644
index 00000000..91643b11
--- /dev/null
+++ b/.project/designs/richer-encoder-spec.md
@@ -0,0 +1,75 @@
+# Richer Observation Encoder — Spec (clan-conditioned)
+
+> Owner decisions (2026-06-30): clan-CONDITIONED single model; **block on a richer
+> encoder before training**; land all Rust wiring first. This spec pins the new
+> observation contract so `tooling/rl_self_play/encoders.py` and
+> `src/simulator/crates/mc-player-api/src/learned/encoder.rs` land bit-exact in one
+> verified batch. See [[mc-ai-trained-not-scripted]] and
+> `.project/designs/trained-ai-per-clan-difficulty-plan.md`.
+
+## Why (the cap being removed)
+
+The current 32-float obs (`encode_observation`) is macro-only: it discards tech-tree
+progress, per-city territory/buildings/siege state, army health/experience, terrain,
+and civics — exactly the signals a competent 4X policy needs. This is the documented
+competence cap. The action space (`ACTION_DIM`, mask) is **unchanged** — only the
+observation grows.
+
+## Prerequisite (Rust, cargo-verifiable, do first)
+
+`PlayerView` exposes **no clan/personality**. Clan-conditioning needs it.
+1. Add `pub clan_index: i32` to `PlayerView` (view.rs) — the bound player's clan as a
+   stable 0..5 index (`-1` = unset/generalist), mapped from `PlayerState.clan_id`
+   string via a canonical order `[militarist, boom, expansionist, merchant,
+   tech_rusher, turtle]` (the order in `ai_personalities.json`).
+2. Project it in `projection.rs` where `PlayerView` is built. Generalist training/play
+   passes `-1` → all-zero one-hot.
+3. Mirror the field in the Python view dict (it already serializes from the same wire).
+
+## New observation layout (richer DENSE; a spatial/CNN obs stays a later v1.1 step)
+
+`OBS_DIM = 96`. asinh-normalized uniformly (unchanged contract), f64→f32, parity ≤1e-4.
+All counts/fractions sourced ONLY from the fog-aware `PlayerView` (no privileged state).
+
+| Block | Idx | Fields (source) |
+|---|---|---|
+| A self macro (keep) | 0:8 | gold, gold_per_turn, science_per_turn, score_estimate, city_count, unit_count, happiness_pool, culture_per_turn (resources/score) |
+| B economy aggregate | 8:16 | Σcity food, Σprod, Σscience, Σgold, Σculture yields; avg pop; Σowned_tiles (territory); golden_age_active (cities[].yields, cities[].owned_tiles, resources.golden_age_active) |
+| C tech/culture/civics | 16:24 | researched_count, tech_progress/tech_cost frac, available_count, tradition researched_count, tradition_progress frac, civics set-count (authority+labor+economy non-null), anarchy_turns, current_tech present (research/culture/civics) |
+| D military | 24:34 | total units, warriors, founders, Σhp/Σmax_hp (army health frac), avg experience, #promotion_available, #fortified, #shield_wall‖braced posture, Σequipped, avg movement_left/movement_max (units[]) |
+| E per-city (4×6) | 34:58 | per city c in cities[:4]: population, production yield, food_stored/threshold frac, hp/max_hp frac (siege), buildings count, owned_tiles count |
+| F terrain summary | 58:72 | #visible tiles, explored fraction (explored/total seen), #river tiles, #improvement tiles, #owned tiles, biome histogram over a fixed biome vocab (top ~8 biomes counts) (tiles[]) |
+| G diplomacy | 72:80 | #known opponents, #war, #peace, #open_borders, #shared_map, #trade_deals, #pending_envelopes, #ransom_offers (diplomacy/pending_events) |
+| H turn | 80:82 | turn number, turn/1000 progress |
+| I clan one-hot | 82:88 | 6-wide one-hot from `clan_index` (all-zero if -1 = generalist) |
+| — reserve | 88:96 | zero-pad headroom for next additions without a third contract break |
+
+Fixed biome vocab for block F (canonical order, append-only): the biome label ids in
+`public/games/age-of-dwarves/data/.../biomes` — pin the exact list in code, unknown
+biomes fold into an "other" bucket so the dim is stable.
+
+## Difficulty lever (Rust, independent, cargo-verifiable)
+
+Per-slot temperature: replace the single global `MC_LEARNED_TEMPERATURE` read
+(`dispatch.rs:1137`) with a per-slot value sourced from the difficulty tier on
+`players[slot]`. Test: two slots → different sampling streams from the same logits.
+
+## Controller wiring
+
+Single `learned:clan-v1` id; clan_index + temperature are *inputs*, not separate ids.
+`ident()` version string gains the trained-run tag for save/replay fingerprinting.
+
+## Parity + verification (on the fleet — no local cargo)
+
+1. Regenerate `learned_parity.rs` fixtures via the Python capture tool on real views
+   (now OBS_DIM=96 + clan).
+2. `cargo test -p mc-player-api learned_parity` → float-exact obs, bit-exact mask.
+3. Determinism: same seed twice → byte-identical action stream.
+4. `encoders.py` self-test: encode a captured view, assert shape 96 + asinh applied.
+
+## Landing order (one verified batch)
+
+1. `clan_index` on PlayerView + projection (Rust). 2. Rewrite `encoders.py` to 96-dim.
+3. Rewrite `encoder.rs::encode_observation` to match field-for-field. 4. Bump
+`inference.rs` input shape 32→96. 5. Per-slot temperature. 6. Regen fixtures +
+`cargo test learned_parity` + determinism, all green on a worker. 7. Commit.
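
Reviewer note (not part of the diff): the block layout and clan one-hot in the spec above can be sanity-checked with a minimal Python sketch. It is illustrative only and is not the contents of `encoders.py`. The names `BLOCKS`, `check_layout`, `clan_index`, `clan_one_hot`, and `assemble` are hypothetical, and applying asinh to every slot (including the one-hot and the reserve) is an assumption based on the spec's "asinh-normalized uniformly" wording.

```python
import math
from typing import Dict, List, Optional, Tuple

OBS_DIM = 96

# Half-open [start, end) slices copied from the layout table in the spec.
# The dictionary keys are hypothetical labels chosen for this sketch.
BLOCKS: Dict[str, Tuple[int, int]] = {
    "A_self_macro": (0, 8),
    "B_economy": (8, 16),
    "C_tech_culture_civics": (16, 24),
    "D_military": (24, 34),
    "E_per_city": (34, 58),
    "F_terrain": (58, 72),
    "G_diplomacy": (72, 80),
    "H_turn": (80, 82),
    "I_clan_one_hot": (82, 88),
    "reserve": (88, 96),
}

# Canonical clan order quoted in the spec's Prerequisite section.
CLAN_ORDER = ["militarist", "boom", "expansionist", "merchant", "tech_rusher", "turtle"]


def check_layout() -> None:
    """Assert that the blocks tile [0, OBS_DIM) with no gaps or overlaps."""
    cursor = 0
    for name, (start, end) in BLOCKS.items():
        assert start == cursor and end > start, name
        cursor = end
    assert cursor == OBS_DIM


def clan_index(clan_id: Optional[str]) -> int:
    """Map a clan id string to its 0..5 index, or -1 for unset/unknown."""
    try:
        return CLAN_ORDER.index(clan_id)
    except ValueError:
        return -1


def clan_one_hot(index: int) -> List[float]:
    """6-wide one-hot; index -1 matches no slot, so the result is all zeros."""
    return [1.0 if i == index else 0.0 for i in range(len(CLAN_ORDER))]


def assemble(raw_blocks: Dict[str, List[float]], clan_id: Optional[str]) -> List[float]:
    """Place raw block values at their offsets, zero-pad the rest, apply asinh."""
    obs = [0.0] * OBS_DIM
    for name, values in raw_blocks.items():
        start, end = BLOCKS[name]
        assert len(values) <= end - start, f"block {name} overflows its slice"
        obs[start:start + len(values)] = values
    start, end = BLOCKS["I_clan_one_hot"]
    obs[start:end] = clan_one_hot(clan_index(clan_id))
    # Assumption: asinh is applied to the whole vector, one-hot included.
    return [math.asinh(x) for x in obs]


if __name__ == "__main__":
    check_layout()
    obs = assemble({"A_self_macro": [100.0, 5.0]}, "merchant")
    print(len(obs), obs[82:88])
```

Keeping the offsets in one table and asserting that they tile the vector makes an off-by-one in either encoder fail before any parity fixture is compared. Short blocks, such as fewer than four cities in block E, stay zero-padded.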