diff --git a/.project/objectives/p1-29e-rl-divergence-mining.md b/.project/objectives/p1-29e-rl-divergence-mining.md
new file mode 100644
index 00000000..e56371f9
--- /dev/null
+++ b/.project/objectives/p1-29e-rl-divergence-mining.md
@@ -0,0 +1,130 @@
+---
+id: p1-29e-rl-divergence-mining
+title: "RL-policy divergence mining → sole-city economy break-out (production, not science)"
+priority: p1
+status: partial
+scope: game1
+owner: warcouncil
+tags: [ai, rl, balance, production, tech]
+updated_at: 2026-05-27
+relates_to: [p1-29c, p1-29d]
+evidence:
+  - "tooling/rl_self_play/mine_divergence.py — reusable RL-vs-scripted divergence miner (scripted-vs-scripted trajectory + policy probe on identical PlayerView)"
+  - "tooling/rl_self_play/runs/duel-v4-encfix-s7/divergence_slot1.json — 600 probed decisions, 10 seeds"
+  - "src/simulator/crates/mc-ai/src/tactical/production.rs — SOLE_CITY_ECON_* consts + step-0 economy break-out in pick_for_city; 4 new unit tests (cargo test -p mc-ai --lib: 265 pass)"
+---
+
+## What this is
+
+Dispatch task: mine the trained MaskablePPO policy `duel-v4-encfix-s7`
+(+5.5 mean reward vs the scripted MCTS baseline) for behavioural diffs vs
+the scripted AI, and turn them into concrete heuristic-patch candidates
+targeting the two in-flight blockers p1-29c (sole-city research path) and
+p1-29d (trailing-AI survival before T100).
+
+## Method
+
+`tooling/rl_self_play/mine_divergence.py` drives a scripted-vs-scripted
+duel (both slots on the built-in `suggest()` chain — the same trajectory
+p1-29c/29d measure against) and, at the start of every turn, probes the
+trained policy on the **identical** `PlayerView` without applying its
+choice. Divergences are logged and bucketed by board state. 600 decisions
+across seeds 1–10, probing the trailing slot.
+
+## Findings
+
+### F1 — The policy's action space cannot pick research at all
+`tooling/rl_self_play/encoders.py` exposes only `end_turn`, per-unit micro
+(skip / fortify / sentry / found_city / move / attack) and per-city
+`queue_production` over a 16-item building roster. **There is no research
+action.** p1-29c's entire "research-priority uplift" framing is orthogonal
+to anything the winning policy could even do. The only economy lever the
+policy has is the build queue.
+
+### F2 — The winning policy's economy lever is PRODUCTION, never science
+Across all 600 probed decisions and all four board-state buckets, the
+policy's build preference is exclusively **`forge` (production) + `warrior`
+(military)**. It chose a science/culture building (`library`,
+`chronicle_tower`, `high_guild_hall`) **zero times**. This survives the
+argmax-ordering confound: ordering noise can swap which action shows first,
+it cannot manufacture a complete absence across 600 decisions. The winning
+play is a production multiplier, not science infrastructure.
+
+### F3 — The trailing scripted AI builds ZERO buildings (root cause)
+Apricot batch `20260516_183534` (the p1-29d 0/10 baseline), P1 per-seed:
+
+| seed | end turn | outcome  | P0 tp / cities / buildings | P1 tp / cities / **buildings (max ever)** |
+|------|----------|----------|----------------------------|-------------------------------------------|
+| 1    | 63       | victory  | 2 / 2 / 3                  | 1 / 0 / **0**                             |
+| 2    | 44       | victory  | 2 / 3 / 6                  | 1 / 0 / **0**                             |
+| 3    | 152      | victory  | 6 / 2 / 43                 | 1 / 0 / **0**                             |
+| 4    | 100      | victory  | 2 / 2 / 3                  | 1 / 0 / **0**                             |
+| 5    | 300      | victory  | 3 / 3 / 38                 | 1 / 1 / **0** (alive, never developed)    |
+| 6    | 76       | victory  | 2 / 2 / 3                  | 1 / 0 / **0**                             |
+| 7    | 203      | victory  | 7 / 3 / 21                 | 1 / 0 / **0**                             |
+| 8    | 66       | victory  | 2 / 2 / 3                  | 1 / 0 / **0**                             |
+| 9    | 278      | in_prog. | 10 / 3 / 56                | 1 / 1 / **0** (alive, never developed)    |
+| 10   | 57       | victory  | 2 / 2 / 3                  | 1 / 0 / **0**                             |
+
+P1 builds **zero buildings in 10/10 seeds**, including the two seeds where
+it survives to T278/T300 with a standing city. With no economy buildings it
+cannot reach `tier_peak ≥ 2`.
+
+### F4 — The cause is a structural short-circuit, and the prior science uplift is dead code
+In `pick_for_city`, `classify_posture(threatened=true, …)` returns
+`Posture::Threatened`, and **step 1 returns a melee unit every turn** —
+before the early-mil floor, before expansion, and before the step-7
+building scorer. A trailing sole-city AI is *always* threatened
+(`enemy_mil_max > own_mil`), so it loops on military forever. Consequently
+the p1-29b/29c `sole_city_threatened` science uplift in `score_building`
+(step 7) **was never reached** in its target regime — it is dead code. No
+prior intervention actually changed the ladder ordering: p1-29c tuned
+action priors (`action_prior_with_context`), p1-29d tuned combat damage
+(`solo_city_grace`); neither let the trailing AI build economy.
+
+## Ranked rule candidates
+
+1. **(IMPLEMENTED) Sole-city economy break-out.** When `sole_city_threatened`
+   and a minimal defender floor is met and the city holds fewer than N
+   buildings, interject **one production-category building** before the
+   step-1 military short-circuit. Directly closes the F3 gap with the F2
+   lever (production, not science). Lowest-risk: gated entirely on
+   `sole_city_threatened`, so no other player's ladder changes.
+2. **Drop / repurpose the dead science uplift.** The `sole_city_threatened`
+   1.5× science multiplier in `score_building` is unreachable; either remove
+   it (tech-debt) or fold it into the break-out's posture so the break-out
+   can prefer science once production is covered. Deferred — F2 says the
+   winning lever is production, so science-first is contraindicated.
+3. **Production-scale handicap (out of scope here).** p1-29d already
+   concluded a 1-city AI cannot out-build a 3-city opponent on volume; if the
+   break-out is insufficient, the next lever is mechanical (free defender on
+   threat) or map-placement, per p1-29d's closing notes — not a priors tweak.
+
+## Implemented patch
+
+`src/simulator/crates/mc-ai/src/tactical/production.rs`:
+- `SOLE_CITY_ECON_MIN_DEFENDERS = 2`, `SOLE_CITY_ECON_TARGET = 2` consts.
+- New step-0 in `pick_for_city`: for a threatened sole-city AI with ≥2
+  defenders and <2 buildings, return the best production/infrastructure
+  building from the catalog (catalog-driven, Rail 2 preserved).
+- 4 unit tests: break-out fires → forge; stops at target → warrior; needs
+  min defenders → warrior; multi-city unaffected → warrior. `cargo test -p
+  mc-ai --lib`: 265 pass.
+
+## Validation (before/after autoplay batch)
+
+Baseline: apricot batch `20260516_183534` — P1 buildings = 0 in 10/10,
+`tier_peak = 1` in 10/10, 0/10 gate.
+
+After (this patch): __
+
+```
+PARALLEL=4 bash tools/autoplay-batch.sh 10 300 .local/batches/p1_29e_after
+# analyze: P1 buildings_max > 0 ? P1 tier_peak >= 2 ? P1 median survival turn ?
+```
+
+Per p1-29d iteration discipline: any directional movement (P1 builds >0
+buildings; survival-turn or tier_peak rises) confirms the lever direction
+even if the full ≥7/10 gate does not land in one pass. If the gate is still
+0–2/10, this objective stays `partial` and the next iteration targets the
+production-scale handicap (candidate 3), not another priors tweak.
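
Reading aid for the step-0 gate described under "Implemented patch": the block below is a minimal Python paraphrase of the three conditions, not the Rust in `production.rs`. Only the two constant values and the four test outcomes come from the objective text. The `CityState` fields, the `pick_for_city_sketch` name, the reading of `sole_city_threatened` as "exactly one city and threatened", the hard-coded `"forge"` result (the real patch is described as catalog-driven), and the `"rest_of_ladder"` placeholder are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Values quoted from the patch description.
SOLE_CITY_ECON_MIN_DEFENDERS = 2
SOLE_CITY_ECON_TARGET = 2


@dataclass
class CityState:
    """Hypothetical stand-in for the inputs pick_for_city looks at."""
    city_count: int   # cities owned by this player
    threatened: bool  # e.g. enemy_mil_max > own_mil
    defenders: int    # military units counted toward the defender floor
    buildings: int    # buildings already present in the city


def pick_for_city_sketch(s: CityState) -> str:
    # Assumed reading of `sole_city_threatened`: one city, and threatened.
    sole_city_threatened = s.city_count == 1 and s.threatened

    # Step 0 (the break-out): only fires when all three conditions hold.
    if (
        sole_city_threatened
        and s.defenders >= SOLE_CITY_ECON_MIN_DEFENDERS
        and s.buildings < SOLE_CITY_ECON_TARGET
    ):
        return "forge"  # placeholder for "best production building in catalog"

    # Step 1 (pre-existing short-circuit): threatened players get a melee unit.
    if s.threatened:
        return "warrior"

    # Steps 2..7 are outside the scope of this sketch.
    return "rest_of_ladder"


# The four unit-test cases listed in the patch notes, restated as checks.
assert pick_for_city_sketch(CityState(1, True, 2, 0)) == "forge"    # fires
assert pick_for_city_sketch(CityState(1, True, 2, 2)) == "warrior"  # at target
assert pick_for_city_sketch(CityState(1, True, 1, 0)) == "warrior"  # too few defenders
assert pick_for_city_sketch(CityState(2, True, 2, 0)) == "warrior"  # multi-city
```

Because the gate sits ahead of the step-1 short-circuit and is conditioned on the sole-city case, a multi-city or under-defended player falls through to the same `warrior` result as before, which is the "no other player's ladder changes" property claimed in candidate 1.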