docs(objectives): 📝 Add documentation for RL divergence mining objective (p1-29e)

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-05-27 20:26:00 -07:00 · b7353d43e2
commit b7353d43e2
parent dbeb3f4088
1 changed file with 130 additions and 0 deletions
--- a/.project/objectives/p1-29e-rl-divergence-mining.md
+++ b/.project/objectives/p1-29e-rl-divergence-mining.md
@@ -0,0 +1,130 @@
+---
+id: p1-29e-rl-divergence-mining
+title: "RL-policy divergence mining → sole-city economy break-out (production, not science)"
+priority: p1
+status: partial
+scope: game1
+owner: warcouncil
+tags: [ai, rl, balance, production, tech]
+updated_at: 2026-05-27
+relates_to: [p1-29c, p1-29d]
+evidence:
+  - "tooling/rl_self_play/mine_divergence.py — reusable RL-vs-scripted divergence miner (scripted-vs-scripted trajectory + policy probe on identical PlayerView)"
+  - "tooling/rl_self_play/runs/duel-v4-encfix-s7/divergence_slot1.json — 600 probed decisions, 10 seeds"
+  - "src/simulator/crates/mc-ai/src/tactical/production.rs — SOLE_CITY_ECON_* consts + step-0 economy break-out in pick_for_city; 4 new unit tests (cargo test -p mc-ai --lib: 265 pass)"
+---
+
+## What this is
+
+Dispatch task: mine the trained MaskablePPO policy `duel-v4-encfix-s7`
+(+5.5 mean reward vs the scripted MCTS baseline) for behavioural diffs vs
+the scripted AI, and turn them into concrete heuristic-patch candidates
+targeting the two in-flight blockers p1-29c (sole-city research path) and
+p1-29d (trailing-AI survival before T100).
+
+## Method
+
+`tooling/rl_self_play/mine_divergence.py` drives a scripted-vs-scripted
+duel (both slots on the built-in `suggest()` chain — the same trajectory
+p1-29c/29d measure against) and, at the start of every turn, probes the
+trained policy on the **identical** `PlayerView` without applying its
+choice. Divergences are logged and bucketed by board state. 600 decisions
+across seeds 1–10, probing the trailing slot.
+
+## Findings
+
+### F1 — The policy's action space cannot pick research at all
+`tooling/rl_self_play/encoders.py` exposes only `end_turn`, per-unit micro
+(skip / fortify / sentry / found_city / move / attack) and per-city
+`queue_production` over a 16-item building roster. **There is no research
+action.** p1-29c's entire "research-priority uplift" framing is orthogonal
+to anything the winning policy could even do. The only economy lever the
+policy has is the build queue.
+
+### F2 — The winning policy's economy lever is PRODUCTION, never science
+Across all 600 probed decisions and all four board-state buckets, the
+policy's build preference is exclusively **`forge` (production) + `warrior`
+(military)**. It chose a science/culture building (`library`,
+`chronicle_tower`, `high_guild_hall`) **zero times**. This survives the
+argmax-ordering confound: ordering noise can swap which action shows first,
+but it cannot manufacture a complete absence across 600 decisions. The winning
+play is a production multiplier, not science infrastructure.
+
+### F3 — The trailing scripted AI builds ZERO buildings (root cause)
+Apricot batch `20260516_183534` (the p1-29d 0/10 baseline), per-seed results (`tp` = `tier_peak`):
+
+| seed | end turn | outcome  | P0 tp / cities / buildings | P1 tp / cities / **buildings (max ever)** |
+|------|----------|----------|----------------------------|-------------------------------------------|
+| 1    | 63       | victory  | 2 / 2 / 3                  | 1 / 0 / **0** |
+| 2    | 44       | victory  | 2 / 3 / 6                  | 1 / 0 / **0** |
+| 3    | 152      | victory  | 6 / 2 / 43                 | 1 / 0 / **0** |
+| 4    | 100      | victory  | 2 / 2 / 3                  | 1 / 0 / **0** |
+| 5    | 300      | victory  | 3 / 3 / 38                 | 1 / 1 / **0** (alive, never developed) |
+| 6    | 76       | victory  | 2 / 2 / 3                  | 1 / 0 / **0** |
+| 7    | 203      | victory  | 7 / 3 / 21                 | 1 / 0 / **0** |
+| 8    | 66       | victory  | 2 / 2 / 3                  | 1 / 0 / **0** |
+| 9    | 278      | in_prog. | 10 / 3 / 56                | 1 / 1 / **0** (alive, never developed) |
+| 10   | 57       | victory  | 2 / 2 / 3                  | 1 / 0 / **0** |
+
+P1 builds **zero buildings in 10/10 seeds**, including the two seeds where
+it survives to T278/T300 with a standing city. With no economy buildings it
+cannot reach `tier_peak ≥ 2`.
+
+### F4 — The cause is a structural short-circuit, and the prior science uplift is dead code
+In `pick_for_city`, `classify_posture(threatened=true, …)` returns
+`Posture::Threatened`, and **step 1 returns a melee unit every turn** —
+before the early-mil floor, before expansion, and before the step-7
+building scorer. A trailing sole-city AI is *always* threatened
+(`enemy_mil_max > own_mil`), so it loops on military forever. Consequently
+the p1-29b/29c `sole_city_threatened` science uplift in `score_building`
+(step 7) **was never reached** in its target regime — it is dead code. No
+prior intervention actually changed the ladder ordering: p1-29c tuned
+action priors (`action_prior_with_context`), p1-29d tuned combat damage
+(`solo_city_grace`); neither let the trailing AI build economy.
+
+## Ranked rule candidates
+
+1. **(IMPLEMENTED) Sole-city economy break-out.** When `sole_city_threatened`
+   and a minimal defender floor is met and the city holds fewer than N
+   buildings, interject **one production-category building** before the
+   step-1 military short-circuit. Directly closes the F3 gap with the F2
+   lever (production, not science). Lowest-risk: gated entirely on
+   `sole_city_threatened`, so no other player's ladder changes.
+2. **Drop / repurpose the dead science uplift.** The `sole_city_threatened`
+   1.5× science multiplier in `score_building` is unreachable; either remove
+   it (tech-debt) or fold it into the break-out's posture so the break-out
+   can prefer science once production is covered. Deferred — F2 says the
+   winning lever is production, so science-first is contraindicated.
+3. **Production-scale handicap (out of scope here).** p1-29d already
+   concluded a 1-city AI cannot out-build a 3-city opponent on volume; if the
+   break-out is insufficient, the next lever is mechanical (free defender on
+   threat) or map-placement, per p1-29d's closing notes — not a priors tweak.
+
+## Implemented patch
+
+`src/simulator/crates/mc-ai/src/tactical/production.rs`:
+- `SOLE_CITY_ECON_MIN_DEFENDERS = 2`, `SOLE_CITY_ECON_TARGET = 2` consts.
+- New step-0 in `pick_for_city`: for a threatened sole-city AI with ≥2
+  defenders and <2 buildings, return the best production/infrastructure
+  building from the catalog (catalog-driven, Rail 2 preserved).
+- 4 unit tests: break-out fires → forge; stops at target → warrior; needs
+  min defenders → warrior; multi-city unaffected → warrior. `cargo test -p
+  mc-ai --lib`: 265 pass.
+
+## Validation (before/after autoplay batch)
+
+Baseline: apricot batch `20260516_183534` — P1 buildings = 0 in 10/10,
+`tier_peak = 1` in 10/10, 0/10 gate.
+
+After (this patch): _<pending — filled by the validation batch below>_
+
+```
+PARALLEL=4 bash tools/autoplay-batch.sh 10 300 .local/batches/p1_29e_after
+# analyze: P1 buildings_max > 0 ?  P1 tier_peak >= 2 ?  P1 median survival turn ?
+```
+
+Per p1-29d iteration discipline: any directional movement (P1 builds >0
+buildings; survival-turn or tier_peak rises) confirms the lever direction
+even if the full ≥7/10 gate does not land in one pass. If the gate is still
+0–2/10, this objective stays `partial` and the next iteration targets the
+production-scale handicap (candidate 3), not another priors tweak.
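
The before/after comparison described in the Validation section can be sketched in Python as below. This is a minimal illustration, not the project's actual analysis script. The `SeedResult` record, the `summarize` and `directional_movement` helpers, and their field names are all hypothetical. Two assumptions are baked in: the gate is read as "P1 reaches `tier_peak >= 2`", and the batch end turn stands in for P1's survival turn. The baseline rows are copied from the F3 table.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class SeedResult:
    """One seed of an autoplay batch (hypothetical record shape)."""
    seed: int
    end_turn: int          # used here as a proxy for P1 survival turn (assumption)
    p1_tier_peak: int
    p1_buildings_max: int


# Baseline rows taken from the F3 table (batch 20260516_183534).
BASELINE = [
    SeedResult(1, 63, 1, 0), SeedResult(2, 44, 1, 0),
    SeedResult(3, 152, 1, 0), SeedResult(4, 100, 1, 0),
    SeedResult(5, 300, 1, 0), SeedResult(6, 76, 1, 0),
    SeedResult(7, 203, 1, 0), SeedResult(8, 66, 1, 0),
    SeedResult(9, 278, 1, 0), SeedResult(10, 57, 1, 0),
]


def summarize(results, gate_tier=2):
    """Collapse a batch into the three numbers the validation step asks about."""
    return {
        "seeds": len(results),
        # How many seeds saw P1 finish at least one building.
        "built_any": sum(r.p1_buildings_max > 0 for r in results),
        # How many seeds pass the gate (assumed: tier_peak >= gate_tier).
        "gate_pass": sum(r.p1_tier_peak >= gate_tier for r in results),
        "median_end_turn": median(r.end_turn for r in results),
    }


def directional_movement(before, after):
    """True if any tracked metric improved, even if the full gate did not land."""
    return (
        after["built_any"] > before["built_any"]
        or after["gate_pass"] > before["gate_pass"]
        or after["median_end_turn"] > before["median_end_turn"]
    )


baseline_summary = summarize(BASELINE)
print(baseline_summary)
```

Once the after-batch finishes, its per-seed records would be passed through `summarize` in the same way and compared with `directional_movement(baseline_summary, after_summary)`. A `gate_pass` count of 7 or more would correspond to the full gate landing.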