docs(objectives): 📝 Add documentation for RL divergence mining objective (p1-29e)

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
autocommit 2026-05-27 20:26:00 -07:00
parent dbeb3f4088
commit b7353d43e2


@@ -0,0 +1,130 @@
---
id: p1-29e-rl-divergence-mining
title: "RL-policy divergence mining → sole-city economy break-out (production, not science)"
priority: p1
status: partial
scope: game1
owner: warcouncil
tags: [ai, rl, balance, production, tech]
updated_at: 2026-05-27
relates_to: [p1-29c, p1-29d]
evidence:
- "tooling/rl_self_play/mine_divergence.py — reusable RL-vs-scripted divergence miner (scripted-vs-scripted trajectory + policy probe on identical PlayerView)"
- "tooling/rl_self_play/runs/duel-v4-encfix-s7/divergence_slot1.json — 600 probed decisions, 10 seeds"
- "src/simulator/crates/mc-ai/src/tactical/production.rs — SOLE_CITY_ECON_* consts + step-0 economy break-out in pick_for_city; 4 new unit tests (cargo test -p mc-ai --lib: 265 pass)"
---
## What this is
Dispatch task: mine the trained MaskablePPO policy `duel-v4-encfix-s7`
(+5.5 mean reward vs the scripted MCTS baseline) for behavioural diffs vs
the scripted AI, and turn them into concrete heuristic-patch candidates
targeting the two in-flight blockers p1-29c (sole-city research path) and
p1-29d (trailing-AI survival before T100).
## Method
`tooling/rl_self_play/mine_divergence.py` drives a scripted-vs-scripted
duel (both slots on the built-in `suggest()` chain — the same trajectory
p1-29c/29d measure against) and, at the start of every turn, probes the
trained policy on the **identical** `PlayerView` without applying its
choice. Divergences are logged and bucketed by board state. The run covers
600 decisions across seeds 1-10, probing the trailing slot.
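
The probe loop described above can be sketched roughly as follows. This is a minimal illustration, not the contents of `mine_divergence.py`: `ToyDuel`, its `player_view()` / `apply()` methods, and the two-value bucket are hypothetical stand-ins for the real simulator and `PlayerView`. The point it shows is that the policy is queried on the same view the scripted AI sees, while only the scripted action advances the game.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyDuel:
    """Hypothetical stand-in for the simulator; not the real game API."""
    turn: int = 0
    applied: List[str] = field(default_factory=list)

    def player_view(self) -> Dict:
        # Stand-in for the real PlayerView: only what this sketch needs.
        return {"turn": self.turn, "threatened": self.turn % 2 == 0}

    def apply(self, action: str) -> None:
        self.applied.append(action)
        self.turn += 1


def mine_divergence(
    env: ToyDuel,
    scripted: Callable[[Dict], str],
    policy: Callable[[Dict], str],
    turns: int,
) -> List[Dict]:
    """Log every turn where the probed policy disagrees with the scripted AI."""
    divergences = []
    for _ in range(turns):
        view = env.player_view()
        scripted_action = scripted(view)
        probed_action = policy(view)  # queried on the identical view, never applied
        if probed_action != scripted_action:
            divergences.append({
                "turn": view["turn"],
                "bucket": "threatened" if view["threatened"] else "safe",
                "scripted": scripted_action,
                "policy": probed_action,
            })
        env.apply(scripted_action)  # the trajectory stays scripted-vs-scripted
    return divergences
```

Because the probed action is never applied, every divergence is measured against the same trajectory that p1-29c/29d use as their baseline.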
## Findings
### F1 — The policy's action space cannot pick research at all
`tooling/rl_self_play/encoders.py` exposes only `end_turn`, per-unit micro
(skip / fortify / sentry / found_city / move / attack) and per-city
`queue_production` over a 16-item building roster. **There is no research
action.** p1-29c's entire "research-priority uplift" framing is orthogonal
to anything the winning policy could even do. The only economy lever the
policy has is the build queue.
### F2 — The winning policy's economy lever is PRODUCTION, never science
Across all 600 probed decisions and all four board-state buckets, the
policy's build preference is exclusively **`forge` (production) + `warrior`
(military)**. It chose a science/culture building (`library`,
`chronicle_tower`, `high_guild_hall`) **zero times**. This survives the
argmax-ordering confound: ordering noise can swap which action shows first,
but it cannot manufacture a complete absence across 600 decisions. The winning
play is a production multiplier, not science infrastructure.
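
A check of this kind amounts to tallying the policy's preferred build item per bucket and counting hits on the science/culture set. The sketch below assumes each probe record is a dict with the hypothetical keys `bucket` and `policy_build`; the real schema of `divergence_slot1.json` is not shown in this document and may differ.

```python
from collections import Counter
from typing import Dict, Iterable

# Science/culture buildings as named in F2.
SCIENCE_CULTURE = {"library", "chronicle_tower", "high_guild_hall"}


def tally_builds(records: Iterable[Dict]) -> Dict[str, Counter]:
    """Count the policy's build choice per board-state bucket."""
    by_bucket: Dict[str, Counter] = {}
    for rec in records:
        # 'bucket' and 'policy_build' are assumed field names.
        by_bucket.setdefault(rec["bucket"], Counter())[rec["policy_build"]] += 1
    return by_bucket


def science_picks(by_bucket: Dict[str, Counter]) -> int:
    """Total science/culture picks across all buckets (Counter returns 0 if absent)."""
    return sum(counts[item] for counts in by_bucket.values() for item in SCIENCE_CULTURE)
```

A result of `science_picks(...) == 0` over every bucket is the "complete absence" that ordering noise alone could not produce.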
### F3 — The trailing scripted AI builds ZERO buildings (root cause)
Apricot batch `20260516_183534` (the p1-29d 0/10 baseline), P1 per-seed:
| seed | end turn | outcome | P0 tp / cities / buildings | P1 tp / cities / **buildings (max ever)** |
|------|----------|----------|----------------------------|-------------------------------------------|
| 1 | 63 | victory | 2 / 2 / 3 | 1 / 0 / **0** |
| 2 | 44 | victory | 2 / 3 / 6 | 1 / 0 / **0** |
| 3 | 152 | victory | 6 / 2 / 43 | 1 / 0 / **0** |
| 4 | 100 | victory | 2 / 2 / 3 | 1 / 0 / **0** |
| 5 | 300 | victory | 3 / 3 / 38 | 1 / 1 / **0** (alive, never developed) |
| 6 | 76 | victory | 2 / 2 / 3 | 1 / 0 / **0** |
| 7 | 203 | victory | 7 / 3 / 21 | 1 / 0 / **0** |
| 8 | 66 | victory | 2 / 2 / 3 | 1 / 0 / **0** |
| 9 | 278 | in_prog. | 10 / 3 / 56 | 1 / 1 / **0** (alive, never developed) |
| 10 | 57 | victory | 2 / 2 / 3 | 1 / 0 / **0** |
P1 builds **zero buildings in 10/10 seeds**, including the two seeds where
it survives to T278/T300 with a standing city. With no economy buildings it
cannot reach `tier_peak ≥ 2` (the `tp` column above; P1 stays at 1 in every seed).
### F4 — The cause is a structural short-circuit, and the prior science uplift is dead code
In `pick_for_city`, `classify_posture(threatened=true, …)` returns
`Posture::Threatened`, and **step 1 returns a melee unit every turn**
before the early-mil floor, before expansion, and before the step-7
building scorer. A trailing sole-city AI is *always* threatened
(`enemy_mil_max > own_mil`), so it loops on military forever. Consequently
the p1-29b/29c `sole_city_threatened` science uplift in `score_building`
(step 7) **was never reached** in its target regime — it is dead code. No
prior intervention actually changed the ladder ordering: p1-29c tuned
action priors (`action_prior_with_context`), p1-29d tuned combat damage
(`solo_city_grace`); neither let the trailing AI build economy.
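
The unreachability argument can be modelled in a few lines. This is a simplified Python model of the ladder, not the Rust code in `production.rs`: steps 2 to 6 are elided, the return values are placeholders, and defining `sole_city_threatened` as "threatened with exactly one city" is an assumption made for illustration.

```python
from typing import List


def pick_for_city(own_mil: int, enemy_mil_max: int, city_count: int,
                  scorer_calls: List[bool]) -> str:
    """Simplified pre-patch ladder; records each time the step-7 scorer runs."""
    threatened = enemy_mil_max > own_mil
    # Assumed definition, for illustration only.
    sole_city_threatened = threatened and city_count == 1

    # Step 1: a threatened posture returns a melee unit immediately.
    if threatened:
        return "melee_unit"

    # Steps 2-6 (early-mil floor, expansion, ...) elided.

    # Step 7: building scorer, where the science uplift lives.
    scorer_calls.append(sole_city_threatened)
    return "building"
```

Since `sole_city_threatened` implies `threatened`, step 7 only ever runs with the flag set to `False`; the uplift branch never executes in the regime it was written for.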
## Ranked rule candidates
1. **(IMPLEMENTED) Sole-city economy break-out.** When `sole_city_threatened`
and a minimal defender floor is met and the city holds fewer than N
buildings, interject **one production-category building** before the
step-1 military short-circuit. Directly closes the F3 gap with the F2
lever (production, not science). Lowest-risk: gated entirely on
`sole_city_threatened`, so no other player's ladder changes.
2. **Drop / repurpose the dead science uplift.** The `sole_city_threatened`
1.5× science multiplier in `score_building` is unreachable; either remove
it (tech-debt) or fold it into the break-out's posture so the break-out
can prefer science once production is covered. Deferred — F2 says the
winning lever is production, so science-first is contraindicated.
3. **Production-scale handicap (out of scope here).** p1-29d already
concluded a 1-city AI cannot out-build a 3-city opponent on volume; if the
break-out is insufficient, the next lever is mechanical (free defender on
threat) or map-placement, per p1-29d's closing notes — not a priors tweak.
## Implemented patch
`src/simulator/crates/mc-ai/src/tactical/production.rs`:
- `SOLE_CITY_ECON_MIN_DEFENDERS = 2`, `SOLE_CITY_ECON_TARGET = 2` consts.
- New step-0 in `pick_for_city`: for a threatened sole-city AI with ≥2
defenders and <2 buildings, return the best production/infrastructure
building from the catalog (catalog-driven, Rail 2 preserved).
- 4 unit tests: break-out fires → forge; stops at target → warrior; needs
min defenders → warrior; multi-city unaffected → warrior. `cargo test -p
mc-ai --lib`: 265 pass.
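
The gating logic of the new step 0 can be illustrated with a small Python model. It is a sketch under stated assumptions, not the Rust implementation: the catalog lookup is replaced by a hard-coded `"forge"`, the rest of the ladder is collapsed into a placeholder, and `sole_city_threatened` is assumed to mean "threatened with exactly one city". The two constants carry the values listed above.

```python
SOLE_CITY_ECON_MIN_DEFENDERS = 2
SOLE_CITY_ECON_TARGET = 2


def pick_for_city(threatened: bool, city_count: int,
                  defenders: int, buildings: int) -> str:
    """Simplified post-patch ladder: step 0 runs before the step-1 short-circuit."""
    # Assumed definition, for illustration only.
    sole_city_threatened = threatened and city_count == 1

    # Step 0: economy break-out, gated on defender floor and building target.
    if (sole_city_threatened
            and defenders >= SOLE_CITY_ECON_MIN_DEFENDERS
            and buildings < SOLE_CITY_ECON_TARGET):
        return "forge"  # stand-in for the best production building in the catalog

    # Step 1: threatened posture still returns a melee unit.
    if threatened:
        return "warrior"

    return "rest_of_ladder"  # placeholder for steps 2-7
```

The four unit-test scenarios map onto this model directly: `(True, 1, 2, 0)` gives `forge`, while `(True, 1, 2, 2)`, `(True, 1, 1, 0)` and `(True, 3, 2, 0)` all fall through to `warrior`.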
## Validation (before/after autoplay batch)
Baseline: apricot batch `20260516_183534` — P1 buildings = 0 in 10/10,
`tier_peak = 1` in 10/10, 0/10 gate.
After (this patch): _pending; to be filled in from the validation batch below_
```
PARALLEL=4 bash tools/autoplay-batch.sh 10 300 .local/batches/p1_29e_after
# analyze: P1 buildings_max > 0 ? P1 tier_peak >= 2 ? P1 median survival turn ?
```
Per p1-29d iteration discipline: any directional movement (P1 builds >0
buildings; survival-turn or tier_peak rises) confirms the lever direction
even if the full ≥7/10 gate does not land in one pass. If the gate is still
0-2/10, this objective stays `partial` and the next iteration targets the
production-scale handicap (candidate 3), not another priors tweak.
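
The three questions in the analysis comment, plus the directional-movement test, could be computed along these lines. The per-seed field names (`p1_buildings_max`, `p1_tier_peak`, `p1_survival_turn`) are hypothetical; the actual output format of the autoplay batch is not described here, so treat this as a sketch to adapt.

```python
from statistics import median
from typing import Dict, List


def summarize_p1(results: List[Dict]) -> Dict[str, float]:
    """Per-batch P1 summary; one input dict per seed (field names are assumed)."""
    return {
        "seeds_with_buildings": sum(1 for r in results if r["p1_buildings_max"] > 0),
        "seeds_tier2": sum(1 for r in results if r["p1_tier_peak"] >= 2),
        "median_survival": median(r["p1_survival_turn"] for r in results),
    }


def directional_movement(before: Dict[str, float], after: Dict[str, float]) -> bool:
    """True if any of the three indicators improved relative to the baseline."""
    return (
        after["seeds_with_buildings"] > before["seeds_with_buildings"]
        or after["seeds_tier2"] > before["seeds_tier2"]
        or after["median_survival"] > before["median_survival"]
    )
```

Comparing `summarize_p1` of the baseline batch against the after-patch batch gives a yes/no answer on lever direction independently of whether the full gate lands.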