docs(objectives): 📝 Add documentation for RL divergence mining objective (p1-29e)
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
parent dbeb3f4088
commit b7353d43e2
1 changed file with 130 additions and 0 deletions
130
.project/objectives/p1-29e-rl-divergence-mining.md
Normal file
@@ -0,0 +1,130 @@
---
id: p1-29e-rl-divergence-mining
title: "RL-policy divergence mining → sole-city economy break-out (production, not science)"
priority: p1
status: partial
scope: game1
owner: warcouncil
tags: [ai, rl, balance, production, tech]
updated_at: 2026-05-27
relates_to: [p1-29c, p1-29d]
evidence:
  - "tooling/rl_self_play/mine_divergence.py: reusable RL-vs-scripted divergence miner (scripted-vs-scripted trajectory + policy probe on identical PlayerView)"
  - "tooling/rl_self_play/runs/duel-v4-encfix-s7/divergence_slot1.json: 600 probed decisions, 10 seeds"
  - "src/simulator/crates/mc-ai/src/tactical/production.rs: SOLE_CITY_ECON_* consts + step-0 economy break-out in pick_for_city; 4 new unit tests (cargo test -p mc-ai --lib: 265 pass)"
---

## What this is

Dispatch task: mine the trained MaskablePPO policy `duel-v4-encfix-s7`
(+5.5 mean reward vs the scripted MCTS baseline) for behavioural
differences from the scripted AI, and turn them into concrete
heuristic-patch candidates targeting the two in-flight blockers p1-29c
(sole-city research path) and p1-29d (trailing-AI survival before T100).

## Method

`tooling/rl_self_play/mine_divergence.py` drives a scripted-vs-scripted
duel (both slots on the built-in `suggest()` chain, the same trajectory
p1-29c/29d measure against) and, at the start of every turn, probes the
trained policy on the **identical** `PlayerView` without applying its
choice. Divergences are logged and bucketed by board state. The run
covers 600 decisions across seeds 1–10, probing the trailing slot.

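The probe-without-apply loop described above can be sketched as follows. This is a minimal toy model, not the contents of `mine_divergence.py`: the `PlayerView` fields, the two choice functions, the bucket names, and the military bookkeeping are all hypothetical stand-ins chosen so the example runs on its own. The one property it is meant to show is that only the scripted choice ever advances the trajectory.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PlayerView:
    """Hypothetical stand-in for the real PlayerView observation."""
    turn: int
    own_mil: int
    enemy_mil_max: int


def scripted_choice(view):
    # Toy scripted rule: build military whenever out-gunned.
    return "warrior" if view.enemy_mil_max > view.own_mil else "forge"


def policy_choice(view):
    # Toy stand-in for the trained policy's preferred build.
    return "forge" if view.own_mil >= 2 else "warrior"


def bucket(view):
    # Hypothetical two-bucket board-state classifier.
    return "threatened" if view.enemy_mil_max > view.own_mil else "safe"


def mine(seeds, turns_per_seed):
    """Probe the policy on every scripted-trajectory state; log disagreements."""
    probed, divergences = 0, []
    for seed in seeds:
        rng = random.Random(seed)
        own_mil, enemy_mil_max = 1, 1
        for turn in range(turns_per_seed):
            view = PlayerView(turn, own_mil, enemy_mil_max)
            scripted = scripted_choice(view)
            probe = policy_choice(view)  # probed on the identical view, never applied
            probed += 1
            if probe != scripted:
                divergences.append({"seed": seed, "turn": turn, "bucket": bucket(view),
                                    "scripted": scripted, "policy": probe})
            # Only the scripted choice advances the trajectory.
            if scripted == "warrior":
                own_mil += 1
            enemy_mil_max += rng.choice([0, 1, 2])
    return probed, divergences


probed, divergences = mine(seeds=range(1, 11), turns_per_seed=60)
```

With 10 seeds and 60 probes per seed the toy run matches the 600-decision shape of the real one; the divergence counts it produces are artefacts of the toy rules and say nothing about the trained policy.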
## Findings
### F1 — The policy's action space cannot pick research at all

`tooling/rl_self_play/encoders.py` exposes only `end_turn`, per-unit micro
(skip / fortify / sentry / found_city / move / attack) and per-city
`queue_production` over a 16-item building roster. **There is no research
action.** p1-29c's entire "research-priority uplift" framing is therefore
orthogonal to anything the winning policy could do: the only economy lever
the policy has is the build queue.

### F2 — The winning policy's economy lever is PRODUCTION, never science

Across all 600 probed decisions and all four board-state buckets, the
policy's build preference is exclusively **`forge` (production) + `warrior`
(military)**. It chose a science/culture building (`library`,
`chronicle_tower`, `high_guild_hall`) **zero times**. This survives the
argmax-ordering confound: ordering noise can swap which action shows first,
but it cannot manufacture a complete absence across 600 decisions. The
winning play is a production multiplier, not science infrastructure.

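A tally of this kind could be computed along the following lines. The record layout (`bucket`, `policy_build`) is an assumption made for illustration and is not the actual schema of `divergence_slot1.json`; the category map covers only the item names mentioned above, and the sample records are synthetic.

```python
from collections import Counter

# Hypothetical category map; only the item names come from the findings above.
CATEGORY = {
    "forge": "production",
    "warrior": "military",
    "library": "science_culture",
    "chronicle_tower": "science_culture",
    "high_guild_hall": "science_culture",
}


def tally_by_bucket(records):
    """Count the policy's preferred build category per board-state bucket."""
    counts = {}
    for rec in records:
        category = CATEGORY.get(rec["policy_build"], "other")
        counts.setdefault(rec["bucket"], Counter())[category] += 1
    return counts


def never_chosen(records, category):
    """True if no probed decision picked an item of the given category."""
    return all(CATEGORY.get(r["policy_build"], "other") != category for r in records)


# Synthetic records for illustration only; not taken from the real run.
SAMPLE = [
    {"bucket": "threatened", "policy_build": "forge"},
    {"bucket": "threatened", "policy_build": "warrior"},
    {"bucket": "safe", "policy_build": "forge"},
]

counts = tally_by_bucket(SAMPLE)
science_absent = never_chosen(SAMPLE, "science_culture")
```

The absence check runs over every record rather than per bucket, which is the form of evidence the ordering argument relies on: a category that never appears anywhere cannot be explained by tie-breaking order.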
### F3 — The trailing scripted AI builds ZERO buildings (root cause)

Apricot batch `20260516_183534` (the p1-29d 0/10 baseline), P1 per-seed:

| seed | end turn | outcome | P0 tier_peak / cities / buildings | P1 tier_peak / cities / **buildings (max ever)** |
|------|----------|-------------|-----------------------------------|--------------------------------------------------|
| 1 | 63 | victory | 2 / 2 / 3 | 1 / 0 / **0** |
| 2 | 44 | victory | 2 / 3 / 6 | 1 / 0 / **0** |
| 3 | 152 | victory | 6 / 2 / 43 | 1 / 0 / **0** |
| 4 | 100 | victory | 2 / 2 / 3 | 1 / 0 / **0** |
| 5 | 300 | victory | 3 / 3 / 38 | 1 / 1 / **0** (alive, never developed) |
| 6 | 76 | victory | 2 / 2 / 3 | 1 / 0 / **0** |
| 7 | 203 | victory | 7 / 3 / 21 | 1 / 0 / **0** |
| 8 | 66 | victory | 2 / 2 / 3 | 1 / 0 / **0** |
| 9 | 278 | in progress | 10 / 3 / 56 | 1 / 1 / **0** (alive, never developed) |
| 10 | 57 | victory | 2 / 2 / 3 | 1 / 0 / **0** |

P1 builds **zero buildings in 10/10 seeds**, including the two seeds where
it survives to T278/T300 with a standing city. With no economy buildings it
cannot reach `tier_peak ≥ 2`.

### F4 — The cause is a structural short-circuit, and the prior science uplift is dead code

In `pick_for_city`, `classify_posture(threatened=true, …)` returns
`Posture::Threatened`, and **step 1 returns a melee unit every turn**,
ahead of the early-mil floor, expansion, and the step-7 building scorer.
A trailing sole-city AI is *always* threatened
(`enemy_mil_max > own_mil`), so it loops on military forever. Consequently
the p1-29b/29c `sole_city_threatened` science uplift in `score_building`
(step 7) **was never reached** in its target regime; it is dead code. No
prior intervention actually changed the ladder ordering: p1-29c tuned
action priors (`action_prior_with_context`), p1-29d tuned combat damage
(`solo_city_grace`); neither let the trailing AI build economy.

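The short-circuit can be illustrated with a toy model of the pre-patch ladder. This is a Python sketch of the control flow only, not the Rust code in `production.rs`: the function name, the trace labels, the starting military values, and the assumption that the leader adds one unit per turn are all hypothetical.

```python
def pick_for_city_prepatch(own_mil, enemy_mil_max, trace):
    """Toy pre-patch ladder: step 1 returns before step 7 can ever run."""
    threatened = enemy_mil_max > own_mil
    if threatened:
        trace.append("step1_military")
        return "warrior"  # step-1 short-circuit
    # Steps 2-6 are omitted; step 7 is where score_building would apply
    # the sole_city_threatened science uplift.
    trace.append("step7_score_building")
    return "library"


def simulate(turns, own_mil=1, enemy_mil_max=4):
    """Run the trailing AI for `turns` turns against a leader that keeps pace."""
    trace, picks = [], []
    for _ in range(turns):
        pick = pick_for_city_prepatch(own_mil, enemy_mil_max, trace)
        picks.append(pick)
        if pick == "warrior":
            own_mil += 1
        enemy_mil_max += 1  # assumption: the leader adds one unit per turn
    return picks, trace


picks, trace = simulate(turns=100)
```

Under these assumptions the gap never closes, so every pick is a warrior and the step-7 branch is never recorded in the trace, which is the sense in which the science uplift is unreachable.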
## Ranked rule candidates

1. **(IMPLEMENTED) Sole-city economy break-out.** When `sole_city_threatened`
   and a minimal defender floor is met and the city holds fewer than N
   buildings, interject **one production-category building** before the
   step-1 military short-circuit. Directly closes the F3 gap with the F2
   lever (production, not science). Lowest-risk: gated entirely on
   `sole_city_threatened`, so no other player's ladder changes.
2. **Drop / repurpose the dead science uplift.** The `sole_city_threatened`
   1.5× science multiplier in `score_building` is unreachable; either remove
   it (tech-debt) or fold it into the break-out's posture so the break-out
   can prefer science once production is covered. Deferred: F2 says the
   winning lever is production, so science-first is contraindicated.
3. **Production-scale handicap (out of scope here).** p1-29d already
   concluded a 1-city AI cannot out-build a 3-city opponent on volume; if the
   break-out is insufficient, the next lever is mechanical (free defender on
   threat) or map-placement, per p1-29d's closing notes, not a priors tweak.

## Implemented patch

`src/simulator/crates/mc-ai/src/tactical/production.rs`:
- `SOLE_CITY_ECON_MIN_DEFENDERS = 2`, `SOLE_CITY_ECON_TARGET = 2` consts.
- New step-0 in `pick_for_city`: for a threatened sole-city AI with ≥2
  defenders and <2 buildings, return the best production/infrastructure
  building from the catalog (catalog-driven, Rail 2 preserved).
- 4 unit tests: break-out fires → forge; stops at target → warrior; needs
  min defenders → warrior; multi-city unaffected → warrior.
  `cargo test -p mc-ai --lib`: 265 pass.

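The gate logic of the step-0 break-out can be sketched in Python as below. This models the conditions listed above and is not a translation of the Rust implementation: the `City` shape, the catalog entries, the `score` field, and the `rest_of_ladder` placeholder are hypothetical. Only the two constant values and the four expected outcomes come from this document.

```python
from dataclasses import dataclass

SOLE_CITY_ECON_MIN_DEFENDERS = 2
SOLE_CITY_ECON_TARGET = 2


@dataclass
class City:
    defenders: int
    buildings: int


# Hypothetical catalog; the real one is data-driven and much larger.
CATALOG = [
    {"name": "forge", "category": "production", "score": 3},
    {"name": "library", "category": "science", "score": 5},
]


def best_production_building(catalog):
    """Highest-scoring production/infrastructure entry, or None if there is none."""
    candidates = [b for b in catalog if b["category"] in ("production", "infrastructure")]
    return max(candidates, key=lambda b: b["score"])["name"] if candidates else None


def pick_for_city(city, city_count, threatened, catalog):
    sole_city_threatened = threatened and city_count == 1
    # Step 0: economy break-out, checked before the military short-circuit.
    if (sole_city_threatened
            and city.defenders >= SOLE_CITY_ECON_MIN_DEFENDERS
            and city.buildings < SOLE_CITY_ECON_TARGET):
        building = best_production_building(catalog)
        if building is not None:
            return building
    # Step 1: the existing threatened short-circuit.
    if threatened:
        return "warrior"
    return "rest_of_ladder"  # placeholder for the remaining steps


fires = pick_for_city(City(defenders=2, buildings=0), 1, True, CATALOG)
at_target = pick_for_city(City(defenders=2, buildings=2), 1, True, CATALOG)
few_defenders = pick_for_city(City(defenders=1, buildings=0), 1, True, CATALOG)
multi_city = pick_for_city(City(defenders=2, buildings=0), 3, True, CATALOG)
```

The four calls at the end mirror the four unit-test cases. Because every condition is conjoined with `sole_city_threatened`, a multi-city or unthreatened player falls through to the unchanged ladder.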
## Validation (before/after autoplay batch)

Baseline: apricot batch `20260516_183534`: P1 buildings = 0 in 10/10,
`tier_peak = 1` in 10/10, 0/10 gate.

After (this patch): _<pending; filled by the validation batch below>_

```
PARALLEL=4 bash tools/autoplay-batch.sh 10 300 .local/batches/p1_29e_after
# analyze: P1 buildings_max > 0 ? P1 tier_peak >= 2 ? P1 median survival turn ?
```

Per p1-29d iteration discipline: any directional movement (P1 builds >0
buildings; survival-turn or tier_peak rises) confirms the lever direction
even if the full ≥7/10 gate does not land in one pass. If the gate is still
0–2/10, this objective stays `partial` and the next iteration targets the
production-scale handicap (candidate 3), not another priors tweak.

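The three questions in the analysis comment could be answered with a small summary helper such as the one below. The row layout and function names are hypothetical and do not describe the output format of `autoplay-batch.sh`. The baseline rows are the P1 values from the F3 table, and using the game's end turn as a proxy for P1's survival turn is an assumption of this sketch. No post-patch data is included because that batch is still pending.

```python
from statistics import median

# (seed, end_turn, p1_tier_peak, p1_buildings_max), taken from the F3 baseline table.
BASELINE = [
    (1, 63, 1, 0), (2, 44, 1, 0), (3, 152, 1, 0), (4, 100, 1, 0), (5, 300, 1, 0),
    (6, 76, 1, 0), (7, 203, 1, 0), (8, 66, 1, 0), (9, 278, 1, 0), (10, 57, 1, 0),
]


def summarize(rows):
    """Per-batch answers to the three questions in the analysis comment."""
    return {
        "seeds_with_buildings": sum(1 for _, _, _, b in rows if b > 0),
        "seeds_tier_peak_ge_2": sum(1 for _, _, tp, _ in rows if tp >= 2),
        # Assumption: end turn stands in for P1's survival turn.
        "median_end_turn": median(t for _, t, _, _ in rows),
    }


def directional_movement(before, after):
    """True if any tracked metric rose between two batch summaries."""
    return any(after[key] > before[key] for key in before)


baseline_summary = summarize(BASELINE)
```

Once the after-batch exists, passing its rows through `summarize` and comparing with `directional_movement(baseline_summary, after_summary)` would give the yes/no signal the iteration discipline asks for, independent of whether the ≥7/10 gate lands.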