docs(objectives): 📝 Update validation section in RL divergence mining objectives to reflect current state and add batch analysis details

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
autocommit 2026-05-27 21:15:13 -07:00
parent b7353d43e2
commit 32a694eab7


@@ -111,20 +111,70 @@ action priors (`action_prior_with_context`), p1-29d tuned combat damage
min defenders → warrior; multi-city unaffected → warrior. `cargo test -p
mc-ai --lib`: 265 pass.
## Validation (before/after autoplay batch)
## Validation (before/after autoplay batch) — GATE NOT MET
Baseline: apricot batch `20260516_183534` — P1 buildings = 0 in 10/10,
`tier_peak = 1` in 10/10, 0/10 gate.
Local 10-seed T300 batch on the patched build:
`.local/batches/p1_29e_after` (fresh GDExtension rebuild from working tree).
After (this patch): _<pending filled by the validation batch below>_
**Headline (do not be fooled):** vs the stale apricot baseline
`20260516_183534`, P1 `tier_peak` rose 1 → 2-5 in **10/10** seeds. This is
**NOT attributable to this patch** — it is main-branch drift. Three facts
establish that:
```
PARALLEL=4 bash tools/autoplay-batch.sh 10 300 .local/batches/p1_29e_after
# analyze: P1 buildings_max > 0 ? P1 tier_peak >= 2 ? P1 median survival turn ?
```
1. `tier_peak` is defined as *highest tech-era researched*
(`turn_processor.gd::_player_tier_peak`; mirrored in
`auto_play.gd`). It is research-driven, NOT building- or unit-driven.
2. This patch only adds **buildings**. P1 completed **zero buildings** in
all 10 seeds (`player_stats.buildings` max = 0; only `owner=0` appears in
`city_building_completed`). The patch produced no material change a
research metric could reflect.
3. The baseline `20260516_183534` is a *different commit*; comparing across
it conflates this patch with all intervening main-branch changes.
Per p1-29d iteration discipline: any directional movement (P1 builds >0
buildings; survival-turn or tier_peak rises) confirms the lever direction
even if the full ≥7/10 gate does not land in one pass. If the gate is still
02/10, this objective stays `partial` and the next iteration targets the
production-scale handicap (candidate 3), not another priors tweak.
So the apparent improvement is a measurement artifact of comparing against a
stale-commit baseline. **The completion gate (a candidate validated by a
before/after batch showing the metric move *because of the candidate*) is NOT
met.**
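The attribution argument above can be sketched as a small check. This is a minimal illustration, not the project's analysis tooling: the `SeedResult` record, its field names, and the sample values in the usage note are hypothetical, and only the rule itself (a buildings-only patch cannot be credited with a research metric if it completed no buildings) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class SeedResult:
    """Hypothetical per-seed summary for P1 (field names are assumptions)."""
    seed: int
    p1_buildings_max: int  # max of player_stats.buildings for P1 over the run
    p1_tier_peak: int      # highest tech era P1 researched


def summarize(results):
    """Count seeds where the patch's lever (buildings) and the metric moved."""
    return {
        "seeds": len(results),
        "lever_moved": sum(r.p1_buildings_max > 0 for r in results),
        "tier_peak_ge_2": sum(r.p1_tier_peak >= 2 for r in results),
    }


def creditable(summary, same_commit_baseline):
    """A metric change is creditable to a buildings-only patch only if the
    baseline shares the commit AND at least one building was completed."""
    return same_commit_baseline and summary["lever_moved"] > 0
```

With placeholder inputs such as `[SeedResult(1, 0, 2), SeedResult(2, 0, 5)]`, `summarize` reports `tier_peak_ge_2 == 2` but `lever_moved == 0`, so `creditable(...)` is `False` regardless of the baseline. That is the situation described above.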
### What the fresh batch actually revealed (more valuable than the patch)
Current-main P1 behaviour differs from the p1-29d baseline narrative:
- P1 reaches `tier_peak` 2-5 by **pure research** (techs 9-35) — the old
"P1 stuck at tier_peak=1" symptom is **already gone on current main**.
- P1 still loses its capital in 8/10 (eliminated T44-100) with
`kills=1-10`, `units_lost=1-4` — it fights but loses.
- Survivor seeds 5 and 9 (T300/T272, 1 city each): `mil=0`, `buildings=0`,
`pop=17/33`, `techs=31-35` — P1 researches to era 5 but builds **nothing
material** for 250+ turns. Possible production stall worth its own
investigation (snapshot-timing artifact vs genuine stall — unconfirmed).
### Why the patch did not demonstrably help
The break-out gate `own_mil >= SOLE_CITY_ECON_MIN_DEFENDERS (2)` requires the
sole-city AI to hold ≥2 standing non-founder units at decision time. P1's
`mil` snapshot is 0 at every recorded turn in 10/10 seeds (it fights with
short-lived units that never appear in a snapshot). Whether the gate ever fired is
unconfirmed (the engine emits no production-queue event to detect it from
batch artifacts). Either way the patch completed 0 buildings, so it had no
observable effect — and the `own_mil>=2` floor may be exactly wrong for the
weakest player.
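As a rough illustration of why a snapshot `mil` of 0 keeps the gate closed, here is a sketch of the predicate as described above. It is not the code in `production.rs`: the function names are hypothetical, and only the constant name, its value (2), and the `sole_city_threatened` condition are taken from the text.

```python
SOLE_CITY_ECON_MIN_DEFENDERS = 2  # value quoted in the text


def breakout_allowed(sole_city_threatened: bool, own_mil: int) -> bool:
    """Assumed shape of the gate: only a threatened sole-city AI holding
    enough standing non-founder units may break out into economy builds."""
    return sole_city_threatened and own_mil >= SOLE_CITY_ECON_MIN_DEFENDERS


def open_at_any_snapshot(mil_snapshots, sole_city_threatened=True) -> bool:
    """True if the gate would have been open at any recorded snapshot."""
    return any(breakout_allowed(sole_city_threatened, m) for m in mil_snapshots)
```

A series of all-zero snapshots yields `False` here. That shows the gate was closed at every recorded turn, not that it never opened between snapshots, which is why the question stays unconfirmed.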
### Honest status & next steps
- **Gate: NOT MET.** No metric movement attributable to this patch.
- The patch is gated to `sole_city_threatened` and fully unit-tested
(265 mc-ai tests green), so it is safe in-tree, but **unvalidated** — the
consuming p1-29c/29d worker should validate or revert it.
- **Remaining attribution step (deferred due to host load):** a controlled
before/after on the *same* fresh build (HEAD vs HEAD+patch, same 10 seeds)
is the only clean way to attribute (or refute) any effect. On hold while the
host runs ≥20 concurrent `godot-bin` processes (host guard); run when load drops:
```
# baseline = revert the two production.rs edits, rebuild, run; then re-apply
PARALLEL=3 bash tools/autoplay-batch.sh 10 300 .local/batches/p1_29e_before
```
- **Reframe for the next iteration:** the failure regime on current main is
*survival with no military* and a possible *production stall*, not the old
"military-spam, no economy". Re-baseline p1-29c/29d against current main
before further patch work; the `tier_peak=1` symptom they target may already
be resolved.
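Once both same-build batches exist, the paired comparison could be evaluated along these lines. This is a sketch under assumptions: the helper names are hypothetical, each input maps a seed to one P1 metric (for example buildings max) that would have to be extracted from the batch artifacts separately, and the threshold is left as a parameter because the exact gate metric is not restated here.

```python
def paired_deltas(before: dict, after: dict) -> dict:
    """Per-seed change in one P1 metric, HEAD vs HEAD+patch.

    Both batches must cover exactly the same seeds, otherwise the
    comparison is no longer paired and the result is rejected.
    """
    if before.keys() != after.keys():
        raise ValueError("before/after batches must cover the same seeds")
    return {seed: after[seed] - before[seed] for seed in sorted(before)}


def seeds_improved(deltas: dict) -> int:
    """Number of seeds where the metric strictly increased."""
    return sum(d > 0 for d in deltas.values())


def gate_met(deltas: dict, min_seeds: int) -> bool:
    """True if at least `min_seeds` seeds improved (threshold is caller-supplied)."""
    return seeds_improved(deltas) >= min_seeds
```

Because both runs share the build and the seeds, any nonzero delta reflects the patch rather than main-branch drift, apart from whatever nondeterminism the engine itself has.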