diff --git a/.project/objectives/p1-29e-rl-divergence-mining.md b/.project/objectives/p1-29e-rl-divergence-mining.md
index e56371f9..42e5cfac 100644
--- a/.project/objectives/p1-29e-rl-divergence-mining.md
+++ b/.project/objectives/p1-29e-rl-divergence-mining.md
@@ -111,20 +111,70 @@ action priors (`action_prior_with_context`), p1-29d tuned combat damage
 min defenders → warrior; multi-city unaffected → warrior.
 `cargo test -p mc-ai --lib`: 265 pass.
 
-## Validation (before/after autoplay batch)
+## Validation (before/after autoplay batch) — GATE NOT MET
 
-Baseline: apricot batch `20260516_183534` — P1 buildings = 0 in 10/10,
-`tier_peak = 1` in 10/10, 0/10 gate.
+Local 10-seed T300 batch on the patched build:
+`.local/batches/p1_29e_after` (fresh GDExtension rebuild from working tree).
 
-After (this patch): __
+**Headline (do not be fooled):** vs the stale apricot baseline
+`20260516_183534`, P1 `tier_peak` rose 1 → 2-5 in **10/10** seeds. This is
+**NOT attributable to this patch** — it is main-branch drift. Three facts
+establish that:
 
-```
-PARALLEL=4 bash tools/autoplay-batch.sh 10 300 .local/batches/p1_29e_after
-# analyze: P1 buildings_max > 0 ? P1 tier_peak >= 2 ? P1 median survival turn ?
-```
+1. `tier_peak` is defined as *highest tech-era researched*
+   (`turn_processor.gd::_player_tier_peak`; mirrored in
+   `auto_play.gd`). It is research-driven, NOT building- or unit-driven.
+2. This patch only adds **buildings**. P1 completed **zero buildings** in
+   all 10 seeds (`player_stats.buildings` max = 0; only `owner=0` appears in
+   `city_building_completed`). The patch produced no material change a
+   research metric could reflect.
+3. The baseline `20260516_183534` is a *different commit*; comparing across
+   it conflates this patch with all intervening main-branch changes.
 
-Per p1-29d iteration discipline: any directional movement (P1 builds >0
-buildings; survival-turn or tier_peak rises) confirms the lever direction
-even if the full ≥7/10 gate does not land in one pass. If the gate is still
-0–2/10, this objective stays `partial` and the next iteration targets the
-production-scale handicap (candidate 3), not another priors tweak.
+So the apparent improvement is a measurement artifact of comparing against a
+stale-commit baseline. **The completion gate (a candidate validated by a
+before/after batch showing the metric move *because of the candidate*) is NOT
+met.**
+
+### What the fresh batch actually revealed (more valuable than the patch)
+
+Current-main P1 behaviour differs from the p1-29d baseline narrative:
+- P1 reaches `tier_peak` 2-5 by **pure research** (techs 9-35) — the old
+  "P1 stuck at tier_peak=1" symptom is **already gone on current main**.
+- P1 still loses its capital in 8/10 (eliminated T44-100) with
+  `kills=1-10`, `units_lost=1-4` — it fights but loses.
+- Survivor seeds 5/9 (T300/272, 1 city): `mil=0`, `buildings=0`,
+  `pop=17/33`, `techs=31-35` — P1 researches to era 5 but builds **nothing
+  material** for 250+ turns. Possible production stall worth its own
+  investigation (snapshot-timing artifact vs genuine stall — unconfirmed).
+
+### Why the patch did not demonstrably help
+
+The break-out gate `own_mil >= SOLE_CITY_ECON_MIN_DEFENDERS (2)` requires the
+sole-city AI to hold ≥2 standing non-founder units at decision time. P1's
+`mil` snapshot is 0 at every recorded turn in 10/10 seeds (it fights via
+very-transient units between snapshots). Whether the gate ever fired is
+unconfirmed (the engine emits no production-queue event to detect it from
+batch artifacts). Either way the patch completed 0 buildings, so it had no
+observable effect — and the `own_mil>=2` floor may be exactly wrong for the
+weakest player.
+
+### Honest status & next steps
+
+- **Gate: NOT MET.** No metric movement attributable to this patch.
+- The patch is gated to `sole_city_threatened` and fully unit-tested
+  (265 mc-ai tests green), so it is safe in-tree, but **unvalidated** — the
+  consuming p1-29c/29d worker should validate or revert it.
+- **Remaining attribution step (deferred on host load):** controlled
+  before/after on the *same* fresh build — HEAD vs HEAD+patch, same 10 seeds —
+  is the only clean way to attribute (or refute) any effect. Held while the
+  host runs ≥20 concurrent `godot-bin` (host guard); to run when load drops:
+  ```
+  # baseline = revert the two production.rs edits, rebuild, run; then re-apply
+  PARALLEL=3 bash tools/autoplay-batch.sh 10 300 .local/batches/p1_29e_before
+  ```
+- **Reframe for the next iteration:** the failure regime on current main is
+  *survival with no military* and a possible *production stall*, not the old
+  "military-spam, no economy". Re-baseline p1-29c/29d against current main
+  before further patch work; the `tier_peak=1` symptom they target may already
+  be resolved.
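
Review note: once both `p1_29e_before` and `p1_29e_after` exist, the HEAD vs HEAD+patch comparison described above could be scored per seed instead of against the stale baseline. The following is a minimal sketch only. The function name `paired_deltas`, the dict-of-dicts input shape, and the `buildings` field name are assumptions for illustration; the real batch artifact schema is not shown in this diff, so loading those files is left out.

```python
from statistics import median


def paired_deltas(before, after, metric):
    """Pair runs by seed and summarise how `metric` moved between two builds.

    `before` and `after` map seed -> {metric_name: number}. This shape is a
    hypothetical stand-in, not the actual autoplay batch output format.
    """
    # Only seeds present in both batches can be attributed to the patch.
    shared = sorted(set(before) & set(after))
    deltas = [after[s][metric] - before[s][metric] for s in shared]
    improved = sum(1 for d in deltas if d > 0)
    worsened = sum(1 for d in deltas if d < 0)
    return {
        "seeds": len(shared),
        "improved": improved,
        "worsened": worsened,
        "unchanged": len(shared) - improved - worsened,
        "median_delta": median(deltas) if deltas else 0,
    }


if __name__ == "__main__":
    # Made-up toy values, purely to show the call shape; not measured results.
    toy_before = {1: {"buildings": 0}, 2: {"buildings": 0}, 3: {"buildings": 1}}
    toy_after = {1: {"buildings": 2}, 2: {"buildings": 0}, 3: {"buildings": 0}}
    print(paired_deltas(toy_before, toy_after, "buildings"))
```

Pairing by seed on the same build keeps main-branch drift out of the comparison, so any movement in `improved` or `median_delta` can be attributed to the two `production.rs` edits alone.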