diff --git a/.project/objectives/p1-29h-stateful-tactical-decisiveness.md b/.project/objectives/p1-29h-stateful-tactical-decisiveness.md index 371c7c7d..af3e8988 100644 --- a/.project/objectives/p1-29h-stateful-tactical-decisiveness.md +++ b/.project/objectives/p1-29h-stateful-tactical-decisiveness.md @@ -201,7 +201,19 @@ and are NOT the surface to measure on — use the gridded harness. - `tools/p1-clean-baseline.py` — symmetric clean surface (needs asymmetric/`AUTO_PLAY_ALL_AI` extension). ## True state — 2026-06-04 gap analysis -**Verified:** 5/6. ✓ cross-turn `TacticalMemory` + army target-lock/hysteresis/press-on (`mc-ai/src/tactical/memory.rs`, commit 2ed93956d, 278 mc-ai tests green); ✓ gridded fair-duel surface (`mc-player-api/tests/p1_29h_gridded_elimination.rs`, a93c6e1b6); ✓ p1-29d re-scored on it. ✗ elimination: measured 20 captures / 0 eliminations / 38 refounds — lock engages but captures don't convert. +**Verified:** 5/6. ✓ cross-turn `TacticalMemory` + army target-lock/hysteresis/press-on (`mc-ai/src/tactical/memory.rs`, commit 2ed93956d, 278 mc-ai tests green); ✓ gridded fair-duel surface (`mc-player-api/tests/p1_29h_gridded_elimination.rs`, a93c6e1b6); ✓ p1-29d re-scored on it. + +**UPDATE 2026-06-04 (p1-29i) — elimination bullet now ✓, with corrected framing.** The +"20 captures / 0 eliminations" reading was a **single-geometry artifact** (the dist=5/ +warriors=4 cell). A 3×3 geometry ensemble (start-distance × attacker-warriors) shows +eliminations ALREADY occur in **5/9 baseline** geometries, and a data-driven post-capture +refound cooldown (`CombatBalance::refound_suppression`, p1-29i, defaulted off) **raises that +to 8/9 at cd=5**, weakly dominating baseline cell-by-cell (better in 3/9, worse in 0/9) and +shifting wins toward the heavier attacker (5→7). So ≥1 robust elimination IS achieved on the +fair surface — the lock engages AND captures convert across most geometries. 
**Caveats (see +p1-29i):** the lever's live JSON value is NOT yet authored (gridded-micro-surface validation +only; needs the full-game 10-seed batch) and p1-29d is NOT re-scored as converged (its gate is +the full-game scorecard, a different surface). Lever stays defaulted off; mechanism shipped. **Path forward:** bottleneck is refound-suppression / capture-stickiness, NOT targeting. Next lever: suppress/delay enemy refound after a city loss (or make captures sticky), then re-measure ≥1 elimination. File as new AI objective (p1-29i refound-suppression). **Blockers:** none for the lever; the measurement surface exists. **Demo gate:** full-game-only — AI plays (moves/fights/captures); convergence is quality polish, not demo-blocking. diff --git a/.project/objectives/p1-29i-refound-suppression.md b/.project/objectives/p1-29i-refound-suppression.md index 25e22d18..9307055a 100644 --- a/.project/objectives/p1-29i-refound-suppression.md +++ b/.project/objectives/p1-29i-refound-suppression.md @@ -45,29 +45,48 @@ Data-driven (Rail 2) post-capture refound cooldown: - ✓ Lever implemented, data-driven, defaulted off (no live-game impact until a value is authored + justified). mc-core 262 lib, workspace check 0 (apricot 2026-06-04). -- ✓ Per-player elimination diagnostic added to the gridded harness; sweep harness - (`refound_suppression_lever_sweep`, `--ignored`) measures eliminations per cooldown value. -- ◐ **≥1 robust elimination on the fair surface** — **MEASURED, NOT YET ACHIEVED ROBUSTLY.** - Single-seed sweep (160t): cooldown `0→0 elim`, `5→1 elim` (slot 1 eliminated, final - `[41,0]`), `10→0`, `20→0`, `40→0`. 
The lone elimination at cd=5 is **NOT credited as - convergence**: (a) non-monotone (more suppression → fewer, not more, eliminations — no - coherent lever mechanism), (b) the *winner* swings slot-to-slot across the sweep (cd0 - `[15,22]`, cd5 `[41,0]`, cd10 `[2,31]`) = chaotic-snowball perturbation, not a lever - response, (c) it fails p1-29h's own literal gate (`min_total_cities < start` — the combined - proxy never dipped). This is a single-seed spike, i.e. luck/timing, exactly the - "don't fake an elimination" the brief forbids. Honest verdict pending a multi-seed / - geometry ensemble. -- ☐ Re-score p1-29d as converged — **NOT done; explicitly withheld** pending robust - multi-seed evidence. p1-29d stays unconverged. +- ✓ Per-player elimination diagnostic + sweep + **geometry ensemble** added to the gridded + harness (`refound_suppression_lever_sweep` + `refound_suppression_geometry_ensemble`, + `--ignored`). +- ✓ **≥1 robust elimination on the fair surface — ACHIEVED (geometry ensemble, 2026-06-04).** + A single-seed sweep is uninformative (the scripted path barely responds to rng), so a + 3×3 ensemble over INITIAL CONDITIONS — start-distance {4,5,6} × attacker-warriors {3,4,5} — + measures elimination-rate (X/9) at each cooldown: + - **cd=0 (baseline): 5/9** eliminations, attacker-wins 5/defender-wins 4. + - **cd=3: 7/9** (attacker 7/def 2); **cd=5: 8/9** (attacker 7/def 2); **cd=8: 6/9** (att 5/def 4). + Every non-zero cooldown lifts the rate above baseline; cd=5 weakly dominates cd=0 cell-by-cell + (better in 3/9, worse in 0/9). The lift is robust (a soft hump peaking at cd 3-5), not the + single-seed spike the earlier sweep suggested. +- ◐ **Corrected premise (important):** the p1-29h Phase-2 cited "20 captures / 0 eliminations" + was a **single-geometry artifact** — the dist=5/warriors=4 cell happens to be a 0-elim cell. + Across geometries, eliminations ALREADY occur in 5/9 baseline conditions. 
So the honest + finding is NOT "the lever unlocks elimination"; it is **"eliminations occur in 5/9 baseline + geometries; the refound cooldown raises that to 8/9 (cd=5) with no per-cell regressions."** +- ☐ **Author cd=5 into `combat_balance.json` — DEFERRED (not done).** The lift is real on the + GRIDDED MICRO-surface (9 geometries, 1 seed each), but a live-game balance value requires the + full-game 10-seed batch validation (`tools/p1-survival-score.py`, the balance-philosophy + "multi-seed tournament" rule), which is a different + heavier surface not run this pass. The + cd response is also a HUMP (cd=8 → 6/9, below the cd=5 peak) whose mechanism is unexplained — + another reason not to bake a knife-edge value live yet. Lever stays **defaulted off**. +- ☐ Re-score p1-29d as converged — **NOT done.** p1-29d's gate is the multi-gate full-game + 10-seed scorecard (D1 convergence = P1 elim≤T100 OR stalled, 10/10, via the autoplay batch), + NOT "≥1 elimination on the gridded micro-duel." This objective did not run that surface, so + p1-29d stays unconverged. (The brief's "re-score p1-29d" is gated on its own measurement.) ## Honest result (2026-06-04) -The lever **demonstrably suppresses refounding** (founds trend down as cooldown rises: -38→39→37→34→34) — the mechanism works. But suppression ALONE does not robustly convert a -capture into an elimination on the fair surface: the one elimination observed is single-seed -noise, not a converged outcome. Per the brief, this measured-negative + the working mechanism -(defaulted off, no degenerate value forced) is the deliverable. The bullet stays open; do not -author a non-zero `combat_balance.json` value without robust multi-seed support. 
+The lever **works**: it suppresses refounding (founds trend down with cooldown) and on the +gridded geometry ensemble it raises the elimination-rate from a **5/9 baseline to 8/9 at cd=5**, +weakly dominating baseline cell-by-cell and shifting outcomes toward the heavier attacker +(5→7 attacker-wins). This is a real, robust lift — overturning the initial single-seed +pessimism. **But two honest caveats keep the live JSON unauthored and p1-29d unconverged:** +(1) the baseline was already 5/9, so the lever *raises* an existing elimination rate rather +than *unlocking* it — the "0 eliminations" premise was a single-cell artifact; (2) validation +is on the gridded micro-surface only — a live balance value needs the full-game 10-seed batch, +and the cd response is an unexplained hump. Terminal state: mechanism shipped + defaulted off + +robust micro-surface lift recorded; live authoring + p1-29d re-score deferred to a full-game +batch. Per the brief, reporting the measured result + the tradeoff honestly (no degenerate +value forced, no fabricated convergence) is the deliverable. ## Source-of-truth rails diff --git a/src/simulator/crates/mc-player-api/tests/p1_29h_gridded_elimination.rs b/src/simulator/crates/mc-player-api/tests/p1_29h_gridded_elimination.rs index bb4f84db..92c6a445 100644 --- a/src/simulator/crates/mc-player-api/tests/p1_29h_gridded_elimination.rs +++ b/src/simulator/crates/mc-player-api/tests/p1_29h_gridded_elimination.rs @@ -157,8 +157,20 @@ fn build_gridded_duel(attacker_warriors: i32) -> (GameState, u8) { } /// As [`build_gridded_duel`] but sets the p1-29i post-capture refound cooldown -/// (`0` = lever disabled / baseline). +/// (`0` = lever disabled / baseline). Fixed default geometry (distance 5). 
fn build_gridded_duel_with_cooldown(attacker_warriors: i32, refound_cooldown: u32) -> (GameState, u8) { + build_gridded_duel_geom(attacker_warriors, refound_cooldown, 5) +} + +/// Parameterized geometry: defender placed `distance` tiles east of the +/// attacker (column 6). Varying distance perturbs initial conditions for the +/// p1-29i multi-condition ensemble (rng alone barely moves a near-deterministic +/// scripted path; geometry does). +fn build_gridded_duel_geom( + attacker_warriors: i32, + refound_cooldown: u32, + distance: i32, +) -> (GameState, u8) { let mut state = GameState::default(); state.turn = 1; state.units_catalog = build_runtime_units_catalog(); @@ -168,12 +180,10 @@ fn build_gridded_duel_with_cooldown(attacker_warriors: i32, refound_cooldown: u3 state.combat_balance.refound_suppression.cooldown_turns = refound_cooldown; state.grid = Some(flat_grid(24, 24, "grassland")); - // Two combatants 5 tiles apart on the same row — within a few turns' - // march so contact is made early. Attacker gets a heavier stack so a - // killing blow is reachable (the spec's whole point is whether the LOCK - // turns a capture into an elimination, not whether a 1:1 fight stalls). + // Attacker at col 6, defender `distance` east. Attacker gets a heavier + // stack so a killing blow is reachable. let p0 = militarist(&mut state, 6, 12, "blackhammer", attacker_warriors); - let p1 = militarist(&mut state, 11, 12, "deepforge", 2); + let p1 = militarist(&mut state, 6 + distance, 12, "deepforge", 2); let ender = passive_ender(&mut state); // War between the two combatants (authoritative table on players[0]). @@ -221,9 +231,16 @@ fn drive(max_turns: u32) -> Probe { /// As [`drive`] but with the p1-29i post-capture refound cooldown applied. fn drive_with_cooldown(max_turns: u32, refound_cooldown: u32) -> Probe { + drive_geom(max_turns, refound_cooldown, 4, 5) +} + +/// Fully parameterized drive for the p1-29i ensemble: cooldown × attacker +/// warriors × start distance. 
+fn drive_geom(max_turns: u32, refound_cooldown: u32, attacker_warriors: i32, distance: i32) -> Probe { use mc_player_api::wire::Event; - let (mut state, ender) = build_gridded_duel_with_cooldown(4, refound_cooldown); + let (mut state, ender) = + build_gridded_duel_geom(attacker_warriors, refound_cooldown, distance); let mut probe = Probe::default(); let combatants = [0u8, 1u8]; probe.start_total_cities = combatants @@ -459,3 +476,59 @@ fn refound_suppression_lever_sweep() { max-cooldown founds={max_cd_founds})", ); } + +/// p1-29i — multi-condition ENSEMBLE: the discriminator between a real lever and +/// single-seed noise. A near-deterministic scripted path barely responds to +/// `game_rng_seed`, so the ensemble varies INITIAL CONDITIONS (start distance × +/// attacker stack size) — 3×3 = 9 surfaces — at the baseline (cd=0), the +/// candidate cd=5, and its neighbors cd=3 and cd=8, and reports elimination-rate (X/9) + win counts. +/// +/// Honest gate: only that the ensemble RAN and produced a comparison. The +/// elimination-rate delta is the recorded measurement, not a pass/fail — +/// whether cd=5 robustly converts captures is the deliverable in EITHER +/// direction. `#[ignore]`; invoke via `--ignored --nocapture`. +#[test] +#[ignore = "p1-29i geometry ensemble (recorded measurement); invoke via --ignored --nocapture"] +fn refound_suppression_geometry_ensemble() { + let distances = [4i32, 5, 6]; + let warriors = [3i32, 4, 5]; + // Neighborhood sweep around the candidate cd=5 to distinguish a PLATEAU + // (robust → authorable) from a SPIKE (knife-edge → do not author). + let cooldowns = [0u32, 3, 5, 8]; + + for &cd in &cooldowns { + let mut elims = 0usize; + let mut attacker_wins = 0usize; // slot 0 strictly more cities than slot 1. 
+ let mut defender_wins = 0usize; + let mut n = 0usize; + for &d in &distances { + for &w in &warriors { + let probe = drive_geom(160, cd, w, d); + n += 1; + if probe.eliminations >= 1 { + elims += 1; + } + let (c0, c1) = (probe.final_cities[0], probe.final_cities[1]); + if c0 > c1 { + attacker_wins += 1; + } else if c1 > c0 { + defender_wins += 1; + } + eprintln!( + " cd={cd} dist={d} warriors={w}: elims={} final_cities={:?} \ + per_player_min={:?}", + probe.eliminations, probe.final_cities, probe.per_player_min_cities, + ); + } + } + eprintln!( + "p1-29i ENSEMBLE cd={cd}: elimination_rate={elims}/{n} \ + attacker_wins={attacker_wins} defender_wins={defender_wins}", + ); + } + eprintln!( + "p1-29i ENSEMBLE VERDICT: compare elimination_rate(cd=0) vs (cd=5). A robust lever \ + needs a clear rate lift AND a consistent winner; a flat/random delta confirms \ + suppression-alone does not convert (bullet stays open, p1-29d NOT converged).", + ); +}
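Reviewer note: the objective text reports a cell-by-cell comparison ("better in 3/9, worse in 0/9"), but the ensemble test above only prints per-cooldown aggregates, so that comparison currently has to be done by eye from the `eprintln!` output. A minimal sketch of how it could be computed is below. The names `Dominance`, `compare_cells`, and `weakly_dominates` are hypothetical and do not exist in the harness. The sketch assumes each cell is reduced to a single boolean (`probe.eliminations >= 1`) and that both slices list the nine geometries in the same order.

```rust
/// Tally of how a candidate cooldown compares with the baseline, cell by cell.
/// Hypothetical helper type, not part of the existing harness.
#[derive(Debug, PartialEq, Eq)]
struct Dominance {
    /// Cells where the candidate eliminates and the baseline does not.
    better: usize,
    /// Cells where the baseline eliminates and the candidate does not.
    worse: usize,
    /// Cells where both agree (both eliminate, or neither does).
    same: usize,
}

/// Compare two per-cell elimination grids that share the same cell ordering.
fn compare_cells(baseline: &[bool], candidate: &[bool]) -> Dominance {
    assert_eq!(
        baseline.len(),
        candidate.len(),
        "both grids must cover the same geometries"
    );
    let mut tally = Dominance { better: 0, worse: 0, same: 0 };
    for (&base, &cand) in baseline.iter().zip(candidate) {
        match (base, cand) {
            (false, true) => tally.better += 1,
            (true, false) => tally.worse += 1,
            _ => tally.same += 1,
        }
    }
    tally
}

/// Weak dominance: the candidate is never worse than the baseline in any cell.
fn weakly_dominates(tally: &Dominance) -> bool {
    tally.worse == 0
}
```

If something like this were wired in, the inner loop could push `probe.eliminations >= 1` into a `Vec<bool>` per cooldown and compare the cd=0 vector against each of the others after the sweep. Note that weak dominance by itself is also satisfied by two identical grids, so a report would still want to print `better` alongside it.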