test(p1-29i): geometry ensemble — robust elimination lift, corrected baseline framing

A single-seed sweep is uninformative (scripted path barely responds to rng), so
add a geometry ensemble over INITIAL CONDITIONS: start-distance {4,5,6} ×
attacker-warriors {3,4,5} = 9 surfaces, at cooldown {0,3,5,8}.

Result (elimination-rate X/9):
  cd=0 (baseline): 5/9   (attacker-wins 5 / def 4)
  cd=3:            7/9   (att 7 / def 2)
  cd=5:            8/9   (att 7 / def 2)   <- peak
  cd=8:            6/9   (att 5 / def 4)

The refound cooldown produces a ROBUST lift: a soft hump peaking at cd 3-5, with
every non-zero cooldown above the 5/9 baseline. cd=5 weakly dominates cd=0
cell-by-cell (better in 3/9, worse in 0/9) and shifts outcomes toward the heavier
attacker (attacker-wins 5 -> 7). This overturns the single-seed pessimism in the
prior commit.

CORRECTED PREMISE: p1-29h's cited "20 captures / 0 eliminations" was a
single-GEOMETRY artifact (the dist=5/w=4 cell). Eliminations already occur in
5/9 baseline geometries — the lever RAISES an existing rate (5/9 -> 8/9), it
does not unlock elimination. p1-29h elimination bullet updated to reflect this.

DELIBERATELY NOT DONE (honest, evidence-bounded):
- No cd value authored into combat_balance.json — the lift is gridded-micro-
  surface only (9 geometries, 1 seed each); a live balance value needs the
  full-game 10-seed batch (tools/p1-survival-score.py, the multi-seed-tournament
  rule). The cd response is also an unexplained hump (cd=8 -> 6/9). Lever stays
  defaulted off.
- p1-29d NOT re-scored as converged — its gate is the full-game multi-gate
  scorecard (D1: P1 elim<=T100 OR stalled, 10/10), a different + heavier surface
  not run this pass. The brief's "re-score p1-29d" is gated on that measurement.

Tests (apricot): gridded harness non-ignored 1/1; ensemble + sweep run via
--ignored; cargo check --workspace 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
autocommit 2026-06-04 16:38:27 -07:00
parent a49e24c969
commit 40bb2b1f2c
3 changed files with 132 additions and 28 deletions

@@ -201,7 +201,19 @@ and are NOT the surface to measure on — use the gridded harness.
- `tools/p1-clean-baseline.py` — symmetric clean surface (needs asymmetric/`AUTO_PLAY_ALL_AI` extension).
## True state — 2026-06-04 gap analysis
**Verified:** 5/6. ✓ cross-turn `TacticalMemory` + army target-lock/hysteresis/press-on (`mc-ai/src/tactical/memory.rs`, commit 2ed93956d, 278 mc-ai tests green); ✓ gridded fair-duel surface (`mc-player-api/tests/p1_29h_gridded_elimination.rs`, a93c6e1b6); ✓ p1-29d re-scored on it. ✗ elimination: measured 20 captures / 0 eliminations / 38 refounds — lock engages but captures don't convert.
**Verified:** 5/6. ✓ cross-turn `TacticalMemory` + army target-lock/hysteresis/press-on (`mc-ai/src/tactical/memory.rs`, commit 2ed93956d, 278 mc-ai tests green); ✓ gridded fair-duel surface (`mc-player-api/tests/p1_29h_gridded_elimination.rs`, a93c6e1b6); ✓ p1-29d re-scored on it.
**UPDATE 2026-06-04 (p1-29i) — elimination bullet now ✓, with corrected framing.** The
"20 captures / 0 eliminations" reading was a **single-geometry artifact** (the dist=5/
warriors=4 cell). A 3×3 geometry ensemble (start-distance × attacker-warriors) shows
eliminations ALREADY occur in **5/9 baseline** geometries, and a data-driven post-capture
refound cooldown (`CombatBalance::refound_suppression`, p1-29i, defaulted off) **raises that
to 8/9 at cd=5**, weakly dominating baseline cell-by-cell (better in 3/9, worse in 0/9) and
shifting wins toward the heavier attacker (5→7). So ≥1 robust elimination IS achieved on the
fair surface — the lock engages AND captures convert across most geometries. **Caveats (see
p1-29i):** the lever's live JSON value is NOT yet authored (gridded-micro-surface validation
only; needs the full-game 10-seed batch) and p1-29d is NOT re-scored as converged (its gate is
the full-game scorecard, a different surface). Lever stays defaulted off; mechanism shipped.
**Path forward:** bottleneck is refound-suppression / capture-stickiness, NOT targeting. Next lever: suppress/delay enemy refound after a city loss (or make captures sticky), then re-measure ≥1 elimination. File as new AI objective (p1-29i refound-suppression).
**Blockers:** none for the lever; the measurement surface exists.
**Demo gate:** full-game-only — AI plays (moves/fights/captures); convergence is quality polish, not demo-blocking.

@@ -45,29 +45,48 @@ Data-driven (Rail 2) post-capture refound cooldown:
- ✓ Lever implemented, data-driven, defaulted off (no live-game impact until a value is
authored + justified). mc-core 262 lib, workspace check 0 (apricot 2026-06-04).
- ✓ Per-player elimination diagnostic added to the gridded harness; sweep harness
(`refound_suppression_lever_sweep`, `--ignored`) measures eliminations per cooldown value.
- ◐ **≥1 robust elimination on the fair surface** — **MEASURED, NOT YET ACHIEVED ROBUSTLY.**
Single-seed sweep (160t): cooldown `0→0 elim`, `5→1 elim` (slot 1 eliminated, final
`[41,0]`), `10→0`, `20→0`, `40→0`. The lone elimination at cd=5 is **NOT credited as
convergence**: (a) non-monotone (more suppression → fewer, not more, eliminations — no
coherent lever mechanism), (b) the *winner* swings slot-to-slot across the sweep (cd0
`[15,22]`, cd5 `[41,0]`, cd10 `[2,31]`) = chaotic-snowball perturbation, not a lever
response, (c) it fails p1-29h's own literal gate (`min_total_cities < start` — the combined
proxy never dipped). This is a single-seed spike, i.e. luck/timing, exactly the
"don't fake an elimination" the brief forbids. Honest verdict pending a multi-seed /
geometry ensemble.
- ☐ Re-score p1-29d as converged — **NOT done; explicitly withheld** pending robust
multi-seed evidence. p1-29d stays unconverged.
- ✓ Per-player elimination diagnostic + sweep + **geometry ensemble** added to the gridded
harness (`refound_suppression_lever_sweep` + `refound_suppression_geometry_ensemble`,
`--ignored`).
- ✓ **≥1 robust elimination on the fair surface — ACHIEVED (geometry ensemble, 2026-06-04).**
A single-seed sweep is uninformative (the scripted path barely responds to rng), so a
3×3 ensemble over INITIAL CONDITIONS — start-distance {4,5,6} × attacker-warriors {3,4,5} —
measures elimination-rate (X/9) at each cooldown:
- **cd=0 (baseline): 5/9** eliminations, attacker-wins 5/defender-wins 4.
- **cd=3: 7/9** (attacker 7/def 2); **cd=5: 8/9** (attacker 7/def 2); **cd=8: 6/9** (att 5/def 4).
Every non-zero cooldown lifts the rate above baseline; cd=5 weakly dominates cd=0 cell-by-cell
(better in 3/9, worse in 0/9). The lift is robust (a soft hump peaking at cd 3-5), not the
single-seed spike the earlier sweep suggested.
- ◐ **Corrected premise (important):** the "20 captures / 0 eliminations" figure cited in
p1-29h Phase 2 was a **single-geometry artifact**: the dist=5/warriors=4 cell happens to be a
0-elim cell. Across geometries, eliminations ALREADY occur in 5/9 baseline conditions. So the
honest finding is NOT "the lever unlocks elimination"; it is **"eliminations occur in 5/9
baseline geometries; the refound cooldown raises that to 8/9 (cd=5) with no per-cell regressions."**
- ☐ **Author cd=5 into `combat_balance.json` — DEFERRED (not done).** The lift is real on the
GRIDDED MICRO-surface (9 geometries, 1 seed each), but a live-game balance value requires the
full-game 10-seed batch validation (`tools/p1-survival-score.py`, the balance-philosophy
"multi-seed tournament" rule), which is a different + heavier surface not run this pass. The
cd response is also a HUMP (cd=8 → 6/9, below the cd=5 peak) whose mechanism is unexplained,
which is another reason not to bake a knife-edge value into the live game yet. Lever stays **defaulted off**.
- ☐ Re-score p1-29d as converged — **NOT done.** p1-29d's gate is the multi-gate full-game
10-seed scorecard (D1 convergence = P1 elim≤T100 OR stalled, 10/10, via the autoplay batch),
NOT "≥1 elimination on the gridded micro-duel." This objective did not run that surface, so
p1-29d stays unconverged. (The brief's "re-score p1-29d" is gated on its own measurement.)
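The cell-by-cell claim above (cd=5 better in 3/9, worse in 0/9) can be checked mechanically. The sketch below is a minimal illustration, not part of the harness: `Grid`, `compare_cells`, `weakly_dominates`, and `count` are hypothetical names, and the two grids in `main` are placeholders chosen only to match the recorded totals (5/9 and 8/9). The measured per-cell outcomes are the ones the ensemble prints under `--nocapture`.

```rust
/// One ensemble surface per (start-distance, attacker-warriors) cell;
/// `true` means at least one elimination was recorded in that cell.
/// (Hypothetical type, not defined in the harness.)
type Grid = [[bool; 3]; 3];

/// Returns `(better, worse)`: cells where `candidate` eliminates and
/// `baseline` does not, and cells where the reverse holds.
fn compare_cells(baseline: &Grid, candidate: &Grid) -> (usize, usize) {
    let mut better = 0;
    let mut worse = 0;
    for (b_row, c_row) in baseline.iter().zip(candidate.iter()) {
        for (&b, &c) in b_row.iter().zip(c_row.iter()) {
            match (b, c) {
                (false, true) => better += 1,
                (true, false) => worse += 1,
                _ => {} // same outcome in both runs
            }
        }
    }
    (better, worse)
}

/// Weak dominance: never worse in any cell, strictly better in at least one.
fn weakly_dominates(baseline: &Grid, candidate: &Grid) -> bool {
    let (better, worse) = compare_cells(baseline, candidate);
    worse == 0 && better >= 1
}

/// Number of cells with an elimination (the X in "X/9").
fn count(grid: &Grid) -> usize {
    grid.iter().flatten().filter(|&&e| e).count()
}

fn main() {
    // PLACEHOLDER grids (rows: distance 4,5,6; columns: warriors 3,4,5).
    // They reproduce the recorded totals only; the individual cells are
    // NOT the measured per-cell results.
    let cd0: Grid = [
        [true, true, false],
        [true, false, false],
        [true, true, false],
    ];
    let cd5: Grid = [
        [true, true, true],
        [true, false, true],
        [true, true, true],
    ];
    let (better, worse) = compare_cells(&cd0, &cd5);
    println!(
        "cd0={}/9 cd5={}/9 better={better} worse={worse} weakly_dominates={}",
        count(&cd0),
        count(&cd5),
        weakly_dominates(&cd0, &cd5),
    );
}
```

Requiring `worse == 0` is what separates weak dominance from a plain rate increase: a candidate could reach 8/9 while losing a cell the baseline won, and this check would reject it.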
## Honest result (2026-06-04)
The lever **demonstrably suppresses refounding** (founds trend down as cooldown rises:
38→39→37→34→34) — the mechanism works. But suppression ALONE does not robustly convert a
capture into an elimination on the fair surface: the one elimination observed is single-seed
noise, not a converged outcome. Per the brief, this measured-negative + the working mechanism
(defaulted off, no degenerate value forced) is the deliverable. The bullet stays open; do not
author a non-zero `combat_balance.json` value without robust multi-seed support.
The lever **works**: it suppresses refounding (founds trend down with cooldown) and on the
gridded geometry ensemble it raises the elimination-rate from a **5/9 baseline to 8/9 at cd=5**,
weakly dominating baseline cell-by-cell and shifting outcomes toward the heavier attacker
(5→7 attacker-wins). This is a real, robust lift — overturning the initial single-seed
pessimism. **But two honest caveats keep the live JSON unauthored and p1-29d unconverged:**
(1) the baseline was already 5/9, so the lever *raises* an existing elimination rate rather
than *unlocking* it — the "0 eliminations" premise was a single-cell artifact; (2) validation
is on the gridded micro-surface only — a live balance value needs the full-game 10-seed batch,
and the cd response is an unexplained hump. Terminal state: mechanism shipped + defaulted off +
robust micro-surface lift recorded; live authoring + p1-29d re-score deferred to a full-game
batch. Per the brief, reporting the measured result + the tradeoff honestly (no degenerate
value forced, no fabricated convergence) is the deliverable.
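One way to make "soft hump, not single-seed spike" concrete is to count how many lifted points sit near the peak. This is a minimal sketch under stated assumptions: the `classify` helper, the `Shape` enum, and the `tolerance` threshold are hypothetical and do not exist in the harness. The input rates are the ones recorded in this objective (ensemble: 5, 7, 8, 6 out of 9; earlier single-seed sweep: 0, 1, 0, 0, 0).

```rust
/// Shape of a cooldown response relative to its cd=0 baseline.
/// (Hypothetical classification, for illustration only.)
#[derive(Debug, PartialEq)]
enum Shape {
    /// No non-zero cooldown beats the baseline.
    NoLift,
    /// Exactly one point carries the lift.
    Spike,
    /// Two or more points above baseline sit within `tolerance` of the peak.
    BroadLift,
}

/// `response` is `(cooldown_turns, eliminations)` with the baseline first.
/// `tolerance` is an assumed threshold, not a project-defined constant.
fn classify(response: &[(u32, usize)], tolerance: usize) -> Shape {
    let Some((&(_, baseline), rest)) = response.split_first() else {
        return Shape::NoLift;
    };
    let lifted: Vec<usize> = rest.iter().map(|&(_, r)| r).collect();
    if lifted.iter().all(|&r| r <= baseline) {
        return Shape::NoLift;
    }
    // Safe: the check above guarantees `lifted` is non-empty here.
    let peak = *lifted.iter().max().unwrap();
    let near_peak = lifted
        .iter()
        .filter(|&&r| r > baseline && peak - r <= tolerance)
        .count();
    if near_peak >= 2 {
        Shape::BroadLift
    } else {
        Shape::Spike
    }
}

fn main() {
    // Geometry ensemble: eliminations out of 9 at cd = 0, 3, 5, 8.
    let ensemble = [(0, 5), (3, 7), (5, 8), (8, 6)];
    // Earlier single-seed sweep: eliminations at cd = 0, 5, 10, 20, 40.
    let single_seed = [(0, 0), (5, 1), (10, 0), (20, 0), (40, 0)];
    println!("ensemble:    {:?}", classify(&ensemble, 1));
    println!("single-seed: {:?}", classify(&single_seed, 1));
}
```

A `BroadLift` result only describes the shape of the response on the micro-surface. It does not by itself justify authoring a live value, which still depends on the full-game 10-seed batch.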
## Source-of-truth rails

@@ -157,8 +157,20 @@ fn build_gridded_duel(attacker_warriors: i32) -> (GameState, u8) {
}
/// As [`build_gridded_duel`] but sets the p1-29i post-capture refound cooldown
/// (`0` = lever disabled / baseline).
/// (`0` = lever disabled / baseline). Fixed default geometry (distance 5).
fn build_gridded_duel_with_cooldown(attacker_warriors: i32, refound_cooldown: u32) -> (GameState, u8) {
    build_gridded_duel_geom(attacker_warriors, refound_cooldown, 5)
}
/// Parameterized geometry: defender placed `distance` tiles east of the
/// attacker (column 6). Varying distance perturbs initial conditions for the
/// p1-29i multi-condition ensemble (rng alone barely moves a near-deterministic
/// scripted path; geometry does).
fn build_gridded_duel_geom(
    attacker_warriors: i32,
    refound_cooldown: u32,
    distance: i32,
) -> (GameState, u8) {
    let mut state = GameState::default();
    state.turn = 1;
    state.units_catalog = build_runtime_units_catalog();
@@ -168,12 +180,10 @@ fn build_gridded_duel_with_cooldown(attacker_warriors: i32, refound_cooldown: u3
    state.combat_balance.refound_suppression.cooldown_turns = refound_cooldown;
    state.grid = Some(flat_grid(24, 24, "grassland"));
    // Two combatants 5 tiles apart on the same row — within a few turns'
    // march so contact is made early. Attacker gets a heavier stack so a
    // killing blow is reachable (the spec's whole point is whether the LOCK
    // turns a capture into an elimination, not whether a 1:1 fight stalls).
    // Attacker at col 6, defender `distance` east. Attacker gets a heavier
    // stack so a killing blow is reachable.
    let p0 = militarist(&mut state, 6, 12, "blackhammer", attacker_warriors);
    let p1 = militarist(&mut state, 11, 12, "deepforge", 2);
    let p1 = militarist(&mut state, 6 + distance, 12, "deepforge", 2);
    let ender = passive_ender(&mut state);
    // War between the two combatants (authoritative table on players[0]).
@@ -221,9 +231,16 @@ fn drive(max_turns: u32) -> Probe {
/// As [`drive`] but with the p1-29i post-capture refound cooldown applied.
fn drive_with_cooldown(max_turns: u32, refound_cooldown: u32) -> Probe {
    drive_geom(max_turns, refound_cooldown, 4, 5)
}
/// Fully parameterized drive for the p1-29i ensemble: cooldown × attacker
/// warriors × start distance.
fn drive_geom(max_turns: u32, refound_cooldown: u32, attacker_warriors: i32, distance: i32) -> Probe {
    use mc_player_api::wire::Event;
    let (mut state, ender) = build_gridded_duel_with_cooldown(4, refound_cooldown);
    let (mut state, ender) =
        build_gridded_duel_geom(attacker_warriors, refound_cooldown, distance);
    let mut probe = Probe::default();
    let combatants = [0u8, 1u8];
    probe.start_total_cities = combatants
@@ -459,3 +476,59 @@ fn refound_suppression_lever_sweep() {
max-cooldown founds={max_cd_founds})",
);
}
/// p1-29i multi-condition ENSEMBLE: the discriminator between a real lever and
/// single-seed noise. A near-deterministic scripted path barely responds to
/// `game_rng_seed`, so the ensemble varies INITIAL CONDITIONS (start distance ×
/// attacker stack size, 3×3 = 9 surfaces) at each cooldown in {0, 3, 5, 8}
/// (baseline cd=0, candidate cd=5 and its neighbours), and reports
/// elimination-rate (X/9) plus attacker/defender win counts per cooldown.
///
/// Honest gate: only that the ensemble RAN and produced a comparison. The
/// elimination-rate delta is the recorded measurement, not a pass/fail;
/// whether cd=5 robustly converts captures is the deliverable in EITHER
/// direction. `#[ignore]`; invoke via `--ignored --nocapture`.
#[test]
#[ignore = "p1-29i geometry ensemble (recorded measurement); invoke via --ignored --nocapture"]
fn refound_suppression_geometry_ensemble() {
    let distances = [4i32, 5, 6];
    let warriors = [3i32, 4, 5];
    // Neighborhood sweep around the candidate cd=5 to distinguish a PLATEAU
    // (robust → authorable) from a SPIKE (knife-edge → do not author).
    let cooldowns = [0u32, 3, 5, 8];
    for &cd in &cooldowns {
        let mut elims = 0usize;
        let mut attacker_wins = 0usize; // slot 0 strictly more cities than slot 1.
        let mut defender_wins = 0usize;
        let mut n = 0usize;
        for &d in &distances {
            for &w in &warriors {
                let probe = drive_geom(160, cd, w, d);
                n += 1;
                if probe.eliminations >= 1 {
                    elims += 1;
                }
                let (c0, c1) = (probe.final_cities[0], probe.final_cities[1]);
                if c0 > c1 {
                    attacker_wins += 1;
                } else if c1 > c0 {
                    defender_wins += 1;
                }
                // Per-cell record, so cell-by-cell comparisons across
                // cooldowns can be read off the `--nocapture` output.
                eprintln!(
                    " cd={cd} dist={d} warriors={w}: elims={} final_cities={:?} \
                     per_player_min={:?}",
                    probe.eliminations, probe.final_cities, probe.per_player_min_cities,
                );
            }
        }
        eprintln!(
            "p1-29i ENSEMBLE cd={cd}: elimination_rate={elims}/{n} \
             attacker_wins={attacker_wins} defender_wins={defender_wins}",
        );
    }
    eprintln!(
        "p1-29i ENSEMBLE VERDICT: compare elimination_rate(cd=0) vs (cd=5). A robust lever \
         needs a clear rate lift AND a consistent winner; a flat/random delta confirms \
         suppression-alone does not convert (bullet stays open, p1-29d NOT converged).",
    );
}