docs(objectives): 📝 Update survival objective scoring criteria and status documentation

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
autocommit 2026-06-03 02:17:30 -07:00
parent 55935afbd2
commit eebe37afae

@@ -184,6 +184,31 @@ In the two seeds where P0 ignores P1 (s5/s9: P1 alive at T286/T300, never attack
These mean the D1/D2/D3 columns may be partly measuring harness asymmetry (A), not trailing-AI quality. Before tuning P1 further, the design question is whether the gate should: (i) fix the production stall (B) and re-measure; (ii) symmetrize the harness (P0 = plain AI, or P1 also gets the helpers) so the matchup is fair; or (iii) redefine the trailing-AI gate to be measured in a same-controller context (cf. p1-29g's trained-vs-scripted framing). **Held for operator decision — no code changed.**
## Status (2026-06-03) — D1 re-scored at turn≤100; the stale `tp≤1` proxy is the binding constraint, not balance
Dispatch reaffirmed **D1 (convergence: P1 eliminated OR stalled before T100, 10/10)** as the canonical gate for the Pre-Wave-1 pacing requirement. Re-scored the intact `20260529_185955` batch (still on apricot; HEAD `7f2eef48c` has only moved on RL-export/comms paths since the `e22d78fa5` baseline, so the convergence regime is unchanged). New tool `tools/p1-convergence-lens.py` samples P1's state **at the last turn ≤100** instead of end-of-game, and reports two readings of "stalled":
| seed | endT | P1 elimT | @T100: cities / mil / kills / tp | D1-literal (tp≤1) | D1-nonfactor (mil=0,c≤1,kills=0) |
|---|---|---|---|---|---|
| 1 | 63 | 63 | dead | OK | OK |
| 2 | 44 | 44 | dead | OK | OK |
| 3 | 153 | 153 | 1 / 0 / **2** / 3 | NO | **NO** |
| 4 | 100 | 100 | dead | OK | OK |
| 5 | 300 | — | 1 / 0 / **0** / 5 | NO | **OK** |
| 6 | 78 | 78 | dead | OK | OK |
| 7 | 203 | 203 | 1 / 0 / **3** / 5 | NO | **NO** |
| 8 | 65 | 65 | dead | OK | OK |
| 9 | 286 | — | 1 / 0 / **0** / 3 | NO | **OK** |
| 10 | 56 | 56 | dead | OK | OK |
- **D1-literal: 6/10** (unchanged — confirms baseline holds on current main).
- **D1-nonfactor: 8/10.** The zombie tail **s5/s9** is alive at T100 with `mil=0, 1 city, kills=0`, which is genuinely inert by the objective's own D3 definition. They fail D1 **only** because research-drift inflated `tier_peak` (the `tp` column) to 5/3 (> 1); `tier_peak` is engine-auto research, not a policy/threat signal (p1-29e F1, p1-29g note). Recognizing functional stall fixes both seeds with **zero code**.
- **The 2 residual misses are s3/s7 only.** At T100 both are alive, `mil=0`, 1 city, but `kills=2/3`: P1 is *mid-siege, actively killing attackers*, and falls at T153/T203. These are genuine non-convergences under **either** reading. They are not zombies; they are a trailing AI legitimately resisting capture for 50-100 turns past T100.
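
The two readings of "stalled" can be checked against the table with a few lines of Python. This is a minimal sketch, not the code of `tools/p1-convergence-lens.py`: the `P1Sample` class and the `d1_literal` / `d1_nonfactor` functions are hypothetical names introduced here, and the per-seed values are transcribed by hand from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class P1Sample:
    """P1's state at the last turn <= 100 (hypothetical record type)."""
    seed: int
    alive: bool
    cities: int = 0
    mil: int = 0
    kills: int = 0
    tier_peak: int = 0  # the table's "tp" column


def d1_literal(s: P1Sample) -> bool:
    # Literal reading: eliminated by T100, or tier_peak <= 1.
    return (not s.alive) or s.tier_peak <= 1


def d1_nonfactor(s: P1Sample) -> bool:
    # Functional reading: eliminated by T100, or no military,
    # at most one city, and no kills.
    return (not s.alive) or (s.mil == 0 and s.cities <= 1 and s.kills == 0)


# Values transcribed from the table above; "dead" rows carry no stats.
DEAD_SEEDS = [1, 2, 4, 6, 8, 10]
SAMPLES = [P1Sample(seed=n, alive=False) for n in DEAD_SEEDS] + [
    P1Sample(3, True, cities=1, mil=0, kills=2, tier_peak=3),
    P1Sample(5, True, cities=1, mil=0, kills=0, tier_peak=5),
    P1Sample(7, True, cities=1, mil=0, kills=3, tier_peak=5),
    P1Sample(9, True, cities=1, mil=0, kills=0, tier_peak=3),
]

literal_pass = sorted(s.seed for s in SAMPLES if d1_literal(s))
nonfactor_pass = sorted(s.seed for s in SAMPLES if d1_nonfactor(s))

print(f"D1-literal:   {len(literal_pass)}/10 {literal_pass}")
print(f"D1-nonfactor: {len(nonfactor_pass)}/10 {nonfactor_pass}")
```

Under these assumptions the literal reading passes only the six eliminated seeds, and the non-factor reading additionally passes s5 and s9, leaving s3/s7 as the residual misses.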
**Implication.** The binding constraint on D1 is a **gate-semantics question, not a balance lever**: is "stalled" (a) the literal `tp≤1`, or (b) a functional non-factor (`mil=0 & cities≤1 & kills=0`)? Under (a) the gate is 6/10; under (b) it is 8/10. The only seeds needing actual engineering are s3/s7, and per **Finding A** (confirmed verbatim: `auto_play.gd` slot-0 is the juiced harness player) "converging" them faster means tuning the **non-shipping** P0 siege, the exact dead end the whole p1-29 family keeps hitting. Alternatively, s3/s7 satisfy a 4X-legitimacy reading ("trailing AI may legitimately resist ≥150 turns") already floated in the 2026-05-16 status.
**No code changed; no balance tuned** (D2 still 2/10, inside the "do not iterate" band). Decision forwarded to orchestrator: (Q1) which reading of "stalled" is canonical for D1: literal `tp≤1`, or functional non-factor at T100? (Q2) the 6/10 and 8/10 numbers are both measured on the juiced-P0 autoplay surface (Finding A); accept that surface, or re-measure clean via p1-29g's trained/symmetric mechanism before closing? Scorer: `tools/p1-convergence-lens.py <results_dir>`.
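
For readers without the scorer at hand, the "last turn ≤100" sampling described in this status can be sketched as follows. This is an illustration only: the real input format of `tools/p1-convergence-lens.py` is not shown in this document, so the per-turn snapshot dictionaries, the `sample_at_cutoff` function, and the toy log are all assumptions made here.

```python
from typing import Iterable, Optional


def sample_at_cutoff(snapshots: Iterable[dict], cutoff: int = 100) -> Optional[dict]:
    """Return the snapshot with the highest turn <= cutoff, or None if none exists.

    Each snapshot is assumed to be a dict with at least a "turn" key.
    """
    eligible = [s for s in snapshots if s["turn"] <= cutoff]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s["turn"])


# Toy log for illustration only; these are NOT values from the real batch.
toy_log = [
    {"turn": 98, "cities": 1, "mil": 1},
    {"turn": 100, "cities": 1, "mil": 0},
    {"turn": 150, "cities": 0, "mil": 0},  # end-of-game state, ignored by the cutoff
]

print(sample_at_cutoff(toy_log))
```

The point of sampling at the cutoff rather than at end-of-game is that a seed that falls at T150 is judged by what P1 looked like at T100, which is the window the D1 gate is defined over.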
## Why this exists separately from p1-29c
p1-29c's spec is "raise priority of Settle/Defend/Research when sole-city threatened." That work landed and is correct. The empirical failure mode is "P1 doesn't survive long enough to ACT on those priorities." That's a different code surface and a different design question — it deserves its own objective.