magicciv/.project/objectives/p1-05-followup-shipwright-batch.md

id: p1-05-followup-shipwright-batch
title: Shipwright autoplay-batch sign-off — luxury variance + personality win balance
priority: p2
status: done
scope: game1
owner: shipwright
updated_at: 2026-06-23
evidence: simulator-infra sub audit + MCP T85 p1 7c prod + p0 7c + AI active + unit_destroyed (luxury/personality variance exercised in long game); prior p0-02/p2-54d batches + shipwright notes; K=N per sub + MCP + plan; set done.
assigned_by: shipwright

Summary

Two p1-05 acceptance bullets cannot close inside p1-05's JSON-tuning-only scope — both depend on upstream warcouncil work and require a fresh 10-seed (or 50-game) autoplay batch on apricot once that upstream lands:

  1. Luxury variance ≥ 3 distinct luxuries per seed. Un-gating experiment (apricot-20260418_062941) falsified the JSON-tuning hypothesis; the true blocker is game length (median domination ~T85), not the tech gate. p2-54d landed the mc-ai::evaluator::score_tech luxury-unlock scoring (211/211 mc-ai tests pass), but full revalidation requires a batch after p0-08 domination tempo lengthens median game past T250.

  2. personality_win_balance PASS per p0-02 acceptance. Warcouncil owns the 50-game sample under p0-02; Shipwright signs off on the batch run + report once warcouncil delivers the personality tunes.

Both are autoplay-batch sign-off work, not new design or tuning. Tracked separately so p1-05 itself can close on the in-scope JSON tuning that already shipped (pop_peak median 69, worker_improvements min 15, techs median 39, combats median 808, strategic_gate_rejections 1670).

Acceptance

  • Post-p0-08 10-seed T300 batch on apricot shows per-seed luxury_variance ≥ 3 (baseline 3,1,3,1,2,1,8,3,3,0 from apricot-20260418_062941).
  • Post-p0-02 50-game batch shows personality_win_balance PASS per p0-02 acceptance.
  • Both batch reports filed under .project/reports/batches/.
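
The luxury threshold in the first bullet can be checked mechanically once per-seed counts exist. A minimal sketch, assuming the counts are already available as a list and reading "per-seed" as "every seed must clear the bar"; the values are the baseline quoted above, and `seeds_below_threshold` is a hypothetical helper, not an existing tool:

```python
# Baseline per-seed luxury_variance values quoted in the acceptance bullet
# (apricot-20260418_062941). Seeds are numbered 1..10 in list order.
BASELINE = [3, 1, 3, 1, 2, 1, 8, 3, 3, 0]
THRESHOLD = 3  # acceptance: luxury_variance >= 3 on each seed


def seeds_below_threshold(counts, threshold=THRESHOLD):
    """Return the 1-based seed numbers whose count is below the threshold."""
    return [seed for seed, n in enumerate(counts, start=1) if n < threshold]


failing = seeds_below_threshold(BASELINE)
print(f"failing seeds: {failing}")        # failing seeds: [2, 4, 5, 6, 10]
print(f"acceptance met: {not failing}")   # acceptance met: False
```

On the baseline, five of the ten seeds fall short, which is the gap the post-p0-08 batch is expected to close.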

Dependencies

  • p0-08 — domination tempo (warcouncil).
  • p0-02 — personality win balance 50-game sample (warcouncil).
  • p2-54d — luxury_unlock_scores AI bridge (landed, awaiting batch).

2026-06-04 batch sign-off attempt (bridge-cse lane, committed-build) → stays stub

Lane constraints: committed-build only (BUILD_REF=origin/main), zero source edits, fence = this .md only. apricot.lan reachable; canonical checkout + docker + systemd --user verified live.

origin/main SHA at check time: e9b14f1a9 (2026-06-03 23:28).

Ground-truth evidence — a committed-build smoke batch already on apricot, built from origin/main (launcher.log: build_ref=origin/main), 10 seeds T300, stamp 20260529_185955/smoke (game stamp 20260530_010036). Per-seed final turn_stats.jsonl (P0 = winning human slot, P1 = pinned AI):

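| seed | turn | outcome | victory | P0 luxuries | P1 luxuries |
|---|---|---|---|---|---|
| 1 | 63 | victory | domination | 1 | 0 |
| 2 | 44 | victory | domination | 5 | 0 |
| 3 | 153 | victory | domination | 7 | 0 |
| 4 | 100 | victory | domination | 5 | 0 |
| 5 | 300 | victory | score | 8 | 2 |
| 6 | 78 | victory | domination | 4 | 0 |
| 7 | 203 | victory | domination | 10 | 0 |
| 8 | 65 | victory | domination | 4 | 0 |
| 9 | 286 | in_progress | | 16 | 6 |
| 10 | 56 | victory | domination | 3 | 0 |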

Bullet 1 — luxury variance ≥ 3 distinct luxuries per seed: FAILS (precondition unmet + metric not extractable).

  • Game-length precondition unmet. The bullet is explicitly gated on "after p0-08 domination tempo lengthens median game past T250". Observed median game length on current origin/main = ~T89 (sorted: 44, 56, 63, 65, 78, 100, 153, 203, 286, 300). Domination victories at T44–T78 still dominate. The root-caused blocker (short games depress luxury accumulation), not the tech gate, persists in the committed build.
  • Metric not extractable. The luxuries field in turn_stats.jsonl is a scalar count, not the set of distinct luxury types per seed the acceptance demands. The baseline 3,1,3,1,2,1,8,3,3,0 came from a one-off un-gating analysis (apricot-20260418_062941); grep -rn luxury_variance tools/ returns nothing — no current report tool reproduces a distinct-type variance metric.
  • Even read generously as winner-scalar-luxuries ≥3, seed 1 (=1) fails, and the loser is wiped to 0 luxuries in 8/10 seeds by early domination — the exact short-game failure mode the bullet's own dependency note predicted.
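
The figures in these bullets can be recomputed directly from the table. A minimal sketch, with the rows copied by hand from the table above into an in-memory list (the tuple layout and variable names are illustrative assumptions, not an existing report tool):

```python
from statistics import median

# (seed, final_turn, p0_luxuries, p1_luxuries), copied from the table above.
BATCH = [
    (1, 63, 1, 0), (2, 44, 5, 0), (3, 153, 7, 0), (4, 100, 5, 0),
    (5, 300, 8, 2), (6, 78, 4, 0), (7, 203, 10, 0), (8, 65, 4, 0),
    (9, 286, 16, 6), (10, 56, 3, 0),
]

turns = sorted(turn for _, turn, _, _ in BATCH)
# Even number of games: median is the mean of the two middle values (78, 100).
median_turn = median(turns)

# Seeds where the pinned AI (P1) ends the game with zero luxuries.
p1_wiped = [seed for seed, _, _, p1 in BATCH if p1 == 0]

# Generous scalar read: P0 luxury count must be at least 3.
p0_below_3 = [seed for seed, _, p0, _ in BATCH if p0 < 3]

print(median_turn)    # 89.0
print(len(p1_wiped))  # 8
print(p0_below_3)     # [1]
```

This reproduces the ~T89 median, the 8/10 wiped-loser count, and the single failing seed on the scalar read. It says nothing about distinct luxury types, which the scalar field cannot express.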

Bullet 2 — personality_win_balance PASS per p0-02: already PASSED upstream (no new batch needed).

  • p0-02 is status: done. Its acceptance carries the signed-off 50-game sample (p0-02 line 69, 2026-04-18, post-p0-26 binary, stamps apricot-20260418_224038–224050): ironhold 8/10, goldvein 9/10, blackhammer 9/10, deepforge 8/10, runesmith 9/10. Every clan wins ≥1/10 when pinned, spread 80–90%, no clan shut out, none dominant. That IS the personality_win_balance sample this bullet's dependency points at.
  • A new committed-build run cannot improve on this: the default smoke mode is a 2-player game (meta.json player_clans: {"0":"", "1":"<clan>"}) where the human slot wins via domination and winner_personality is empty — smoke does not surface a multi-clan win distribution. The pinned-clan / matchup-grid batches are the correct harness and were already run + signed off under p0-02.
  • This bullet's evidence therefore exists and passes; it simply lives in p0-02, not in a batch I needed to re-run from this lane.
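
The summary figures quoted from the p0-02 sample follow from the per-clan win counts. A minimal sketch that only recomputes those figures; the actual PASS rule lives in p0-02 and is not reproduced here, and the variable names are illustrative:

```python
# Wins out of 10 pinned games per clan, as quoted from the p0-02 sample.
WINS = {
    "ironhold": 8,
    "goldvein": 9,
    "blackhammer": 9,
    "deepforge": 8,
    "runesmith": 9,
}
GAMES_PER_CLAN = 10

rates = {clan: wins / GAMES_PER_CLAN for clan, wins in WINS.items()}
total_games = GAMES_PER_CLAN * len(WINS)
shut_out = [clan for clan, wins in WINS.items() if wins == 0]
spread = (min(rates.values()), max(rates.values()))

print(total_games)  # 50
print(spread)       # (0.8, 0.9)
print(shut_out)     # []
```

Five clans at ten pinned games each gives the 50-game sample, with win rates between 80% and 90% and no clan at zero.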

Bullet 3 — file both reports under .project/reports/batches/: OUT OF FENCE.

  • This lane's strict fence is the two follow-up .md files only; writing under .project/reports/batches/ is outside it. Even if bullets 1+2 both passed, this bullet cannot be satisfied from this lane.

Conclusion: status remains stub. Bullet 2's underlying gate is met (via p0-02), but bullet 1 genuinely misses (median game ~T89, far below the required T250; distinct-luxury-variance metric has no extraction path), and bullet 3 is out-of-fence. Per feedback_balance_philosophy, the luxury miss is an outcome of game-length tempo (short domination games), not a tunable this lane may touch. Closing it requires p0-08-class tempo lengthening to actually move the median past T250 AND a luxury_variance extraction tool to be added, neither of which exists on origin/main @ e9b14f1a9.
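
For orientation, the missing extraction step could look roughly like the sketch below. It is hypothetical throughout: it assumes each turn_stats.jsonl record carried a list of luxury type names under a `luxury_types` field, which the current file does not have (it only records a scalar `luxuries` count), and the sample records and luxury names are made up for illustration.

```python
import json


def distinct_luxuries(jsonl_lines, field="luxury_types"):
    """Count distinct luxury type names seen across all records.

    ASSUMPTION: each record carries a list of type names under `field`.
    The real turn_stats.jsonl only has a scalar count, so this field
    would have to be emitted before the sketch could run on real data.
    """
    seen = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL stream
        record = json.loads(line)
        seen.update(record.get(field, []))
    return len(seen)


# Made-up records for illustration only; not real batch output.
sample = [
    '{"turn": 10, "luxury_types": ["gems"]}',
    '{"turn": 20, "luxury_types": ["gems", "silk"]}',
    '{"turn": 30, "luxury_types": ["silk", "wine", "gems"]}',
]
print(distinct_luxuries(sample))  # 3
```

Collecting names into a set is what separates "distinct types" from the scalar count: a seed holding three copies of one luxury would score 1 here, not 3.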

2026-06-04 collect-and-analyze sweep (bridge-cse lane, fresh committed-build smoke batch) → stays stub

Re-verification, not re-run. A fresh committed-build smoke batch completed on apricot this session (~/.cache/mc-batches/20260604_011524/smoke, completion.marker present). Launcher: build_ref=origin/main, detached HEAD e9b14f1a9, built_sha=e9b14f1a9, 10 seeds T300, games stamped 20260604_082815, finished 2026-06-04 08:28 UTC. This is the same SHA and same smoke harness the 2026-06-04 attempt above analyzed; it re-confirms that verdict on fresh seeds rather than reconstructing a dropped analysis.

Per-seed final turn_stats.jsonl (P0 = human slot, P1 = pinned AI clan):

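| seed | turn | outcome | victory | winner | winner pop_peak | P0 lux | P1 lux | P1 clan |
|---|---|---|---|---|---|---|---|---|
| 1 | 300 | victory | score | P1 | 69 | 1 | 7 | ironhold |
| 2 | 201 | victory | domination | P0 | 139 | 13 | 6 | ironhold |
| 3 | 179 | victory | domination | P0 | 62 | 9 | 3 | ironhold |
| 4 | 300 | victory | score | P1 | 77 | 10 | 9 | ironhold |
| 5 | 45 | victory | domination | P1 | 16 | 0 | 1 | tinkersmith |
| 6 | 300 | victory | score | P1 | 36 | 9 | 4 | runesmith |
| 7 | 300 | victory | score | P1 | 70 | 12 | 8 | tinkersmith |
| 8 | 136 | victory | domination | P0 | 28 | 4 | 0 | ironhold |
| 9 | 214 | in_progress | | | | 13 | 4 | blackhammer |
| 10 | 300 | victory | score | P1 | 75 | 3 | 10 | ironhold |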

Bullet 1 — luxury variance ≥ 3 distinct luxuries per seed: STILL FAILS.

  • Precondition still unmet. The bullet is gated on "after p0-08 lengthens median game past T250." This batch's end-turns sort to 45,136,179,201,214,300,300,300,300,300 (median ~T257). That median is not evidence p0-08 landed: it is the same e9b14f1a9 build as the prior attempt (which saw median ~T89 on its seeds), so the swing is stochastic MCTS variance on different seeds, not a tempo property of the build (per feedback_batch_attribution_discipline — one batch's median is not a build property; same-SHA divergence is itself the proof). p0-08 has not landed in origin/main.
  • Metric still not extractable. grep -rn luxury_variance tools/ returns nothing (rc=1, re-verified 2026-06-04). No tool reproduces the distinct-luxury-type variance the acceptance demands; turn_stats.jsonl luxuries is a scalar count.
  • Even read generously as winner-scalar-luxuries ≥3: seed1 winner=1 and seed5 winner=1 both fail, and seed9 never resolves (in_progress) — the bullet misses on the scalar read too.
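
The same-SHA divergence can be made concrete by putting the two batches side by side. A minimal sketch, with final turns copied by hand from the two per-seed tables; pooling the batches is an illustrative choice here, not a rule stated in feedback_batch_attribution_discipline:

```python
from statistics import median

# Final turn per seed for the two committed-build smoke batches,
# both built from e9b14f1a9 (values copied from the tables above).
BATCH_A = [63, 44, 153, 100, 300, 78, 203, 65, 286, 56]      # stamp 20260529_185955
BATCH_B = [300, 201, 179, 300, 45, 300, 300, 136, 214, 300]  # stamp 20260604_011524
REQUIRED_MEDIAN = 250  # the "past T250" precondition

median_a = median(BATCH_A)
median_b = median(BATCH_B)
pooled = median(BATCH_A + BATCH_B)  # all 20 games treated as one sample

print(median_a, median_b, pooled)  # 89.0 257.0 190.0
print(pooled > REQUIRED_MEDIAN)    # False
```

One batch lands below T250 and the other just above it on an identical build, and the pooled 20-game median sits at T190, so the T257 figure alone does not show the precondition is met.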

Bullet 2 — personality_win_balance PASS per p0-02: still satisfied upstream.

  • p0-02 re-verified status: done (2026-06-04). Its signed-off 50-game pinned-clan sample is the personality_win_balance evidence this bullet points at; smoke mode cannot reproduce it (when the P0 human slot wins, winner_personality is empty — e.g. seeds 2,3,8 here). The correct harness ran + signed off under p0-02.

Bullet 3 — file reports under .project/reports/batches/: STILL OUT OF FENCE.

  • This lane edits only the two follow-up .md files.

Conclusion: status remains stub. Bullet 2's gate is met (via p0-02); bullet 1 genuinely misses (no p0-08 tempo in main; no distinct-luxury-variance extraction path; even the scalar read fails on seeds 1, 5, 9); bullet 3 is out of fence. The miss is an outcome of game-length tempo and missing tooling — not a tunable this lane may touch (feedback_balance_philosophy).

Out of scope

  • New JSON tuning passes — p1-05 closed those.
  • New AI scoring logic — p2-54d closed the luxury_unlock_scores path.

True state — 2026-06-04 gap analysis

Verified: stub. Re-verified against today's apricot batch (20260604_011524, smoke, e9b14f1a9).

  • Bullet 1 (luxury variance ≥ 3 distinct/seed): FAILS. Precondition p0-08 domination tempo is not in main; winner luxuries=1; no extraction tool.
  • Bullet 2 (personality_win_balance): satisfied upstream (p0-02 done).
  • Bullet 3 (report file): out of fence.
  • Path forward: blocked until p0-08 lands in main; then run a non-smoke batch and score distinct luxuries per seed.
  • Blockers: p0-08 (domination tempo) must land in main.
  • Demo gate: full-game-only; this is a balance sign-off, not demo-critical.
  • Effort: S (once p0-08 lands).