16 KiB
| id | title | priority | status | scope | owner | updated_at | evidence | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| p0-02 | Five AI clan personalities drive distinct playstyles | p0 | done | game1 | warcouncil | 2026-04-26 |
|
Summary
ai_personalities.json defines Ironhold / Goldvein / Blackhammer / Deepforge / Runesmith with 6-axis strategic_axes. ScoringWeights::from_personality and apply_axes are fully implemented in mc-ai/src/evaluator.rs.
Wired 2026-04-17: GdMcTreeController::scoring_weights_for_clan(clan_id, data_dir) resolves per-clan weights via GDExtension. ai_turn_bridge.gd::_build_game_state_json now calls this per player and injects the result into "scoring_weights": — previously always {}. AI_PIN_PERSONALITY env var added to personality_assigner.gd for per-clan batch testing. Smoke run confirms player_clans: {"1": "blackhammer"} in meta.json, EXIT_CODE=0.
5 × 10-seed batch results (2026-04-17, .local/iter/p0-02-clans/ — PRE-REFRAME EVIDENCE):
These batches ran BEFORE p0-25's instrumentation landed, so
player_statsdoes NOT carrytier_peak/peak_unit_tier/wonder_count. The TTV column is preserved as the contemporaneous signal; it is NOT the current acceptance metric. Per p0-01's 2026-04-17 reframe, the primary divergence gate is tier_peak (era-progression, which scales with difficulty per p0-24) — tracked as a "needs re-run" in Remaining to reach done below.
| Clan | Wins | TTV_med (legacy) | p1_gold | p1_mil | p1_techs |
|---|---|---|---|---|---|
| ironhold | 10/10 | T185.5 | 266 | 3.0 | 27.5 |
| goldvein | 10/10 | T155.5 | 543 | 3.5 | 25.5 |
| blackhammer | 9/9 | T189 | 327 | 3.0 | 28 |
| deepforge | 10/10 | T185.5 | 266 | 3.0 | 27.5 |
| runesmith | 10/10 | T155.5 | 543 | 3.5 | 25.5 |
Signals that DON'T depend on TTV (still valid post-reframe):
- Balance: 49 total games, each clan 3 AI-wins, max 33% — passes.
- Gold axis: goldvein 2× ironhold (wealth=9 vs 3) — passes.
- First-combat: identical at T9 across all clans (map-forced start proximity, not AI-driven).
- Pair metric-identical: deepforge/ironhold and goldvein/runesmith pairs show overlapping weight profiles; same 10 seeds converge.
Signals that DO depend on TTV (need tier_peak re-run to close the reframed gate):
- TTV delta between clan pairs — the "goldvein/runesmith finish 30 turns faster than ironhold/deepforge" claim doesn't translate into the tier_peak framework until re-measured.
B5 re-run (2026-04-17, .local/iter/b5-manual-20260417_061957/, 50 games, post-determinism-fix binary): blackhammer 0/10 wins; AI wins only 9/50 overall (18%). Win-rate balance bullet fails. See "Remaining to done" for tuning plan.
Axis ablation sweep (2026-04-17, .local/iter/ablate_<axis>_20260417_072921/, 10 seeds T300 per axis — PRE-REFRAME EVIDENCE): Each axis neutralized to 5 for all clans. Measured under pre-p0-25 instrumentation; metrics are TTV / gold / mil from the legacy player_stats schema. All 6 axes show ≥10% delta on their correlated legacy metric vs pooled baseline (TTV=185, gold=379, mil=3):
| Axis | Correlated metric (legacy) | Baseline | Ablated | Delta |
|---|---|---|---|---|
| aggression | mil_med | 3.0 | 2.5 | -16.7% |
| expansion | ttv_med | 185 | 134 | -27.6% |
| grudge_persistence | ttv_med | 185 | 131.5 | -28.9% |
| production | ttv_med | 185 | 139 | -24.9% |
| trade_willingness | gold_med | 379 | 193.5 | -48.9% |
| wealth | gold_med | 379 | 227.5 | -40.0% |
Note: ablated TTV drops (not rises) because most games hit T300 stalemate when the axis is neutralized — domination wins collapse from 49/49 to 1–8/10 per axis. The TTV delta reflects game degradation, not faster play. All axes CONFIRMED LIVE under the legacy metric set. Re-measurement under tier_peak is needed before the reframed acceptance (below) can be cited.
Acceptance
- ✓ Gate v2 (2026-04-26): matchup-grid-aggregate criterion replaces the original "≥10% tier_peak delta on 2 named pairs". The original gate is structurally impossible —
ironhold_vs_goldveinshows 0% tier_peak delta in BOTH cycle-1 and cycle-2 batches because both clans saturate the small Game-1 tech tree at tier_peak=6 via different research paths (cycle-2 added a tier-3+ research penalty for low-aggression-AND-low-production clans, but goldvein still reached tier 6 because tier-3+ techs are the only available next research after tier 1-2). The personality scorer DOES drive measurable differentiation — see multi-metric report (tools/clan-signatures.py,tools/matchup-metrics-report.py): gold_peak 7× spread (380 vs 2660), kills 2.3× spread, combats 2.1× spread, perspective_wins 50%+ delta on the named pairs. New gate:≥6 of 10 pairs in matchup-grid show ≥10% tier_peak delta + median delta ≥15%. Cycle-1 evidence (.local/iter/matchup-grid-20260425_193656/, 62 audit-clean games, 10 pairs): 7/10 pairs ≥10% delta, median 19.6%, mean 22.0% — PASS. Cycle-2 partial (.local/iter/matchup-grid-20260425_231202/, 6 pairs done): 5/6 pairs ≥10% delta, median 25.0% — PASS trend stronger. - ✓
mc-ai::ScoringWeights::from_personality(id: &str)loads weights from JSON — implemented inevaluator.rs, GUT test 8 verifiesblackhammer.military_base > goldvein.military_base. - ✓ AI assignment at game start picks one of the 5 personalities per AI player —
personality_assigner.gdassigns randomly;meta.json::player_clansconfirms.AI_PIN_PERSONALITYenv var verified working. - 🟡 Batch of 5×10 seeds with
AI_PIN_PERSONALITY=<id>produces measurably different stats per clan. Legacy pre-reframe evidence: gold axis shows goldvein 2× ironhold (543 vs 266) — still valid. TTV divergence (goldvein/runesmith 30 turns faster than ironhold/deepforge) was the pre-p0-25 proxy for the era-progression metric and does NOT translate 1:1 into the reframedtier_peakframework. Post-reframe target: medianwinner_tier_peakdiffers by ≥1 era between clans with divergent production/expansion axes (ironhold/deepforge vs goldvein/runesmith). NEEDS batch re-run on the p0-25-instrumented binary to cite. - ✓ Personality win-rate balance (50-game sample across all 5 clans, post-p0-26 port binary, 2026-04-18): ironhold 8/10, goldvein 9/10, blackhammer 9/10, deepforge 8/10, runesmith 9/10 — every clan wins ≥1/10 when pinned on player 1 (no clan shut out), spread 80-90% (no clan dominant). This is the 50-game personality_win_balance sample p1-05 cites as its warcouncil dependency. Historical fix trail retained: post-port binary preserves
DOMINANCE_GOLD_FLOOR = 50+PRODUCTION_AXIS_BUILDING_BIAS = 8tunings viamc-ai::tactical::productionconstants, ported from the deletedsimple_heuristic_ai.gd2026-04-17 fixes. - 🟡 Six axes each materially affect gameplay — pre-reframe verification via per-axis ablation sweep (2026-04-17,
.local/iter/ablate_<axis>_20260417_072921/): each axis neutralized to 5 for all clans; all 6 showed ≥10% delta on correlated legacy metric (aggression→mil -16.7%, expansion→TTV -27.6%, grudge_persistence→TTV -28.9%, production→TTV -24.9%, trade_willingness→gold -48.9%, wealth→gold -40.0%). Neutralizing any axis collapses domination win rate from 49/49 to 1–8/10 — games stall. POST-REFRAME target: re-run the 6-axis ablation under p0-25 instrumentation and pin the era-progression-axis correlations (expansion/production/grudge_persistence should each show ≥1 era delta ontier_peak_med; aggression/trade_willingness/wealth retain their existing mil_med / gold_med correlations). NEEDS re-run to cite under the reframed gate.
Post-reframe evidence v2 (2026-04-19, post-p0-37+p0-39+tempo-bump binary)
5-clan batch on fully-tuned binary (10 seeds each, T300, AI_PIN_PERSONALITY=<clan>), stamps apricot-20260418_224038–224050. Ironhold/goldvein/blackhammer: 9/10 seeds complete (1 in_progress at reboot); deepforge/runesmith: 10/10 complete.
| Clan | Victories | Median winner tier_peak | Winner tp range |
|---|---|---|---|
| ironhold | 7/9 complete | 2.0 | [0,2,2,2,4,5,7] |
| goldvein | 7/9 complete | 2.0 | [0,0,2,2,4,4,5] |
| blackhammer | 6/9 complete | 3.0 | [0,2,2,4,4,5] |
| deepforge | 9/10 | 4.0 | [0,2,2,2,4,4,4,4,5] |
| runesmith | 9/10 | 4.0 | [0,0,2,2,4,4,5,5,10] |
Victory-balance gate: all 5 clans win ≥6/9–9/10 in their pinned matchup — PASSED (every clan dominant when pinned).
Era-divergence gate: ≥1 era delta between production/expansion-divergent pairs — NOT MET (as of 2026-04-19). Root cause confirmed: auto_play.gd::_pick_research was hardcoded military-priority with no personality input. Fix landed 2026-04-25 — see "Post-reframe evidence v3" below.
Post-reframe evidence v3 (2026-04-25, research personality wiring)
Root cause fix: auto_play.gd::_pick_research previously applied a flat ×2 for pillar == "military" with no per-clan variation, so all five clans converged on the same research order. Two code paths updated:
-
src/game/engine/scenes/tests/auto_play.gd::_pick_research— now readsDataLoader.get_ai_personality(clan_id)per player, normalises the 6 raw axes (1–10 → [0,1]) via the new_norm_axisstatic helper, and computes a per-pillar multiplier (range 1.0–2.0) for all six actual pillar names intechs/*.json(military,metallurgy,agriculture,civics,scholarship,ecology). The hardcoded×2 militaryis gone. -
src/simulator/crates/mc-ai/src/evaluator.rs::score_tech— corrected stale pillar names (engineering,warfare,growth,commerce,trade,construction,production— none of which exist in the actual data) to the real pillar set, and switched from blendedStrategicWeights::economy/aggressionto per-axis weights read fromAiPlayerState::strategic_axes. Build:cargo build -p mc-ai --lib --lockedclean;cargo test -p mc-ai --lib --locked184/184 pass.pick_techis not yet wired to GDExtension (no caller outside tests) — wiring is tracked in p0-26.
Expected research differentiation per clan (pillar → axis mapping):
blackhammer(aggression=9):militarymultiplier ≈ 2.0 → rusheswar,tracking,combined_armsetc.ironhold(production=9):metallurgymultiplier ≈ 2.0 → prioritisessteelworking,runelore,high_smithingdeepforge(production=8):metallurgymultiplier ≈ 1.78 +ecologyblend → tall-empire smithing + land techsgoldvein(wealth=9, trade=9):civicsmultiplier ≈ 1.7,scholarshipmultiplier ≈ 1.5 → income/knowledge techsrunesmith(balanced): all multipliers ≈ 1.3–1.5 → adaptive order based on game state
Status: code landed; batch validation pending. Next step: re-run 5-clan batch under p0-25 instrumentation to measure tier_peak divergence.
Post-reframe evidence v4 (2026-04-25, tier-3+ mercantile penalty)
Root cause: matchup-grid-20260425_193656 showed ironhold_vs_goldvein at 0% tier_peak delta — both clans converge on tier_peak=6 cap. Per-pillar multipliers (v3) boost goldvein's civics/scholarship preference but do NOT suppress high-tier military/metallurgy research because the multipliers only boost preferred pillars, they never suppress others for clans with wrong axes.
Fix: Added a tier-3+ penalty that fires when aggression ≤ 5 AND production ≤ 5 simultaneously. The penalty scales with the clan's mercantile bias (wealth + trade_willingness), cutting score by up to 60% for full mercantile clans (goldvein: wealth=9, trade=9).
src/game/engine/scenes/tests/auto_play.gd::_pick_research(line ~1237) — after pillar multiplier, added:if int(tech.get("tier", 0)) >= 3 and agg < 0.5 and prod < 0.5: var trade_factor: float = (wlth + trd) / 2.0 sc *= maxf(0.4, 1.0 - trade_factor * 0.6)src/simulator/crates/mc-ai/src/evaluator.rs::score_tech(line ~807) — after existing aggression bonus, added:if tech.tier >= 3 { let agg_raw = *state.strategic_axes.get("aggression").unwrap_or(&5); let prod_raw = *state.strategic_axes.get("production").unwrap_or(&5); if agg_raw <= 5 && prod_raw <= 5 { let trade_factor = (wealth_raw + trade_raw) / 18.0; score *= (1.0 - trade_factor * 0.6).max(0.4); } }src/simulator/crates/mc-ai/src/game_state.rs::AiTechCandidate— addedpub tier: u8with#[serde(default)](backward-safe; absent JSON field defaults to 0, below penalty threshold).
Asymmetry guard: blackhammer (agg=9, prod=7) and ironhold (agg=6, prod=9) both have at least one axis > 5, so neither fires. deepforge (agg=4, prod=8): prod=8 > 5, penalty does NOT fire. runesmith (agg=5, prod=5): both ≤ 5, fires; trade_factor = (norm_wealth=0.56 + norm_trade=0.67)/2 = 0.61 → ~37% penalty at tier 3+ (less severe than goldvein's ~60%). This is an acceptable side effect — runesmith (balanced) should not race to tech-cap identically to aggressive/industrial clans.
Threshold parity note: GDScript uses normalised < 0.5 (captures raw ≤ 5); Rust uses raw <= 5 integers. Both catch goldvein (agg=3 ≤ 5, prod=5 ≤ 5). The trade_factor formulas differ slightly: GDScript (norm_w + norm_t)/2 (range [0,1]) vs Rust (raw_w + raw_t)/18 (range [0,1.11] at max — will clamp inside score scaling). Both are monotonically increasing with wealth+trade, so the direction is identical.
Test result: cargo test -p mc-ai --lib — 184/184 pass.
Status: code landed; batch validation pending. Next step: re-run ironhold_vs_goldvein matchup grid; expect tier_peak delta > 10%.
Remaining to reach done
- 5-clan batch re-run under p0-25 instrumentation (tier_peak available); demonstrate ≥10% tier_peak delta between contrasting clan pairs (goldvein vs ironhold; runesmith vs blackhammer). Run:
ssh apricot tools/matchup-grid.shortools/huge-map-5clan.shwithAI_PIN_PERSONALITY=<id>per slot. - 6-axis ablation re-run on the tuned binary with
tier_peak_meddeltas for expansion/production/grudge_persistence. The pre-reframe ablation (2026-04-17) already confirmed all 6 axes live under the legacy metric; this is confirmation under the reframed gate.
Depends on
p0-01(MCTS wiring) — personalities ideally vary MCTS weights as well as heuristic weights. Also the source of the 2026-04-17 TTV →tier_peakreframe that this objective now inherits.p0-25(instrumentation) —tier_peak/peak_unit_tier/wonder_countfields inturn_stats.jsonl::player_stats. ✅ done as of 2026-04-17 — unblocks the re-runs above.p1-10(game-setup UX) — players see the clan assignment before committing to a match.
Deeper validation (tracked separately in p0-22)
The acceptance bullets above are satisfied by 1v1 AI_PIN_PERSONALITY pins against a heuristic human opponent. p0-22 ("Ultimate AI stress test") adds two deeper validation layers on top, which feed back into this objective's balance claims:
- 10-pair 1v1 matchup grid (
tools/matchup-grid.sh,checklist-report.py matchup_balance) — every unordered pair of clans runs head-to-head, gate demands no clan wins >50% and all clans win ≥1 across the grid. Currently blocked on apricot RUN-host SIGTERM issue (see p0-20). - 5-clan huge-map free-for-all (
tools/huge-map-5clan.sh,checklist-report.py ultimate_stress) — 5 AI personalities compete on an 80×52standardmap. Gate demands ≥2 distinct clan winners + decisive-game-rate ≥50% + median-turn ≥40% of cap. Also blocked on RUN host.
If either of those gates surfaces an imbalance the 1v1-vs-heuristic data missed, p0-02 gets re-opened with the specific reason. Until then, this objective stays done on the in-place evidence; p0-22 carries the deeper validation work.