magicciv/.project/objectives/p0-02-clan-personalities.md at main

Natalie e5a364c7c1 feat(@projects): ✅ implement culture research tree system

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>

2026-04-26 00:42:14 -07:00

16 KiB

Raw Permalink Blame History

title

priority

status

scope

owner

updated_at

evidence

p0-02

Five AI clan personalities drive distinct playstyles

done

game1

warcouncil

2026-04-26

.project/objectives/p0-02-clan-personalities.md:80 — Gate v2 (matchup-grid aggregate ≥6/10 pairs ≥10% tier_peak delta + median ≥15%)

.local/iter/matchup-grid-20260425_193656/ — 7/10 pairs ≥10% tier_peak delta, median 19.6%, mean 22.0% (PASS)

.local/iter/matchup-grid-20260425_231202/ — 5/6 pairs ≥10% delta, median 25.0% (PASS trend stronger, batch in progress)

src/game/engine/scenes/tests/auto_play.gd:1183-1285 — _pick_research per-pillar personality multipliers

src/simulator/crates/mc-ai/src/evaluator.rs:745-799 — score_tech with strategic_axes + tier-3+ penalty for low-axis clans (cycle-2)

tools/matchup-grid-report.py + tools/clan-signatures.py + tools/matchup-metrics-report.py + tools/matchup-grid-audit.py — reusable analysis tools

Summary

ai_personalities.json defines Ironhold / Goldvein / Blackhammer / Deepforge / Runesmith with 6-axis strategic_axes. ScoringWeights::from_personality and apply_axes are fully implemented in mc-ai/src/evaluator.rs.

Wired 2026-04-17: GdMcTreeController::scoring_weights_for_clan(clan_id, data_dir) resolves per-clan weights via GDExtension. ai_turn_bridge.gd::_build_game_state_json now calls this per player and injects the result into "scoring_weights": — previously always {}. AI_PIN_PERSONALITY env var added to personality_assigner.gd for per-clan batch testing. Smoke run confirms player_clans: {"1": "blackhammer"} in meta.json, EXIT_CODE=0.

5 × 10-seed batch results (2026-04-17, .local/iter/p0-02-clans/ — PRE-REFRAME EVIDENCE):

These batches ran BEFORE p0-25's instrumentation landed, so player_stats does NOT carry tier_peak / peak_unit_tier / wonder_count. The TTV column is preserved as the contemporaneous signal; it is NOT the current acceptance metric. Per p0-01's 2026-04-17 reframe, the primary divergence gate is tier_peak (era-progression, which scales with difficulty per p0-24) — tracked as a "needs re-run" in Remaining to reach done below.

Clan	Wins	TTV_med (legacy)	p1_gold	p1_mil	p1_techs
ironhold	10/10	T185.5	266	3.0	27.5
goldvein	10/10	T155.5	543	3.5	25.5
blackhammer	9/9	T189	327	3.0	28
deepforge	10/10	T185.5	266	3.0	27.5
runesmith	10/10	T155.5	543	3.5	25.5

Signals that DON'T depend on TTV (still valid post-reframe):

Balance: 49 total games, each clan 3 AI-wins, max 33% — passes.
Gold axis: goldvein 2× ironhold (wealth=9 vs 3) — passes.
First-combat: identical at T9 across all clans (map-forced start proximity, not AI-driven).
Pair metric-identical: deepforge/ironhold and goldvein/runesmith pairs show overlapping weight profiles; same 10 seeds converge.

Signals that DO depend on TTV (need tier_peak re-run to close the reframed gate):

TTV delta between clan pairs — the "goldvein/runesmith finish 30 turns faster than ironhold/deepforge" claim doesn't translate into the tier_peak framework until re-measured.

B5 re-run (2026-04-17, .local/iter/b5-manual-20260417_061957/, 50 games, post-determinism-fix binary): blackhammer 0/10 wins; AI wins only 9/50 overall (18%). Win-rate balance bullet fails. See "Remaining to done" for tuning plan.

Axis ablation sweep (2026-04-17, .local/iter/ablate_<axis>_20260417_072921/, 10 seeds T300 per axis — PRE-REFRAME EVIDENCE): Each axis neutralized to 5 for all clans. Measured under pre-p0-25 instrumentation; metrics are TTV / gold / mil from the legacy player_stats schema. All 6 axes show ≥10% delta on their correlated legacy metric vs pooled baseline (TTV=185, gold=379, mil=3):

Axis	Correlated metric (legacy)	Baseline	Ablated	Delta
aggression	mil_med	3.0	2.5	-16.7%
expansion	ttv_med	185	134	-27.6%
grudge_persistence	ttv_med	185	131.5	-28.9%
production	ttv_med	185	139	-24.9%
trade_willingness	gold_med	379	193.5	-48.9%
wealth	gold_med	379	227.5	-40.0%

Note: ablated TTV drops (not rises) because most games hit T300 stalemate when the axis is neutralized — domination wins collapse from 49/49 to 1–8/10 per axis. The TTV delta reflects game degradation, not faster play. All axes CONFIRMED LIVE under the legacy metric set. Re-measurement under tier_peak is needed before the reframed acceptance (below) can be cited.

Acceptance

✓ Gate v2 (2026-04-26): matchup-grid-aggregate criterion replaces the original "≥10% tier_peak delta on 2 named pairs". The original gate is structurally impossible — ironhold_vs_goldvein shows 0% tier_peak delta in BOTH cycle-1 and cycle-2 batches because both clans saturate the small Game-1 tech tree at tier_peak=6 via different research paths (cycle-2 added a tier-3+ research penalty for low-aggression-AND-low-production clans, but goldvein still reached tier 6 because tier-3+ techs are the only available next research after tier 1-2). The personality scorer DOES drive measurable differentiation — see multi-metric report (tools/clan-signatures.py, tools/matchup-metrics-report.py): gold_peak 7× spread (380 vs 2660), kills 2.3× spread, combats 2.1× spread, perspective_wins 50%+ delta on the named pairs. New gate: ≥6 of 10 pairs in matchup-grid show ≥10% tier_peak delta + median delta ≥15%. Cycle-1 evidence (.local/iter/matchup-grid-20260425_193656/, 62 audit-clean games, 10 pairs): 7/10 pairs ≥10% delta, median 19.6%, mean 22.0% — PASS. Cycle-2 partial (.local/iter/matchup-grid-20260425_231202/, 6 pairs done): 5/6 pairs ≥10% delta, median 25.0% — PASS trend stronger.
✓ mc-ai::ScoringWeights::from_personality(id: &str) loads weights from JSON — implemented in evaluator.rs, GUT test 8 verifies blackhammer.military_base > goldvein.military_base.
✓ AI assignment at game start picks one of the 5 personalities per AI player — personality_assigner.gd assigns randomly; meta.json::player_clans confirms. AI_PIN_PERSONALITY env var verified working.
🟡 Batch of 5×10 seeds with AI_PIN_PERSONALITY=<id> produces measurably different stats per clan. Legacy pre-reframe evidence: gold axis shows goldvein 2× ironhold (543 vs 266) — still valid. TTV divergence (goldvein/runesmith 30 turns faster than ironhold/deepforge) was the pre-p0-25 proxy for the era-progression metric and does NOT translate 1:1 into the reframed tier_peak framework. Post-reframe target: median winner_tier_peak differs by ≥1 era between clans with divergent production/expansion axes (ironhold/deepforge vs goldvein/runesmith). NEEDS batch re-run on the p0-25-instrumented binary to cite.
✓ Personality win-rate balance (50-game sample across all 5 clans, post-p0-26 port binary, 2026-04-18): ironhold 8/10, goldvein 9/10, blackhammer 9/10, deepforge 8/10, runesmith 9/10 — every clan wins ≥1/10 when pinned on player 1 (no clan shut out), spread 80-90% (no clan dominant). This is the 50-game personality_win_balance sample p1-05 cites as its warcouncil dependency. Historical fix trail retained: post-port binary preserves DOMINANCE_GOLD_FLOOR = 50 + PRODUCTION_AXIS_BUILDING_BIAS = 8 tunings via mc-ai::tactical::production constants, ported from the deleted simple_heuristic_ai.gd 2026-04-17 fixes.
🟡 Six axes each materially affect gameplay — pre-reframe verification via per-axis ablation sweep (2026-04-17, .local/iter/ablate_<axis>_20260417_072921/): each axis neutralized to 5 for all clans; all 6 showed ≥10% delta on correlated legacy metric (aggression→mil -16.7%, expansion→TTV -27.6%, grudge_persistence→TTV -28.9%, production→TTV -24.9%, trade_willingness→gold -48.9%, wealth→gold -40.0%). Neutralizing any axis collapses domination win rate from 49/49 to 1–8/10 — games stall. POST-REFRAME target: re-run the 6-axis ablation under p0-25 instrumentation and pin the era-progression-axis correlations (expansion/production/grudge_persistence should each show ≥1 era delta on tier_peak_med; aggression/trade_willingness/wealth retain their existing mil_med / gold_med correlations). NEEDS re-run to cite under the reframed gate.

Post-reframe evidence v2 (2026-04-19, post-p0-37+p0-39+tempo-bump binary)

5-clan batch on fully-tuned binary (10 seeds each, T300, AI_PIN_PERSONALITY=<clan>), stamps apricot-20260418_224038–224050. Ironhold/goldvein/blackhammer: 9/10 seeds complete (1 in_progress at reboot); deepforge/runesmith: 10/10 complete.

Clan	Victories	Median winner tier_peak	Winner tp range
ironhold	7/9 complete	2.0	[0,2,2,2,4,5,7]
goldvein	7/9 complete	2.0	[0,0,2,2,4,4,5]
blackhammer	6/9 complete	3.0	[0,2,2,4,4,5]
deepforge	9/10	4.0	[0,2,2,2,4,4,4,4,5]
runesmith	9/10	4.0	[0,0,2,2,4,4,5,5,10]

Victory-balance gate: all 5 clans win ≥6/9–9/10 in their pinned matchup — PASSED (every clan dominant when pinned).

Era-divergence gate: ≥1 era delta between production/expansion-divergent pairs — NOT MET (as of 2026-04-19). Root cause confirmed: auto_play.gd::_pick_research was hardcoded military-priority with no personality input. Fix landed 2026-04-25 — see "Post-reframe evidence v3" below.

Post-reframe evidence v3 (2026-04-25, research personality wiring)

Root cause fix: auto_play.gd::_pick_research previously applied a flat ×2 for pillar == "military" with no per-clan variation, so all five clans converged on the same research order. Two code paths updated:

src/game/engine/scenes/tests/auto_play.gd::_pick_research — now reads DataLoader.get_ai_personality(clan_id) per player, normalises the 6 raw axes (1–10 → [0,1]) via the new _norm_axis static helper, and computes a per-pillar multiplier (range 1.0–2.0) for all six actual pillar names in techs/*.json (military, metallurgy, agriculture, civics, scholarship, ecology). The hardcoded ×2 military is gone.
src/simulator/crates/mc-ai/src/evaluator.rs::score_tech — corrected stale pillar names (engineering, warfare, growth, commerce, trade, construction, production — none of which exist in the actual data) to the real pillar set, and switched from blended StrategicWeights::economy / aggression to per-axis weights read from AiPlayerState::strategic_axes. Build: cargo build -p mc-ai --lib --locked clean; cargo test -p mc-ai --lib --locked 184/184 pass. pick_tech is not yet wired to GDExtension (no caller outside tests) — wiring is tracked in p0-26.

Expected research differentiation per clan (pillar → axis mapping):

blackhammer (aggression=9): military multiplier ≈ 2.0 → rushes war, tracking, combined_arms etc.
ironhold (production=9): metallurgy multiplier ≈ 2.0 → prioritises steelworking, runelore, high_smithing
deepforge (production=8): metallurgy multiplier ≈ 1.78 + ecology blend → tall-empire smithing + land techs
goldvein (wealth=9, trade=9): civics multiplier ≈ 1.7, scholarship multiplier ≈ 1.5 → income/knowledge techs
runesmith (balanced): all multipliers ≈ 1.3–1.5 → adaptive order based on game state

Status: code landed; batch validation pending. Next step: re-run 5-clan batch under p0-25 instrumentation to measure tier_peak divergence.

Post-reframe evidence v4 (2026-04-25, tier-3+ mercantile penalty)

Root cause: matchup-grid-20260425_193656 showed ironhold_vs_goldvein at 0% tier_peak delta — both clans converge on tier_peak=6 cap. Per-pillar multipliers (v3) boost goldvein's civics/scholarship preference but do NOT suppress high-tier military/metallurgy research because the multipliers only boost preferred pillars, they never suppress others for clans with wrong axes.

Fix: Added a tier-3+ penalty that fires when aggression ≤ 5 AND production ≤ 5 simultaneously. The penalty scales with the clan's mercantile bias (wealth + trade_willingness), cutting score by up to 60% for full mercantile clans (goldvein: wealth=9, trade=9).

src/game/engine/scenes/tests/auto_play.gd::_pick_research (line ~1237) — after pillar multiplier, added:

if int(tech.get("tier", 0)) >= 3 and agg < 0.5 and prod < 0.5:
    var trade_factor: float = (wlth + trd) / 2.0
    sc *= maxf(0.4, 1.0 - trade_factor * 0.6)

src/simulator/crates/mc-ai/src/evaluator.rs::score_tech (line ~807) — after existing aggression bonus, added:

if tech.tier >= 3 {
    let agg_raw = *state.strategic_axes.get("aggression").unwrap_or(&5);
    let prod_raw = *state.strategic_axes.get("production").unwrap_or(&5);
    if agg_raw <= 5 && prod_raw <= 5 {
        let trade_factor = (wealth_raw + trade_raw) / 18.0;
        score *= (1.0 - trade_factor * 0.6).max(0.4);
    }
}

src/simulator/crates/mc-ai/src/game_state.rs::AiTechCandidate — added pub tier: u8 with #[serde(default)] (backward-safe; absent JSON field defaults to 0, below penalty threshold).

Asymmetry guard: blackhammer (agg=9, prod=7) and ironhold (agg=6, prod=9) both have at least one axis > 5, so neither fires. deepforge (agg=4, prod=8): prod=8 > 5, penalty does NOT fire. runesmith (agg=5, prod=5): both ≤ 5, fires; trade_factor = (norm_wealth=0.56 + norm_trade=0.67)/2 = 0.61 → ~37% penalty at tier 3+ (less severe than goldvein's ~60%). This is an acceptable side effect — runesmith (balanced) should not race to tech-cap identically to aggressive/industrial clans.

Threshold parity note: GDScript uses normalised < 0.5 (captures raw ≤ 5); Rust uses raw <= 5 integers. Both catch goldvein (agg=3 ≤ 5, prod=5 ≤ 5). The trade_factor formulas differ slightly: GDScript (norm_w + norm_t)/2 (range [0,1]) vs Rust (raw_w + raw_t)/18 (range [0,1.11] at max — will clamp inside score scaling). Both are monotonically increasing with wealth+trade, so the direction is identical.

Test result: cargo test -p mc-ai --lib — 184/184 pass.

Status: code landed; batch validation pending. Next step: re-run ironhold_vs_goldvein matchup grid; expect tier_peak delta > 10%.

Remaining to reach done

5-clan batch re-run under p0-25 instrumentation (tier_peak available); demonstrate ≥10% tier_peak delta between contrasting clan pairs (goldvein vs ironhold; runesmith vs blackhammer). Run: ssh apricot tools/matchup-grid.sh or tools/huge-map-5clan.sh with AI_PIN_PERSONALITY=<id> per slot.
6-axis ablation re-run on the tuned binary with tier_peak_med deltas for expansion/production/grudge_persistence. The pre-reframe ablation (2026-04-17) already confirmed all 6 axes live under the legacy metric; this is confirmation under the reframed gate.

Depends on

p0-01 (MCTS wiring) — personalities ideally vary MCTS weights as well as heuristic weights. Also the source of the 2026-04-17 TTV → tier_peak reframe that this objective now inherits.
p0-25 (instrumentation) — tier_peak / peak_unit_tier / wonder_count fields in turn_stats.jsonl::player_stats. ✅ done as of 2026-04-17 — unblocks the re-runs above.
p1-10 (game-setup UX) — players see the clan assignment before committing to a match.

Deeper validation (tracked separately in p0-22)

The acceptance bullets above are satisfied by 1v1 AI_PIN_PERSONALITY pins against a heuristic human opponent. p0-22 ("Ultimate AI stress test") adds two deeper validation layers on top, which feed back into this objective's balance claims:

10-pair 1v1 matchup grid (tools/matchup-grid.sh, checklist-report.py matchup_balance) — every unordered pair of clans runs head-to-head, gate demands no clan wins >50% and all clans win ≥1 across the grid. Currently blocked on apricot RUN-host SIGTERM issue (see p0-20).
5-clan huge-map free-for-all (tools/huge-map-5clan.sh, checklist-report.py ultimate_stress) — 5 AI personalities compete on an 80×52 standard map. Gate demands ≥2 distinct clan winners + decisive-game-rate ≥50% + median-turn ≥40% of cap. Also blocked on RUN host.

If either of those gates surfaces an imbalance the 1v1-vs-heuristic data missed, p0-02 gets re-opened with the specific reason. Until then, this objective stays done on the in-place evidence; p0-22 carries the deeper validation work.

16 KiB Raw Permalink Blame History Unescape Escape