fix(@projects/@magic-civilization): 🐛 update clan personality win-rate balance test results

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-04-17 07:40:00 -07:00 · 2026-04-17 07:40:00 -07:00 · 4b78c8764c
commit 4b78c8764c
parent 875deebe92
1 changed files with 8 additions and 1 deletions
--- a/.project/objectives/p0-02-clan-personalities.md
+++ b/.project/objectives/p0-02-clan-personalities.md
@ -14,6 +14,7 @@ evidence:
  - src/game/engine/src/modules/ai/personality_assigner.gd
  - src/game/engine/tests/unit/ai/test_ai_turn_bridge_mcts.gd
  - .local/iter/p0-02-clans/
+  - .local/iter/b5-manual-20260417_061957/
 ---

 ## Summary
@ -39,9 +40,15 @@ Balance: 49 total games, each clan 3 AI-wins, max 33% — passes. Gold axis: gol
 - ✓ `mc-ai::ScoringWeights::from_personality(id: &str)` loads weights from JSON — implemented in `evaluator.rs`, GUT test 8 verifies `blackhammer.military_base > goldvein.military_base`.
 - ✓ AI assignment at game start picks one of the 5 personalities per AI player — `personality_assigner.gd` assigns randomly; `meta.json::player_clans` confirms. `AI_PIN_PERSONALITY` env var verified working.
 - ✓ Batch of 5×10 seeds with `AI_PIN_PERSONALITY=<id>` produces measurably different stats per clan — gold axis: goldvein 2× ironhold (543 vs 266); TTV: goldvein/runesmith finish 30 turns faster than ironhold/deepforge. Combat frequency at T9 for all clans (map-forced start proximity, not personality-driven).
- ✓ **Personality win-rate balance**: 49 games, each clan wins at least 3 times, no clan >50% win rate (max 33%). Sample ≥50 games: 49 = borderline; blackhammer had 9 games due to 1 missing seed. Treating as met.
+- ✗ **Personality win-rate balance**: FAILED on the 2026-04-17 B5 re-run (`.local/iter/b5-manual-20260417_061957/`, 50 games, post-determinism-fix binary, ai-verify). Verdict: `blackhammer has 10 appearances but 0 wins (threshold: >= 5)`. Per-clan win rates: deepforge 40% (4/10), ironhold 30% (3/10), goldvein 10% (1/10), runesmith 10% (1/10), blackhammer 0% (0/10). The "no clan >50%" clause passes (max 40%); the "clans with ≥5 appearances must have ≥1 win" clause fails on blackhammer. The prior `.local/iter/p0-02-clans/` batch (pre-determinism-fix, parallel agent) reported a 49-game result with 3 wins per clan, which does not reproduce under the fixed binary. The AI wins only 9/50 games overall (18%); aggressive clans underperform builder/isolationist profiles.
 - ✗ **Six axes each materially affect gameplay** — not empirically verified. Deepforge (production=8) and ironhold (production=9) produce identical batch metrics across same 10 seeds, suggesting adjacent production values don't produce measurable divergence at this sample size. Axis removal counterfactual test not run.

+## Remaining work
+
+- Tune blackhammer's evaluator weights so it wins ≥1 game in a 10-seed sample. The current aggression axis appears to underperform defensive/builder profiles. Investigate whether `military_base` is being dominated by `food_base` / `production_base` in the value function, or whether the tactical executor (`simple_heuristic_ai.gd`) fails to capitalize on early military production.
+- Broader game-balance review of why AI wins only 18% of 50 games vs 82% for the heuristic human. See parallel agent's `.local/batches/ablate_aggression_20260417_072921/` for in-progress aggression ablation data.
+- Empirically verify that each axis materially affects gameplay (the 5th bullet above, marked ✗). Recommended approach: per-axis ablation (e.g. `AI_DISABLE_AXIS=production`) → re-run the 10-seed batch → show a win-rate shift. Stretch goal noted at design time.
+
 ## Depends on

 - `p0-01` (MCTS wiring) — personalities ideally vary MCTS weights as well as heuristic weights.
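
The two win-rate balance clauses cited in the diff above can be sketched as a small check. This is a minimal illustration, not the project's actual ai-verify code: the function name `check_balance`, the constants, and the data layout are hypothetical. Only the per-clan numbers and the verdict string are taken from the B5 re-run reported above.

```python
# Hypothetical sketch of the balance verdict; names are assumptions,
# the numbers come from the B5 re-run quoted in the objective file.
MIN_APPEARANCES = 5    # clans with at least this many games must win once
MAX_WIN_RATE = 0.50    # no clan may win more than half of its games

# clan -> (wins, appearances), as reported for the 50-game B5 batch
B5_RESULTS = {
    "deepforge": (4, 10),
    "ironhold": (3, 10),
    "goldvein": (1, 10),
    "runesmith": (1, 10),
    "blackhammer": (0, 10),
}


def check_balance(results):
    """Return a list of failure messages; an empty list means the check passes."""
    failures = []
    for clan, (wins, appearances) in results.items():
        # Clause: enough appearances but no wins at all.
        if appearances >= MIN_APPEARANCES and wins == 0:
            failures.append(
                f"{clan} has {appearances} appearances but 0 wins "
                f"(threshold: >= {MIN_APPEARANCES})"
            )
        # Clause: a single clan dominating its games.
        if appearances > 0 and wins / appearances > MAX_WIN_RATE:
            failures.append(
                f"{clan} win rate {wins / appearances:.0%} exceeds {MAX_WIN_RATE:.0%}"
            )
    return failures


if __name__ == "__main__":
    for message in check_balance(B5_RESULTS):
        print(message)
```

With the B5 numbers, only the zero-win clause fires (for blackhammer), which matches the verdict string quoted in the objective.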