feat(@projects/@magic-civilization): ✨ update mcts evidence thresholds
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
parent 7804c76a1f
commit 46aa5486f9
1 changed file with 20 additions and 10 deletions
@@ -38,19 +38,29 @@ evidence:
- `total_combats` ≥ 50 in ≥7/10 games (real conflict took place; players didn't fold without fighting)
These five sub-gates jointly measure whether games feel like a competitive 4X arc regardless of victory mode. No single "median TTV" number replaces them — game length is a *consequence*, not a target.
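
The five sub-gates can be evaluated mechanically from per-game metrics. The sketch below is illustrative only: the `GameMetrics` record, its field names (chosen to mirror the metric names in this document), and the aggregation rules (median for tier_peak and tier_peak_gap, share-of-games for the rest) are assumptions read off the gate wording, not the project's actual evaluation harness.

```python
from collections.abc import Callable
from dataclasses import dataclass
from statistics import median


@dataclass
class GameMetrics:
    # Hypothetical per-game record; field names mirror the metrics in this doc.
    winner_tier_peak: float
    peak_unit_tier: int
    tier_peak_gap: float
    wonders_per_player: list[int]  # wonders built by each player (non-empty)
    total_combats: int


def fraction(games: list[GameMetrics], pred: Callable[[GameMetrics], bool]) -> float:
    """Share of games for which pred holds."""
    return sum(1 for g in games if pred(g)) / len(games)


def evaluate_gates(games: list[GameMetrics]) -> dict[str, bool]:
    """Assumed reading of the five p0-01 sub-gates; thresholds are from the doc."""
    return {
        "tier_peak >= 6": median(g.winner_tier_peak for g in games) >= 6,
        "peak_unit_tier >= 6 in >=7/10":
            fraction(games, lambda g: g.peak_unit_tier >= 6) >= 0.7,
        "tier_peak_gap <= 2": median(g.tier_peak_gap for g in games) <= 2,
        ">=1 wonder per player in >=5/10":
            fraction(games, lambda g: min(g.wonders_per_player) >= 1) >= 0.5,
        "total_combats >= 50 in >=7/10":
            fraction(games, lambda g: g.total_combats >= 50) >= 0.7,
    }
```

Returning one boolean per gate, rather than a single pass/fail, lets a batch report show exactly which sub-gates fail.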
**Pre-p0-37 evidence (2026-04-18, post-p0-26 port close):**

Normal-vs-Normal smoke run (`apricot-20260418_074209`, 10 seeds, T300, `AI_GPU_ROLLOUT=false`) plus 5 clan batches (`apricot-20260418_08*`: ironhold, goldvein, blackhammer, deepforge, runesmith):

| Batch | victories | median winner tier_peak | median peak_unit_tier | median tier_peak_gap |
|---|---|---|---|---|
| smoke (mixed) | 9/10 | 3.0 | 1.0 | ~3 |
| ironhold | 8/10 | 3.0 | 1.0 | 3 |
| goldvein | 9/10 | 3.0 | 1.0 | 3 |
| blackhammer | 9/10 | 3.0 | 1.0 | 3 |
| deepforge | 8/10 | 2.5 | 1.0 | 4 |
| runesmith | 9/10 | 3.0 | 1.0 | 3 |

All 5 quality sub-gates FAIL: tier_peak 2.5-3.0 (required ≥6); peak_unit_tier 1.0 (required ≥6 in ≥7/10); tier_peak_gap 3-4 (required ≤2); wonder_count 0 (none built); total_combats below target. **Diagnosis**: games end between T39 and T100 through early domination, before tech progresses past tier 1. This is a GAMEPLAY BALANCE issue (domination threshold too loose, tech costs too steep, or map too small), not an AI defect: MCTS correctly pursues the shortest path to victory, which under current data happens to be a domination rush.

**Current evidence (2026-04-18, post-p0-37 thresholds landing):**

Post-p0-37 batches, with personality-emergent thresholds lifted from global constants into axis-derived functions:

| Batch | victories | median winner tier_peak | games_any_wonder | turn range |
|---|---|---|---|---|
| smoke (mixed, `apricot-20260418_120715`) | 9/10 | **4.0** | **9/10** | T39-T300 (median ~T175) |
| ironhold (`apricot-20260418_123422`) | 9/10 | 3.0 | 7/10 | T58-T300 |
| goldvein (`apricot-20260418_124605`) | 3/10 (7 capped) | 2.0 | 7/10 | T117-T157 (wall-clock capped) |
| blackhammer (`apricot-20260418_125238`) | 8/10 | 2.5 | 6/10 | T39-T300 |
| deepforge (`apricot-20260418_131202`) | 9/10 | 4.0 | 7/10 | T58-T300 |
| runesmith (`apricot-20260418_132031`) | 9/10 | 3.0 | 8/10 | T58-T300 |
**Pre-p0-37 baselines** (for comparison): median winner tier_peak 3.0 in every batch except deepforge (2.5), 0/10 games built any wonder, and games clustered at T39-T100.
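**Movement**: median winner tier_peak moved from 3.0 (2.5 in deepforge) to a 2.0-4.0 spread across batches (smoke 3.0 → 4.0, +33%; the 2.0 is the wall-clock-capped goldvein batch); games_any_wonder rose from 0/10 to 6-8/10 per clan and 9/10 in the mixed smoke. Most games in every batch now reach mid-game content.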
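
The movement figures can be recomputed directly from the evidence tables above. This is a small sketch: the two dictionaries simply transcribe the median winner tier_peak column of the pre- and post-p0-37 tables, and the helper names `spread` and `pct_change` are illustrative.

```python
# Median winner tier_peak per batch, transcribed from the evidence tables.
PRE_P0_37 = {"smoke": 3.0, "ironhold": 3.0, "goldvein": 3.0,
             "blackhammer": 3.0, "deepforge": 2.5, "runesmith": 3.0}
POST_P0_37 = {"smoke": 4.0, "ironhold": 3.0,
              "goldvein": 2.0,  # wall-clock-capped batch
              "blackhammer": 2.5, "deepforge": 4.0, "runesmith": 3.0}


def spread(table: dict[str, float]) -> tuple[float, float]:
    """(min, max) of the per-batch medians."""
    return min(table.values()), max(table.values())


def pct_change(before: float, after: float) -> float:
    """Relative change in percent."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    for batch, before in PRE_P0_37.items():
        after = POST_P0_37[batch]
        print(f"{batch:12s} {before:.1f} -> {after:.1f} "
              f"({pct_change(before, after):+.0f}%)")
    print("spread:", spread(PRE_P0_37), "->", spread(POST_P0_37))
```

The smoke row reproduces the +33% quoted above; the same pattern applies to the games_any_wonder column.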
**Remaining gaps vs p0-01 gates**:
- ✗ tier_peak ≥ 6: currently 2.5-4.0 in uncapped batches (2.0 in the wall-clock-capped goldvein batch). Additional tempo/tech-cost tuning could push this toward 5, but **tier 6 appears gated by the tech-tree progression rate, not tactical AI**: games running to T300 still show peak_unit_tier=1 across the board.
- ✗ peak_unit_tier ≥ 6 in ≥7/10: currently 1.0 in every batch. This indicates tech/unit unlocks aren't triggering regardless of game length, making it a **game-systems / game-data concern** outside warcouncil scope.
- ✗ tier_peak_gap ≤ 2: 3-4 observed. Longer games give the stronger player more time to extend its lead. Likely to improve with p0-38 (PUCT divergence).
- ✓ ≥1 wonder per player in ≥5/10 (CONFIRMED across all 5 clans post-p0-37).
- Pending measurement: total_combats ≥ 50 in ≥7/10.
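
For reference on the p0-38 lever, the standard AlphaZero-style PUCT rule selects the child maximizing Q(s,a) + c_puct · P(s,a) · √N(s) / (1 + N(s,a)). This document does not say how p0-38 introduces "divergence"; the sketch below assumes, purely for illustration, that it means personalities use different exploration constants `c_puct`. `Child`, `puct_score`, and `select_child` are hypothetical names, not the project's API.

```python
import math
from dataclasses import dataclass


@dataclass
class Child:
    prior: float            # P(s, a) from the policy
    visits: int = 0         # N(s, a)
    value_sum: float = 0.0  # W(s, a), sum of backed-up values

    @property
    def q(self) -> float:
        # Mean action value; unvisited children default to 0.
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(child: Child, parent_visits: int, c_puct: float) -> float:
    """Q + c_puct * P * sqrt(N_parent) / (1 + N_child)."""
    u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return child.q + u


def select_child(children: list[Child], c_puct: float) -> int:
    """Index of the highest-scoring child. N(s) is taken as the sum of child
    visits; if it is 0, every exploration term is 0 and ties go to index 0."""
    parent_visits = sum(c.visits for c in children)
    return max(range(len(children)),
               key=lambda i: puct_score(children[i], parent_visits, c_puct))
```

With a small `c_puct` the search keeps exploiting the already-visited child; with a large one an unvisited child with a reasonable prior wins selection, which is the kind of behavioral spread a per-personality setting could produce.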
**Remaining to reach done:**
1. Land `p0-37` (lift the 7 tactical constants to axis-derived functions), the primary lever per the 2026-04-18 council analysis. Personality-emergent thresholds should push median game length past T250 (via cautious-clan games) and spread tier_peak across clans.
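
A minimal sketch of what lifting a global tactical constant into an axis-derived function could look like, assuming personality axes normalized to [0, 1]. The axes (`aggression`, `caution`), the constants, and every endpoint value below are hypothetical; the document does not list the seven constants or which axes drive them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Personality:
    # Hypothetical personality axes, each normalized to [0, 1].
    aggression: float
    caution: float


# Before: global constants shared by every AI (hypothetical values).
RETREAT_HP_FRACTION = 0.35
ATTACK_ODDS_THRESHOLD = 1.15


def lerp(lo: float, hi: float, t: float) -> float:
    """Linear interpolation from lo (t=0) to hi (t=1); t is clamped to [0, 1]."""
    t = min(max(t, 0.0), 1.0)
    return lo + (hi - lo) * t


def retreat_hp_fraction(p: Personality) -> float:
    """After: cautious clans retreat at higher HP. Endpoints are chosen so a
    neutral caution of 0.5 reproduces RETREAT_HP_FRACTION."""
    return lerp(0.15, 0.55, p.caution)


def attack_odds_threshold(p: Personality) -> float:
    """After: aggressive clans accept worse odds. Neutral aggression (0.5)
    reproduces ATTACK_ODDS_THRESHOLD."""
    return lerp(1.5, 0.8, p.aggression)
```

Anchoring the endpoints so that a neutral personality reproduces the old constant keeps middle-of-the-road behavior unchanged while letting the extremes diverge, which is where the longer cautious-clan games would come from.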