feat(@projects/@magic-civilization): ✨ update quality metrics schema and tests

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-04-17 14:28:51 -07:00
commit b1139dc32b
parent 43989eed82
2 changed files with 13 additions and 12 deletions
--- a/.project/objectives/p0-25-game-quality-metrics-instrumentation.md
+++ b/.project/objectives/p0-25-game-quality-metrics-instrumentation.md
@@ -7,23 +7,24 @@ owner: shipwright
 status: done
 updated_at: 2026-04-17
 evidence:
-  - src/game/engine/scenes/tests/auto_play.gd
+  - src/game/engine/src/generation/auto_play.gd
  - tools/autoplay-report.py
+  - tools/autoplay-validate.py
  - tools/schemas/autoplay/turn-stats-line.json
-  - tools/test_quality_metrics.py
+  - tools/tests/test_quality_metrics.py
 ---

 ## Summary

-Added 2026-04-17 as part of the TTV → state-at-end metric reframe (see p0-01). `turn_stats.jsonl` per-player stats now carry three quality metrics: `tier_peak` (max era reached, derived from max `era` across `researched_techs`), `peak_unit_tier` (running max unit tier across all units ever alive, tracked in `_stats[idx]`), and `wonder_count` (buildings with `"flags": ["wonder"]` summed across all cities). The schema declares all three with backward-compat (not `required`, sentinel 0 for historical batches). `tools/autoplay-report.py` reports `build_quality_metrics` + `print_quality_metrics` surfacing all five p0-01 sub-gate inputs. 14 pytest tests in `tools/test_quality_metrics.py` cover schema round-trips and reporter medians.
+Added 2026-04-17 as part of the TTV → state-at-end metric reframe (see p0-01). `turn_stats.jsonl` per-player stats now carry three quality metrics: `tier_peak` (max era reached, monotonic across turns; derived each turn by folding `DataLoader.get_tech(id).era` over `player.researched_techs` in `_check_invariants`), `peak_unit_tier` (max `DataLoader.get_unit(id).tier` seen via the `EventBus.unit_created` hook in `_on_unit_created`), and `wonder_count` (entries in `GameState.wonders_built` whose value equals the player's index, computed in `_build_player_stats`). The schema declares all three with backward-compat — fields are NOT in `required`, so historical batches (pre-p0-25) still validate; the reporter treats absent fields as sentinel `-1` and filters them from medians. `tools/autoplay-report.py` adds `build_quality_metrics` + `print_quality_metrics`, surfacing winner/loser `tier_peak`, per-game `tier_peak_gap`, `peak_unit_tier` across all players, and `wonder_count_per_player`. 8 pytest tests in `tools/tests/test_quality_metrics.py` cover schema round-trips (new + old jsonl + min/max rejection) and reporter medians (new-only, mixed, old-only).

 ## Acceptance

- ✓ `turn_stats.jsonl` per-player stats carry three new fields: `tier_peak: int` (1-10), `peak_unit_tier: int` (1-10), `wonder_count: int`. Implemented in `auto_play.gd:_build_player_stats()`.
- ✓ `tools/schemas/autoplay/turn-stats-line.json` declares the three new fields in the `player_stats` definition (min/max constraints; not `required` for backward compat). Fields validated by `test_schema_accepts_new_fields` + `test_schema_rejects_tier_peak_out_of_range`.
- ✓ `tools/autoplay-report.py` computes + renders medians for all three new fields plus `tier_peak_gap_median` across seeds via `build_quality_metrics` / `print_quality_metrics`. Both CSV (`PLAYER_FIELDS`) and stdout (`print_summary`) paths covered.
- ✓ 14 pytest tests in `tools/test_quality_metrics.py`: schema validator round-trip with and without new fields; reporter medians; sentinel filtering. All pass.
- ✓ Backward compatibility: schema does not mark new fields `required`; `QUALITY_METRIC_ABSENT = -1` sentinel in reporter filters pre-p0-25 rows from median calculations. Verified by `test_schema_accepts_missing_new_fields_backward_compat` and `test_quality_metrics_absent_sentinel_filtered`.
+- ✓ `turn_stats.jsonl` per-player stats carry three new fields: `tier_peak: int` (1-10), `peak_unit_tier: int` (1-10), `wonder_count: int`. Implemented in `src/game/engine/src/generation/auto_play.gd` — fields emitted by `_build_player_stats()` (lines around 2065-2067 after the edit); tracked in `_check_invariants` (tier_peak) and `_on_unit_created` (peak_unit_tier); per-player wonder count folded from `GameState.wonders_built`.
+- ✓ `tools/schemas/autoplay/turn-stats-line.json` declares the three new fields in the `player_stats` definition with `minimum: 0, maximum: 10` for tier fields and `minimum: 0` for `wonder_count`. Not added to `required[]` so old jsonl still validates — per-field docstring spells out the sentinel-0 contract. `tools/autoplay-validate.py` extended to honor `maximum` so the 1-10 cap actually enforces.
+- ✓ `tools/autoplay-report.py` computes + renders medians for all three new fields plus `median_tier_peak_gap` across seeds via `build_quality_metrics` / `print_quality_metrics`. Both CSV (`PLAYER_FIELDS` extended with `tier_peak`, `peak_unit_tier`, `wonder_count` → p0_/p1_ columns) and stdout (`print_summary` → `print_quality_metrics` block) paths covered.
+- ✓ 8 pytest tests in `tools/tests/test_quality_metrics.py` cover: schema accepts new jsonl, schema accepts old jsonl, schema rejects `tier_peak` out of [0,10], reporter extracts new fields, reporter emits `-1` sentinel for old jsonl, reporter computes correct medians across 3 fabricated games, reporter returns `None` medians when batch has no quality data, reporter aggregates cleanly on a mixed new/old batch. Verified green on apricot — `python3 -m pytest tools/tests/test_quality_metrics.py -v` → `8 passed in 0.07s`.
+- ✓ Backward compatibility: schema does not mark new fields `required`; `QUALITY_METRIC_ABSENT = -1` sentinel in reporter filters pre-p0-25 rows from median calculations. Verified end-to-end by `python3 tools/autoplay-report.py .local/iter/blackhammer_tune_20260417_101447` (a pre-p0-25 batch) — CSV gains the new `pN_tier_peak/peak_unit_tier/wonder_count` columns filled with `-1`, the quality block prints `(no data — batch pre-dates p0-25 instrumentation)`, all assertions pass.

 ## Depends on

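The backward-compatibility rule in the Acceptance list above (absent fields become `QUALITY_METRIC_ABSENT = -1` and are dropped before medians are taken) can be sketched as follows. This is a minimal illustration, not the code in `tools/autoplay-report.py`: only the sentinel name and the three field names come from the objective text, while `extract_metric`, `median_or_none`, and the sample rows are hypothetical.

```python
from statistics import median
from typing import List, Optional

# Sentinel name taken from the objective text; everything else is illustrative.
QUALITY_METRIC_ABSENT = -1


def extract_metric(player_stats: dict, field: str) -> int:
    """Return the metric, or the sentinel when a pre-p0-25 row lacks the field."""
    return player_stats.get(field, QUALITY_METRIC_ABSENT)


def median_or_none(values: List[int]) -> Optional[float]:
    """Median over rows that carry the metric; None when no row does."""
    present = [v for v in values if v != QUALITY_METRIC_ABSENT]
    return median(present) if present else None


# Hypothetical per-player rows: two instrumented, one from an older batch.
rows = [
    {"tier_peak": 4, "peak_unit_tier": 3, "wonder_count": 1},
    {"tier_peak": 6, "peak_unit_tier": 5, "wonder_count": 0},
    {"cities": 2},  # pre-p0-25 row with no quality fields
]

tier_peaks = [extract_metric(r, "tier_peak") for r in rows]
mixed_median = median_or_none(tier_peaks)
old_only_median = median_or_none([extract_metric(rows[2], "tier_peak")])
```

Returning `None` for an all-sentinel batch, rather than a numeric placeholder, lets the caller tell "no data" apart from a genuine median, which is the distinction the old-only test case in the list above relies on.
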
--- a/public/games/age-of-dwarves/data/objectives.json
+++ b/public/games/age-of-dwarves/data/objectives.json
@@ -1,11 +1,11 @@
 {
-  "generated_at": "2026-04-17T21:23:15Z",
+  "generated_at": "2026-04-17T21:24:08Z",
  "totals": {
-    "stub": 1,
-    "partial": 12,
    "done": 37,
    "oos": 9,
+    "partial": 12,
    "missing": 2,
+    "stub": 1,
    "total": 61
  },
  "objectives": [
@@ -257,7 +257,7 @@
      "scope": "game1",
      "owner": "shipwright",
      "updated_at": "2026-04-17",
-      "summary": "Added 2026-04-17 as part of the TTV → state-at-end metric reframe (see p0-01). `turn_stats.jsonl` per-player stats now carry three quality metrics: `tier_peak` (max era reached, derived from max `era` across `researched_techs`), `peak_unit_tier` (running max unit tier across all units ever alive, tracked in `_stats[idx]`), and `wonder_count` (buildings with `\"flags\": [\"wonder\"]` summed across all cities). The schema declares all three with backward-compat (not `required`, sentinel 0 for historical batches). `tools/autoplay-report.py` reports `build_quality_metrics` + `print_quality_metrics` surfacing all five p0-01 sub-gate inputs. 14 pytest tests in `tools/test_quality_metrics.py` cover schema round-trips and reporter medians."
+      "summary": "Added 2026-04-17 as part of the TTV → state-at-end metric reframe (see p0-01). `turn_stats.jsonl` per-player stats now carry three quality metrics: `tier_peak` (max era reached, monotonic across turns; derived each turn by folding `DataLoader.get_tech(id).era` over `player.researched_techs` in `_check_invariants`), `peak_unit_tier` (max `DataLoader.get_unit(id).tier` seen via the `EventBus.unit_created` hook in `_on_unit_created`), and `wonder_count` (entries in `GameState.wonders_built` whose value equals the player's index, computed in `_build_player_stats`). The schema declares all three with backward-compat — fields are NOT in `required`, so historical batches (pre-p0-25) still validate; the reporter treats absent fields as sentinel `-1` and filters them from medians. `tools/autoplay-report.py` adds `build_quality_metrics` + `print_quality_metrics`, surfacing winner/loser `tier_peak`, per-game `tier_peak_gap`, `peak_unit_tier` across all players, and `wonder_count_per_player`. 8 pytest tests in `tools/tests/test_quality_metrics.py` cover schema round-trips (new + old jsonl + min/max rejection) and reporter medians (new-only, mixed, old-only)."
    },
    {
      "id": "p1-01",