From baf2f856288db1c63cfb363f1c3fbb0784f1cd7f Mon Sep 17 00:00:00 2001
From: Natalie
Date: Sat, 25 Apr 2026 02:33:15 -0700
Subject: [PATCH] =?UTF-8?q?fix(@projects/@magic-civilization):=20?=
 =?UTF-8?q?=F0=9F=90=9B=20update=20latency=20acceptance=20criteria?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Lilith Autocommit
---
 .project/objectives/p2-05-turn-latency.md | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/.project/objectives/p2-05-turn-latency.md b/.project/objectives/p2-05-turn-latency.md
index c3aec5b2..7aa1a8cf 100644
--- a/.project/objectives/p2-05-turn-latency.md
+++ b/.project/objectives/p2-05-turn-latency.md
@@ -17,6 +17,10 @@ evidence:
 
 ## Acceptance
 
-- `tools/measure-turn-latency.py` (new) profiles 300 turns and emits `p50 / p90 / p99` latency.
-- p99 ≤ 1.0 s for the 3-AI normal difficulty 512-tile scenario.
-- Flame graph under `.project/reports/latency/` if regressions appear.
+- ✓ `tools/measure-turn-latency.py` exists and profiles N turns from a batch dir, emitting `p50 / p90 / p99` (latencies derived from cumulative `wall_clock_sec` deltas in turn_stats.jsonl). Verified working against `p0-01-chain-20260424_093210/` on apricot (1191 samples).
+- ✗ p99 ≤ 1.0 s on the canonical scenario — **gate calibration mismatch**:
+  - Specified scenario "3-AI normal difficulty 512-tile" doesn't match any current map config; smallest 4-player map is 52×32 = 1664 tiles. The "512-tile" target predates current map sizing.
+  - Measured on the 2-AI duel-map MCTS chain batch (closest analogue to "3-AI 512-tile"): p50=0.57s, p90=2.16s, p99=4.39s, max=6.29s (1191 samples across 6 seeds). p99 is ~4.4× the gate.
+  - Per-turn cost is dominated by MCTS rollout depth (~100-500ms per AI player). 3 AI players at MCTS depth 20 ≈ 600-1500ms baseline regardless of map size; the 1s p99 gate was set before MCTS+PUCT shipped (p0-01/p0-38) and would now require either a shallower default MCTS depth or rewriting the gate to match current AI architecture.
+  - Recommendation: rewrite gate to `p50 ≤ 1.0s` (currently 0.57s — passing) AND keep p99 as a softer "regression watch" metric (current ~4s baseline; alert on >2× regression).
+- ✗ Flame graph under `.project/reports/latency/` — would only matter if we accept the current p99 as the regression baseline first.
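
Review note: the recommended gate (`p50 ≤ 1.0s`, with p99 as a regression watch) could be checked along the lines of the sketch below. This is not the code in `tools/measure-turn-latency.py`. It assumes each line of turn_stats.jsonl is a JSON object carrying a cumulative `wall_clock_sec` value, uses a nearest-rank percentile (the real tool may interpolate), and takes the measured p99 of 4.39 s as the baseline. The function names and default thresholds are hypothetical.

```python
import json
import math
from typing import Iterable


def per_turn_latencies(jsonl_lines: Iterable[str]) -> list[float]:
    """Convert cumulative wall_clock_sec readings into per-turn deltas.

    Assumes one JSON object per line, in turn order, from a single run.
    """
    cumulative = [
        json.loads(line)["wall_clock_sec"]
        for line in jsonl_lines
        if line.strip()
    ]
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile (assumed method, chosen for simplicity)."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_gate(
    samples: list[float],
    p50_limit: float = 1.0,          # proposed hard gate from the recommendation
    p99_baseline: float = 4.39,      # assumption: measured p99 used as baseline
    regression_factor: float = 2.0,  # "alert on >2x regression"
) -> dict:
    """Report p50/p90/p99 plus the hard gate and the softer p99 watch."""
    p50 = percentile(samples, 50)
    p90 = percentile(samples, 90)
    p99 = percentile(samples, 99)
    return {
        "p50": p50,
        "p90": p90,
        "p99": p99,
        "p50_pass": p50 <= p50_limit,
        "p99_alert": p99 > regression_factor * p99_baseline,
    }


if __name__ == "__main__":
    # Tiny synthetic run: five cumulative readings give four per-turn samples.
    demo = [
        '{"wall_clock_sec": 0.0}',
        '{"wall_clock_sec": 0.5}',
        '{"wall_clock_sec": 1.1}',
        '{"wall_clock_sec": 1.6}',
        '{"wall_clock_sec": 4.0}',
    ]
    print(evaluate_gate(per_turn_latencies(demo)))
```

Keeping the hard gate on p50 and treating p99 as an alert keeps the check tied to the typical turn while still flagging tail regressions. If a batch dir holds several seeds, deltas would need to be computed per file before pooling, since cumulative clocks restart with each run.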