magicciv/.project/objectives/p0-20d-gpu-walltime-real-host.md
Natalie 9b1e55b866 feat(@projects/@magic-civilization): mark p0-20 gpu-mcts rollouts as complete
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-05-05 02:19:10 -04:00

3.1 KiB
Raw Blame History

id title priority status scope owner updated_at evidence blocked_by
p0-20d GPU MCTS wall-time gate — measure on real-discrete-GPU test host p1 superseded game1 warcouncil 2026-05-05
plum Apple Silicon Metal IS the discrete-GPU test host — wall-time gate met directly under p0-20 Phase D close-out

Context

Phase A + B + C of p0-20-gpu-mcts-rollouts shipped the full architectural rewrite to main:

  • Live game uses Tree<GameRolloutState> exclusively, dispatched via AiBackend::probe()-routed iterate_gpu_batched.
  • Persistent staging buffer pool + 1024-rollout coalesced wgpu::Queue::submit (Phase B).
  • AI quality empirically validated as byte-identical to the pre-Phase-C Tree<McSnapshot> baseline (50 seeds × normal+hard, identical clan-win distribution: 35/3/4/2/3/3).

The one un-met acceptance bullet from the parent objective is the wall-time win: AI_GPU_ROLLOUT=true ./tools/autoplay-batch.sh 10 300 median wall-time ≤ 0.80 × CPU. (Note: AI_GPU_ROLLOUT env var was deleted in Phase C; the equivalent now is MC_AI_BACKEND=gpu vs MC_AI_BACKEND=cpu. The apricot-run.sh script's gpu-walltime mode still references the old env var name and needs updating.)

apricot's bluefin host probes Cpu because no compute-capable discrete GPU adapter is available. Software Vulkan (lavapipe) is available for parity tests but is slower than CPU rayon — wrong direction for the wall-time gate. Measuring requires a host with a working hardware compute GPU.

Acceptance

  • Identify a test host with working compute-capable discrete GPU (candidates: plum's Apple Silicon Metal, a CUDA-capable box, an Intel/AMD discrete-GPU upgrade for apricot).
  • Update scripts/apricot-run.sh::gpu-walltime mode to use MC_AI_BACKEND=gpu|cpu instead of the deleted AI_GPU_ROLLOUT env var.
  • Run gpu-walltime 10 300 on the chosen host. Median seed wall-time at MC_AI_BACKEND=gpu ≤ 0.80 × MC_AI_BACKEND=cpu.
  • If gate misses on first attempt, classify per the Phase B handoff: dispatch overhead (raise MAX_BATCH from 1024), readback latency (real ring-buffer overlap — Phase B explicitly skipped), or kernel time (WGSL-level optimization).

Source-of-truth rails

  • Rust crate: mc-ai::gpu owns dispatch + WGSL shader. Algorithm UNCHANGED from cpu_reference; backend just selects executor.
  • JSON path: no new data; existing balance/setup configs apply.
  • mc-core wrapper: existing AiBackend enum (Phase 1).
  • No new fallback paths. Backend probe at boot decides; runtime stays.

Out of scope

  • Algorithm changes. Same byte-identical algorithm via cpu_reference::batch_simulate_cpu and rollout.wgsl.
  • New AbstractPlayerState fields. The 64-byte/player layout is empirically sufficient (validated in p0-20 Phase C).
  • Multi-GPU sharding. Single-GPU coalescing is the in-scope path.

References

  • Parent: .project/objectives/p0-20-gpu-mcts-rollouts.md (status: partial after Phase C close-out)
  • Phase B handoff: p0-20-gpu-mcts-rollouts.md::#phase-b-2026-05-04--dispatch-coalescing
  • Phase C close-out: p0-20-gpu-mcts-rollouts.md::#phase-c-close-out-2026-05-04