magicciv/.project/objectives/p0-20-gpu-mcts-rollouts.md at 2d9554d9ff31c1c67b5599d091343cf9089752d0

Natalie 2d9554d9ff feat(@projects): ✨ update wasm build and guide deployment workflows

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>

2026-04-17 13:06:14 -07:00


title

priority

status

scope

owner

updated_at

evidence

p0-20

GPU-accelerated MCTS rollouts for look-ahead decision-making

partial

game1

warcouncil

2026-04-17

src/simulator/crates/mc-ai/src/abstract_state.rs

src/simulator/crates/mc-ai/src/mcts_tree.rs

src/simulator/crates/mc-ai/src/rollout.rs

src/simulator/crates/mc-ai/src/gpu/inner.rs

src/simulator/crates/mc-ai/src/gpu/rollout.wgsl

src/simulator/crates/mc-ai/src/gpu/cpu_reference.rs

src/simulator/crates/mc-ai/tests/gpu_rollout_parity.rs

src/simulator/crates/mc-turn/src/gpu/mod.rs

src/simulator/crates/mc-ai/src/game_state.rs

scripts/dev-setup/bluefin.sh

Summary

The MCTS tree (mcts_tree.rs) and the mc-turn GPU fauna pipeline are both live on main, but the AI cannot currently afford wide tree search: cloning the full GridState (~12 MB at 256×256) exhausts RAM long before the tree is deep enough to matter, and TreeState::simulate() is a stub that returns a constant 0.5. This objective introduces a GPU-batched abstract rollout layer so the tree search can evaluate hundreds of candidate futures per leaf at single-digit-millisecond cost.
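To make the memory argument concrete, here is a back-of-the-envelope sketch. It assumes the ~12 MB figure means 12 MiB, that one state is stored per tree node, and that the abstract state hits its ~256 byte design target; the node count used in the note below is hypothetical.

```rust
/// Bytes per full GridState clone, taken from the ~12 MB figure (read as MiB).
pub const GRID_STATE_BYTES: u64 = 12 * 1024 * 1024;
/// Bytes per abstract rollout state, taken from the ~256 byte design target.
pub const ABSTRACT_STATE_BYTES: u64 = 256;

/// Total bytes needed to keep one state per tree node.
pub fn tree_bytes(nodes: u64, bytes_per_state: u64) -> u64 {
    nodes * bytes_per_state
}

/// Approximate per-tile cost of a full clone on a square map of `side` tiles.
pub fn bytes_per_tile(side: u64) -> u64 {
    GRID_STATE_BYTES / (side * side)
}
```

Under these assumptions a 256×256 map costs 192 bytes per tile, and a 10,000-node tree would need roughly 117 GiB with full clones versus about 2.4 MiB with abstract states.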

2026-04-17 update — GPU↔CPU numerical parity ACHIEVED

Phase C structural work shipped in the earlier team pass, but the parity test was silently taking the skip path on headless hosts, so the shader had never actually compiled on any adapter. A deep audit and four independent fixes landed this cycle and demonstrate real numerical parity:

WGSL reserved-keyword bug: var active: u32 = 0u at rollout.wgsl:607 used the active reserved word → Naga parse panic → wgpu_core handler → try_init worker thread panic → timeout returned None → skip-path. Renamed to active_idx; the shader now actually compiles. Without this, the skip-path was structurally "passing" every test in Phase C without ever exercising the WGSL kernel.
Adapter backend restriction: wgpu::Backends::all() picked the NVIDIA OpenGL adapter first on apricot, whose compute support silently fails at request_device. Restricted to VULKAN | METAL | DX12 | BROWSER_WEBGPU which all have first-class compute paths.
Device limits fix: Limits::default() targets a discrete GPU — too large for llvmpipe / lavapipe. Changed to Limits::downlevel_defaults().using_resolution(adapter.limits()) so software Vulkan backends can satisfy device creation.
Action-walk order unified: the root numerical divergence. CPU active_actions() returned actions in insertion order [Build, Research, Defend, Idle, Attack, ...]; WGSL iterated k=0..9 in ActionKind::ALL numerical order [Build, Attack, Settle, Research, ...]. Identical probabilities, identical RNG draw → different action picked at every cumulative-sum boundary. Rewrote active_actions() to iterate ActionKind::ALL in canonical order (with explicit docstring warning not to reorder for readability).
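The action-walk divergence is easy to reproduce in isolation. The sketch below is not the project's code: the `Action` enum is a subset of the kinds named above, and the weights are hypothetical. It shows that cumulative-sum selection with identical probabilities and an identical draw still picks different actions when the walk order differs.

```rust
/// Subset of the action kinds named in the fix notes; the real enum has more.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Action { Build, Attack, Settle, Research, Defend, Idle }

/// Roulette-wheel selection: return the first action whose running
/// cumulative probability exceeds `draw` (expected in [0, 1)).
pub fn pick(order: &[(Action, f32)], draw: f32) -> Action {
    let mut cumulative = 0.0_f32;
    for &(action, p) in order {
        cumulative += p;
        if draw < cumulative {
            return action;
        }
    }
    // Rounding can leave `draw` at or above the final sum; use the last entry.
    order.last().expect("at least one active action").0
}

/// Hypothetical weights, walked in the old CPU insertion order.
pub fn insertion_order() -> Vec<(Action, f32)> {
    vec![
        (Action::Build, 0.25), (Action::Research, 0.25), (Action::Defend, 0.25),
        (Action::Idle, 0.125), (Action::Attack, 0.125),
    ]
}

/// The same weights, walked in canonical enum order (Settle is inactive here).
pub fn canonical_order() -> Vec<(Action, f32)> {
    vec![
        (Action::Build, 0.25), (Action::Attack, 0.125), (Action::Research, 0.25),
        (Action::Defend, 0.25), (Action::Idle, 0.125),
    ]
}
```

With a draw of 0.3, the insertion-order walk lands on Research while the canonical walk lands on Attack; a draw of 0.1 lands on Build in both. Only draws that fall past the first point where the two orders differ can flip, which matches the "action-boundary flips" description.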

Parity verification on apricot (headless bluefin + lavapipe software Vulkan), run with MC_AI_GPU_DEBUG=1 VK_DRIVER_FILES=/usr/share/vulkan/icd.d/lvp_icd.x86_64.json so the tests dispatch on real llvmpipe rather than taking the skip path:

[parity small_batch backend=Vulkan]       n=16  agree=16/16  (1.000)  max_drift=0.000000
[parity partial_workgroup backend=Vulkan] n=65  agree=65/65  (1.000)  max_drift=0.000000
[parity multi_workgroup backend=Vulkan]   n=128 agree=128/128 (1.000) max_drift=0.000000
buckets: <1e-6=all, others=0 across all three tests

Agreement is 100%, not merely the stated 98% tolerance: output is bit-identical on all 3 quantitative parity tests (209 inputs total). Pre-fix: 3–6% agreement with max_drift 0.025–0.043 (action-boundary flips). Post-fix: integer fields and scalar fields are both byte-equal. On these inputs the WGSL kernel is a byte-for-byte port of rollout::walk.
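The figures in the log lines can be read as follows. This is a minimal sketch, not the code in gpu_rollout_parity.rs: it assumes "agree" means bit-equal f32 outputs and "max_drift" is the largest absolute difference. It also assumes a workgroup size of 64, which the test names (65 inputs as "partial_workgroup", 128 as "multi_workgroup") suggest but the source does not state.

```rust
/// Summary of one CPU-vs-GPU comparison (hypothetical type).
pub struct ParityReport {
    pub n: usize,
    pub agree: usize,
    pub max_drift: f32,
}

/// Compare two output vectors element by element.
pub fn parity(cpu: &[f32], gpu: &[f32]) -> ParityReport {
    assert_eq!(cpu.len(), gpu.len(), "both backends must produce one value per input");
    let mut agree = 0;
    let mut max_drift = 0.0_f32;
    for (c, g) in cpu.iter().zip(gpu) {
        // Bit equality is stricter than `==`: it also distinguishes 0.0 from -0.0.
        if c.to_bits() == g.to_bits() {
            agree += 1;
        }
        max_drift = max_drift.max((c - g).abs());
    }
    ParityReport { n: cpu.len(), agree, max_drift }
}

impl ParityReport {
    /// Fraction of inputs on which both backends agree.
    pub fn rate(&self) -> f32 {
        if self.n == 0 { 1.0 } else { self.agree as f32 / self.n as f32 }
    }

    /// True when the agreement rate meets the tolerance (0.98 in this objective).
    pub fn passes(&self, min_rate: f32) -> bool {
        self.rate() >= min_rate
    }
}

/// Assumed workgroup size; not confirmed by the source.
pub const WORKGROUP_SIZE: u32 = 64;

/// Number of workgroups to dispatch for `n` rollouts (rounded up).
pub fn workgroups_for(n: u32) -> u32 {
    n.div_ceil(WORKGROUP_SIZE)
}
```

Under the workgroup-size assumption, the three tests cover one partly filled workgroup (16), a full workgroup plus a single straggler lane (65), and two full workgroups (128).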

2026-04-17 update — host-side infrastructure

scripts/dev-setup/bluefin.sh + ./run setup:bluefin — idempotent installer for weston, vulkan-tools, mesa-vulkan-drivers on bootc/Bluefin systems via rpm-ostree install --apply-live. --check mode for CI. Delegates EDIT→RUN via $AUTOPLAY_HOST when invoked from EDIT.
~/Code/bootc-bluefin/containerfiles/Containerfile.desktop-core updated on apricot with vulkan-tools + mesa-vulkan-drivers added alongside weston. Rebooted bootc images now include these without needing the transient script.

2026-04-17 update — fresh A5 attempt post-fix (failed on host SIGTERM)

After the four WGSL parity fixes landed and GDExtension rebuilt, fresh A5 batches were attempted under multiple process-isolation strategies:

Strategy	Batch dir	Result
plain nohup	`.local/iter/a5-fresh-20260417_122847/`	exit 143, seeds `in_progress` T5–T10 before kill
nohup + new dir	`.local/iter/a5-final-20260417_122936/`	games launched, no completion.marker written (process killed)
bash SIGTERM trap	`.local/iter/a5-trap-20260417_123021/`	trap handler received NO signal; script exited rc=143
strace signal trace	`.local/iter/a5-strace-20260417_123200/`	revealed autoplay-batch.sh exits status 1 (not 143); no SIGTERM to parent. Root cause: `0/N games produced turn_stats.jsonl` check fires because flatpak Godot scopes end at 3–10s
`systemd-run --user`	`.local/iter/warcouncil-a5-systemd-*/`	same — service `Active: inactive (dead)` after 2s, scope children SIGTERMed
`KillMode=none`	`.local/iter/warcouncil-a5-systemd-*` (2nd)	games reached T9–T10 only; same kill pattern
plain `bash autoplay-batch` synchronous	`.local/iter/a5-direct-123300/`	10 games with 0-line `turn_stats.jsonl` — games get SIGTERMed during map generation

Seven distinct execution strategies, same failure pattern: flatpak Godot scopes are SIGTERMed within 3–10s of launch, during map generation or at best around T10. Investigation found the signal is NOT delivered by systemd-oomd (the service is in a failed state), rpm-ostree automatic updates (timer inactive), or apricot-rail-watchdog (emit-only). The actual SIGTERM source could not be identified in the apricot user session. A parallel agent's batches from earlier the same day (e.g. .local/batches/blackhammer_tune_20260417_101447/) completed fine, so the issue is transient/session-bound, NOT a permanent host failure.
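The distinction between "exit 143" and "status 1" in the table rests on the POSIX shell convention that a process killed by signal N is reported as exit status 128 + N. A small helper (hypothetical, not part of autoplay-batch.sh) makes the decoding explicit:

```rust
/// Decode a shell-reported exit status into the terminating signal number,
/// if the status follows the 128 + N convention.
pub fn signal_from_exit_status(status: i32) -> Option<i32> {
    if status > 128 && status < 256 {
        Some(status - 128)
    } else {
        None
    }
}
```

Status 143 decodes to signal 15 (SIGTERM), whereas status 1 is an ordinary failure exit. That is why the strace row points at the script's own `turn_stats.jsonl` check rather than at a signal delivered to the parent.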

Fresh A5 verdict: NOT HEALTHY, so B5 was not launched. Per warcouncil's integrity rule, we report the measurement failure honestly rather than claim that the correctness of the parity fixes translated into fresh gameplay evidence. The existing p0-01 batch data from the pre-parity-fix binary (at blackhammer_tune_20260417_101447) still stands as the most recent successful A5/B5 evidence in the repo.

Design outline

AbstractRolloutState — a ~256 byte #[repr(C)] Pod + Zeroable compression of the per-player yields / force ratios / tech index / relations hash / happiness pool / strategic ledger. Fits comfortably in a GPU uniform buffer array.
rollout.wgsl — a compute shader that steps AbstractRolloutState forward ~20 turns using a clan-parameterized heuristic policy (reading the same StrategicWeights the CPU evaluator uses), then emits a win-probability estimate for the originating player.
MctsEngine::batch_simulate_gpu(&[AbstractRolloutState], &GpuContext) -> Vec<f32> — dispatches 256–1024 rollouts per MCTS leaf; falls back to batch_simulate_cpu when no GPU adapter is available.
Re-uses the SplitMix64 seeding / sorted-dispatch determinism contract already proven in mc-turn/src/gpu/fauna_encounter.wgsl.
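The design points above can be sketched as follows. Everything here is illustrative: the field names and sizes of `AbstractRolloutState` are hypothetical, `lane_seed` and `sorted_dispatch_order` are one plausible reading of the "SplitMix64 seeding / sorted-dispatch" contract, and only the SplitMix64 mixing function itself is the standard published algorithm.

```rust
/// Hypothetical 256-byte layout; field names and sizes are illustrative only.
/// 64-bit values are split into u32 pairs because WGSL has no native u64.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct AbstractRolloutState {
    pub yields: [f32; 8],            //  32 bytes
    pub force_ratios: [f32; 8],      //  32 bytes
    pub tech_index: u32,             //   4 bytes
    pub happiness_pool: f32,         //   4 bytes
    pub relations_hash: [u32; 2],    //   8 bytes
    pub strategic_ledger: [f32; 32], // 128 bytes
    pub rng_seed: [u32; 2],          //   8 bytes
    pub _pad: [u32; 10],             //  40 bytes of explicit padding
}

// Compile-time check that the layout really is 256 bytes.
const _: () = assert!(std::mem::size_of::<AbstractRolloutState>() == 256);

/// Standard SplitMix64 step: advance the state and return a mixed output.
pub fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical per-lane seed: depends only on the batch seed and lane index,
/// never on scheduling order, so CPU and GPU can derive the same value.
pub fn lane_seed(batch_seed: u64, lane: u32) -> u64 {
    let mut state = batch_seed ^ u64::from(lane);
    splitmix64(&mut state)
}

/// Hypothetical sorted dispatch: order inputs by a stable key (ties broken by
/// original index) so lane assignment does not depend on submission order.
pub fn sorted_dispatch_order(keys: &[u64]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..keys.len()).collect();
    order.sort_by_key(|&i| (keys[i], i));
    order
}

/// Bytes needed for one batch of rollout states.
pub fn batch_bytes(rollouts: usize) -> usize {
    rollouts * std::mem::size_of::<AbstractRolloutState>()
}
```

A batch of 1024 states at 256 bytes each is 256 KiB, so it is worth comparing the batch size against `adapter.limits().max_uniform_buffer_binding_size` on the target adapter; a storage buffer may be the safer binding for the upper end of the 256–1024 range.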

Acceptance

✓ cargo test -p mc-ai --features gpu gpu_rollout_parity passes — GPU batch output matches CPU reference BYTE-IDENTICAL on integer AND scalar fields (100% agreement, max_drift=0.000000) across 209 inputs (16 + 65 + 128) on lavapipe software Vulkan. Exceeded the ≥98% tolerance bullet.
✗ AI_GPU_ROLLOUT=true ./tools/autoplay-batch.sh 10 300 wall-time drops ≥20% vs AI_GPU_ROLLOUT=false: NOT YET VERIFIED. apricot (the only available RUN host) SIGTERMs any Godot flatpak cluster at 3–10s wall-clock. This was initially suspected to be a host-infrastructure issue (apricot-rail-watchdog + user-scope cgroup pressure, with systemd-oomd in a failed state), but the fresh A5 investigation recorded above ruled out systemd-oomd and apricot-rail-watchdog and left the source unidentified. The kill reproduces under nohup, setsid, systemd-run --user --scope, and systemd-run --user --property=KillMode=none. Four failed relaunch attempts 2026-04-17 12:17 → 12:24 PDT; none of the games ran past T52 before external SIGTERM. Journal shows warcouncil-a5.service: Unit process N (timeout) remains running after unit stopped, which indicates the SIGTERM came from outside the service. Needs host-side investigation of whatever is killing scopes on apricot OR a different RUN host.
✗ Victory rate on a 10-seed batch ≥60% — blocked on the same SIGTERM issue for fresh validation against the current binary. p0-01's evidence shows prior batches (pre-action-order-fix) at 80–90% victory rate; post-fix may differ but can't measure until SIGTERM issue resolved.
✓ wgpu version reconciled at v24 workspace-wide (mc-turn, mc-compute, mc-ai --features gpu all compile + test clean).
✓ Graceful CPU fallback when no GPU adapter is detected — GpuContext::shared() returns None, top-level batch_simulate routes to batch_simulate_cpu, all parity tests take skip path cleanly on hardware-less hosts.
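The fallback bullet describes a simple routing rule. A minimal sketch, with stand-in types and placeholder rollout bodies (the real `GpuContext`, state type, and signatures live in mc-ai and may differ):

```rust
/// Stand-in for the real GPU context; hypothetical and empty here.
pub struct GpuContext;

/// Which backend served a batch, returned only to make the routing observable.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Backend { Gpu, Cpu }

/// Placeholder CPU rollout: clamps each input into a [0, 1] win probability.
pub fn batch_simulate_cpu(states: &[f32]) -> Vec<f32> {
    states.iter().map(|s| s.clamp(0.0, 1.0)).collect()
}

/// Placeholder GPU rollout: mirrors the CPU path, as parity requires.
pub fn batch_simulate_gpu(states: &[f32], _ctx: &GpuContext) -> Vec<f32> {
    batch_simulate_cpu(states)
}

/// Top-level entry point: use the GPU when a context exists, else the CPU.
pub fn batch_simulate(states: &[f32], ctx: Option<&GpuContext>) -> (Backend, Vec<f32>) {
    match ctx {
        Some(gpu) => (Backend::Gpu, batch_simulate_gpu(states, gpu)),
        None => (Backend::Cpu, batch_simulate_cpu(states)),
    }
}
```

Because the two paths are required to agree byte for byte, callers should not need to know which backend ran; the `Backend` tag exists here only for illustration.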

Remaining to reach done

Resolve apricot SIGTERM issue (host infra, NOT warcouncil scope) OR stand up a second RUN host without the same kill daemon, then re-run the wall-time comparison batch + 10-seed victory-rate batch. Everything else in the acceptance list has been met or verified.
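Once a healthy RUN host is available, the two outstanding acceptance gates reduce to simple arithmetic. The helper names below are hypothetical; the thresholds (≥20% wall-time drop, ≥60% victory rate) come from the acceptance list.

```rust
/// Fractional wall-time reduction of the GPU run relative to the CPU baseline.
pub fn wall_time_drop(baseline_secs: f64, gpu_secs: f64) -> f64 {
    (baseline_secs - gpu_secs) / baseline_secs
}

/// Gate: the GPU run must be at least 20% faster than the baseline.
pub fn meets_wall_time(baseline_secs: f64, gpu_secs: f64) -> bool {
    wall_time_drop(baseline_secs, gpu_secs) >= 0.20
}

/// Gate: at least 60% of games won, checked in integers to avoid rounding.
pub fn meets_victory_rate(wins: u32, games: u32) -> bool {
    games > 0 && wins * 10 >= games * 6
}
```

For example, a baseline of 100 s against a GPU run of 80 s just meets the wall-time gate, and 6 wins out of 10 seeds just meets the victory-rate gate.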

Depends on

p0-01 — MCTS wiring must be live before GPU rollouts replace the simulate() stub.
p0-02 — clan personality weights are what the rollout kernel consumes to differentiate Ironhold / Blackhammer / Goldvein / Deepforge / Runesmith play.

Non-goals

Neural-network policy / value net (AlphaZero-style). This objective delivers a heuristic-policy light rollout, not a learned policy.
Full-fidelity GPU simulation of combat / production / city growth. Only the abstract state advances on the GPU; the authoritative turn resolver stays on the CPU (and remains owned by mc-turn).
Replacing simple_heuristic_ai.gd. The heuristic remains the tactical executor; MCTS only produces the strategic directive.
