10 KiB
| id | title | priority | status | scope | owner | updated_at | evidence | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| p0-20 | GPU-accelerated MCTS rollouts for look-ahead decision-making | p0 | partial | game1 | warcouncil | 2026-04-17 |
|
Summary
The MCTS tree (mcts_tree.rs) and the mc-turn GPU fauna pipeline are both live
on main, but the AI cannot currently afford wide tree search: full
GridState cloning (~12 MB at 256×256) blows out RAM long before the tree is
deep enough to matter, and TreeState::simulate() is a 0.5 stub. This objective
introduces a GPU-batched abstract rollout layer so the tree search can
evaluate hundreds of candidate futures per leaf at single-digit-millisecond
cost.
2026-04-17 update — GPU↔CPU numerical parity ACHIEVED
Phase C structural work shipped in the earlier team pass but the parity test was silently taking the skip path on headless hosts — the shader had never actually compiled on any adapter. A deep audit + four independent fixes landed this cycle proving real numerical parity:
- WGSL reserved-keyword bug:
var active: u32 = 0uatrollout.wgsl:607used theactivereserved word → Naga parse panic → wgpu_core handler → try_init worker thread panic → timeout returned None → skip-path. Renamed toactive_idx; the shader now actually compiles. Without this, the skip-path was structurally "passing" every test in Phase C without ever exercising the WGSL kernel. - Adapter backend restriction:
wgpu::Backends::all()picked the NVIDIA OpenGL adapter first on apricot, whose compute support silently fails atrequest_device. Restricted toVULKAN | METAL | DX12 | BROWSER_WEBGPUwhich all have first-class compute paths. - Device limits fix:
Limits::default()targets a discrete GPU — too large for llvmpipe / lavapipe. Changed toLimits::downlevel_defaults().using_resolution(adapter.limits())so software Vulkan backends can satisfy device creation. - Action-walk order unified: the root numerical divergence. CPU
active_actions()returned actions in insertion order[Build, Research, Defend, Idle, Attack, ...]; WGSL iterated k=0..9 inActionKind::ALLnumerical order[Build, Attack, Settle, Research, ...]. Identical probabilities, identical RNG draw → different action picked at every cumulative-sum boundary. Rewroteactive_actions()to iterateActionKind::ALLin canonical order (with explicit docstring warning not to reorder for readability).
Parity verification on apricot (headless bluefin + lavapipe software
Vulkan): with MC_AI_GPU_DEBUG=1 VK_DRIVER_FILES=/usr/share/vulkan/icd.d/lvp_icd.x86_64.json
driving the tests on real llvmpipe dispatch, not skip-path:
[parity small_batch backend=Vulkan] n=16 agree=16/16 (1.000) max_drift=0.000000
[parity partial_workgroup backend=Vulkan] n=65 agree=65/65 (1.000) max_drift=0.000000
[parity multi_workgroup backend=Vulkan] n=128 agree=128/128 (1.000) max_drift=0.000000
buckets: <1e-6=all others=0 across all three tests
Not 98% (the stated tolerance) — 100% agreement, bit-identical on all 3
quantitative parity tests (209 inputs total). Pre-fixes: 3–6% agreement with
max_drift 0.025–0.043 (action-boundary flips). Post-fix: integer fields
byte-equal, scalar fields byte-equal. WGSL kernel is now a provable,
byte-for-byte port of rollout::walk.
2026-04-17 update — host-side infrastructure
scripts/dev-setup/bluefin.sh+./run setup:bluefin— idempotent installer forweston,vulkan-tools,mesa-vulkan-driverson bootc/Bluefin systems viarpm-ostree install --apply-live.--checkmode for CI. Delegates EDIT→RUN via$AUTOPLAY_HOSTwhen invoked from EDIT.~/Code/bootc-bluefin/containerfiles/Containerfile.desktop-coreupdated on apricot withvulkan-tools+mesa-vulkan-driversadded alongsideweston. Rebooted bootc images now include these without needing the transient script.
2026-04-17 update — fresh A5 attempt post-fix (failed on host SIGTERM)
After the four WGSL parity fixes landed and GDExtension rebuilt, fresh A5 batches were attempted under multiple process-isolation strategies:
| Strategy | Batch dir | Result |
|---|---|---|
| plain nohup | .local/iter/a5-fresh-20260417_122847/ |
exit 143, seeds in_progress T5–T10 before kill |
| nohup + new dir | .local/iter/a5-final-20260417_122936/ |
games launched, no completion.marker written (process killed) |
| bash SIGTERM trap | .local/iter/a5-trap-20260417_123021/ |
trap handler received NO signal; script exited rc=143 |
| strace signal trace | .local/iter/a5-strace-20260417_123200/ |
revealed autoplay-batch.sh exits status 1 (not 143); no SIGTERM to parent. Root cause: 0/N games produced turn_stats.jsonl check fires because flatpak Godot scopes end at 3–10s |
systemd-run --user |
.local/iter/warcouncil-a5-systemd-*/ |
same — service Active: inactive (dead) after 2s, scope children SIGTERMed |
KillMode=none |
.local/iter/warcouncil-a5-systemd-* (2nd) |
games reached T9–T10 only; same kill pattern |
plain bash autoplay-batch synchronous |
.local/iter/a5-direct-123300/ |
10 games with 0-line turn_stats.jsonl — games get SIGTERMed during map generation |
Seven distinct execution strategies, same failure pattern: flatpak Godot
scopes SIGTERMed within 3–10s of launch, before any turn completes. Investigation
found the signal is NOT delivered by systemd-oomd (failed service), rpm-ostree
automatic updates (timer inactive), or apricot-rail-watchdog (emit-only). The
actual SIGTERM source could not be identified in the apricot user session.
Parallel agent's own batches from earlier the same day (e.g.
.local/batches/blackhammer_tune_20260417_101447/) completed fine, so the
issue is transient/session-bound, NOT a permanent host failure.
Fresh A5 verdict — NOT HEALTHY, B5 therefore not launched. Per
warcouncil's integrity rule: we report the measurement failure honestly
rather than claim parity-fix-correctness translated into fresh gameplay
evidence. Existing p0-01 batch data from pre-parity-fix binary (at
blackhammer_tune_20260417_101447) still stands as the most recent
successful A5/B5 evidence in the repo.
Design outline
AbstractRolloutState— a ~256 byte#[repr(C)]Pod + Zeroablecompression of the per-player yields / force ratios / tech index / relations hash / happiness pool / strategic ledger. Fits comfortably in a GPU uniform buffer array.rollout.wgsl— a compute shader that stepsAbstractRolloutStateforward ~20 turns using a clan-parameterized heuristic policy (reading the sameStrategicWeightsthe CPU evaluator uses), then emits a win-probability estimate for the originating player.MctsEngine::batch_simulate_gpu(&[AbstractRolloutState], &GpuContext) -> Vec<f32>— dispatches 256–1024 rollouts per MCTS leaf; falls back tobatch_simulate_cpuwhen no GPU adapter is available.- Re-uses the
SplitMix64seeding / sorted-dispatch determinism contract already proven inmc-turn/src/gpu/fauna_encounter.wgsl.
Acceptance
- ✓
cargo test -p mc-ai --features gpu gpu_rollout_paritypasses — GPU batch output matches CPU reference BYTE-IDENTICAL on integer AND scalar fields (100% agreement, max_drift=0.000000) across 209 inputs (16 + 65 + 128) on lavapipe software Vulkan. Exceeded the ≥98% tolerance bullet. - ✗
AI_GPU_ROLLOUT=true ./tools/autoplay-batch.sh 10 300wall-time drops ≥20% vsAI_GPU_ROLLOUT=false— NOT YET VERIFIED. apricot (the only available RUN host) SIGTERMs any Godot flatpak cluster at 3–10s wall-clock (apparently host-infrastructure issue:apricot-rail-watchdog+ user-scope cgroup pressure; systemd-oomd failed; reproduces undernohup,setsid,systemd-run --user --scope, andsystemd-run --user --property=KillMode=none). Four failed relaunch attempts 2026-04-17 12:17 → 12:24 PDT; none of the games ran past T52 before external SIGTERM. Journal showswarcouncil-a5.service: Unit process N (timeout) remains running after unit stopped— SIGTERM came from outside the service. Needs host-side investigation of apricot's scope-kill daemon OR a different RUN host. - ✗ Victory rate on a 10-seed batch ≥60% — blocked on the same SIGTERM issue for fresh validation against the current binary. p0-01's evidence shows prior batches (pre-action-order-fix) at 80–90% victory rate; post-fix may differ but can't measure until SIGTERM issue resolved.
- ✓ wgpu version reconciled at v24 workspace-wide (
mc-turn,mc-compute,mc-ai --features gpuall compile + test clean). - ✓ Graceful CPU fallback when no GPU adapter is detected —
GpuContext::shared()returns None, top-levelbatch_simulateroutes tobatch_simulate_cpu, all parity tests take skip path cleanly on hardware-less hosts.
Remaining to reach done
- Resolve apricot SIGTERM issue (host infra, NOT warcouncil scope) OR stand up a second RUN host without the same kill daemon, then re-run the wall-time comparison batch + 10-seed victory-rate batch. Everything else in the acceptance list has been met or verified.
Depends on
p0-01— MCTS wiring must be live before GPU rollouts replace the simulate() stub.p0-02— clan personality weights are what the rollout kernel consumes to differentiate Ironhold / Blackhammer / Goldvein / Deepforge / Runesmith play.
Non-goals
- Neural-network policy / value net (AlphaZero-style). This objective delivers a heuristic-policy light rollout, not a learned policy.
- Full-fidelity GPU simulation of combat / production / city growth. Only the
abstract state advances on the GPU; the authoritative turn resolver stays on
the CPU (and owned by
mc-turn). - Replacing
simple_heuristic_ai.gd. The heuristic remains the tactical executor; MCTS only produces the strategic directive.