magicciv/.project/objectives/g2-04-multi-gpu-batch-simulate-oos.md at 75fdf14f4d7c0e3c8bb868cda8af09c3b822f940

Natalie 98402e156e feat(@projects): ✨ add climate and ecology systems

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>

2026-04-17 23:19:41 -07:00


title

priority

status

scope

owner

updated_at

evidence

g2-04

Multi-GPU sharding for batch_simulate_gpu — out-of-scope (Game 2)

oos

game2

warcouncil

2026-04-17

src/simulator/crates/mc-ai/src/gpu/inner.rs

src/simulator/crates/mc-ai/src/gpu/mod.rs

Out of scope for Game 1 "Age of Dwarves". Game 1 batch sizes (64–256 rollouts per MCTS leaf) are nowhere near single-GPU saturation on apricot's RTX 3090, and the shippable Game 1 machine profile isn't guaranteed to have two GPUs anyway. This optimization is queued for Game 2 "Age of Kzzykt" when deeper MCTS lookahead + larger clan-on-clan batches are expected to push single-GPU throughput to the ceiling.

Summary

mc-ai::gpu::inner::GpuContext::shared() (at src/simulator/crates/mc-ai/src/gpu/inner.rs:189) picks exactly ONE adapter via instance.request_adapter(PowerPreference::HighPerformance). On multi-GPU hosts this leaves every adapter past #0 idle from our compute perspective.

apricot has 2× NVIDIA RTX 3090. Right now batch_simulate_gpu uses one of them (whichever wgpu selects — typically GPU0). GPU1 sits at 0% compute util from our workload. p0-20 wall-time comparisons are therefore measured against a halved ceiling.

Acceptance

✗ New MultiGpuContext (or GpuContext::all() variant) that owns Vec<GpuDevice>, one per adapter returned from enumerate_adapters(backends), filtered to compute-capable adapters. Each GpuDevice wraps the existing device + queue + pipeline + bind_group_layout set.
✗ batch_simulate_gpu_multi(inputs, priors, seed, horizon) shards inputs across devices round-robin (batch_index % num_devices) and dispatches in parallel via rayon. Per-device results are scattered back into original batch-index order (a plain concatenation of the shards would interleave them incorrectly).
✗ Determinism gate: for the same (inputs, priors, seed, horizon), multi-device output MUST equal single-device output bit-identically on integer fields + within existing parity tolerance on scalar fields. New test: gpu_multi_adapter_parity in src/simulator/crates/mc-ai/tests/gpu_rollout_parity.rs that skips cleanly when enumerate_adapters().count() < 2.
✗ Fallback: if one adapter fails mid-batch (e.g., VRAM exhausted; apricot's GPU0 shares VRAM with the user's llama-server at ~13 GiB resident), degrade gracefully to the remaining adapter(s) or to CPU. A partial failure must not panic.
✗ Tree::iterate_gpu_batched caller opts into multi-GPU via a new GpuContext variant selector, default remains single-GPU for backward compat.
✗ p0-20 wall-time comparison re-run with multi-GPU enabled; cite measured throughput delta vs single-GPU. Expect ~2× for adequately-sized batches (probably ≥128 rollouts per dispatch given dispatch overhead).
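The sharding, fallback, and ordering criteria above can be sketched without touching wgpu. This is a minimal, hedged illustration rather than the mc-ai implementation: every name in it (`Outcome`, `DeviceError`, `shard_round_robin`, `rollout`, `run_on_device`, `run_on_cpu`, `batch_simulate_multi`) is hypothetical, the "devices" are simulated by a health flag, the rollout is a toy integer hash, and `std::thread::scope` stands in for rayon so the block stays dependency-free.

```rust
use std::thread;

/// Hypothetical stand-in for one rollout's integer result.
type Outcome = u64;

/// Hypothetical per-device failure (for example, VRAM exhausted).
#[derive(Debug)]
struct DeviceError;

/// Batch index `i` goes to shard `i % num_devices` (requires num_devices > 0).
fn shard_round_robin(batch_len: usize, num_devices: usize) -> Vec<Vec<usize>> {
    let mut shards = vec![Vec::new(); num_devices];
    for i in 0..batch_len {
        shards[i % num_devices].push(i);
    }
    shards
}

/// Toy deterministic rollout: the result depends only on (input, seed, index),
/// never on which device ran it. That property is what makes parity possible.
fn rollout(input: u32, seed: u64, index: usize) -> Outcome {
    let mut x = seed ^ (input as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15) ^ (index as u64);
    x ^= x >> 33;
    x = x.wrapping_mul(0xFF51_AFD7_ED55_8CCD);
    x ^ (x >> 33)
}

/// Simulated device dispatch; `healthy == false` models an adapter failure.
fn run_on_device(
    healthy: bool,
    inputs: &[u32],
    idx: &[usize],
    seed: u64,
) -> Result<Vec<Outcome>, DeviceError> {
    if !healthy {
        return Err(DeviceError);
    }
    Ok(idx.iter().map(|&i| rollout(inputs[i], seed, i)).collect())
}

/// CPU fallback that computes the same shard.
fn run_on_cpu(inputs: &[u32], idx: &[usize], seed: u64) -> Vec<Outcome> {
    idx.iter().map(|&i| rollout(inputs[i], seed, i)).collect()
}

/// Shard round-robin, run shards in parallel, fall back per shard on failure,
/// then scatter results back into original batch-index order.
fn batch_simulate_multi(inputs: &[u32], seed: u64, device_health: &[bool]) -> Vec<Outcome> {
    if device_health.is_empty() {
        // No usable device at all: run the whole batch on the CPU.
        let all: Vec<usize> = (0..inputs.len()).collect();
        return run_on_cpu(inputs, &all, seed);
    }
    let shards = shard_round_robin(inputs.len(), device_health.len());
    let per_shard: Vec<Vec<Outcome>> = thread::scope(|s| {
        let handles: Vec<_> = shards
            .iter()
            .zip(device_health)
            .map(|(idx, &healthy)| {
                s.spawn(move || {
                    // A failed device degrades to CPU instead of panicking.
                    run_on_device(healthy, inputs, idx, seed)
                        .unwrap_or_else(|_| run_on_cpu(inputs, idx, seed))
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    });
    // Scatter: shard k holds indices k, k + n, k + 2n, ... so concatenation is wrong.
    let mut out = vec![0; inputs.len()];
    for (idx, results) in shards.iter().zip(&per_shard) {
        for (&i, &r) in idx.iter().zip(results) {
            out[i] = r;
        }
    }
    out
}
```

In this sketch, calling `batch_simulate_multi(&inputs, seed, &[true])` and `batch_simulate_multi(&inputs, seed, &[true, true])` yields identical vectors, and so does `&[false, true]`, because each result is keyed by its batch index and the fallback recomputes the same shard. A real implementation would additionally need the GPU kernel itself to be independent of shard placement (for instance, deriving per-rollout RNG state from the global batch index rather than the local dispatch index), which is an assumption here and not something the source confirms.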

Depends on

p0-20 must be closed first. The single-GPU path has to be measured and verified before we start sharding across multiple devices; otherwise any regression is untraceable. Closing p0-20 unblocks this objective.

Non-goals

Cross-node GPU (multi-machine). Single-host, single-process only.
Dynamic rebalancing if one device is slower than another. Round-robin is sufficient for homogeneous multi-GPU hosts (apricot's 2× 3090). Heterogeneous hosts (e.g., one 3090 + one 4090) would need weighted sharding — separate concern.
Cross-GPU shared-memory tricks. Each device gets its own buffer set.

Why this exists

apricot is a 2× RTX 3090 box; using only one halves throughput on any compute-bound rollout workload. With MCTS rollout budgets likely to grow beyond today's 64–256 per dispatch as the AI matures (deeper lookahead, bigger abstract-state POD, or multi-clan self-play), single-GPU will become the bottleneck. Having the multi-GPU shim specified ahead of time means we don't have to re-architect mc-ai when we hit that ceiling.

Also relevant: hosts with co-resident GPU workloads (e.g., the user's llama-server holding ~13 GiB on one 3090) can have per-device VRAM pressure. Sharding lets us use the quiet device even when the other is saturated by an unrelated workload.
