| id | title | priority | status | scope | owner | updated_at | evidence |
|---|---|---|---|---|---|---|---|
| g2-04 | Multi-GPU sharding for batch_simulate_gpu — out-of-scope (Game 2) | p3 | oos | game2 | warcouncil | 2026-04-17 |
Out of scope for Game 1 "Age of Dwarves". Game 1 batch sizes (64–256 rollouts per MCTS leaf) are nowhere near single-GPU saturation on apricot's RTX 3090, and the shippable Game 1 machine profile isn't guaranteed to have two GPUs anyway. This optimization is queued for Game 2 "Age of Kzzykt" when deeper MCTS lookahead + larger clan-on-clan batches are expected to push single-GPU throughput to the ceiling.
Summary
mc-ai::gpu::inner::GpuContext::shared() (at src/simulator/crates/mc-ai/src/gpu/inner.rs:189) picks exactly ONE adapter via instance.request_adapter(PowerPreference::HighPerformance). On multi-GPU hosts this leaves every adapter past #0 idle from our compute perspective.
apricot has 2× NVIDIA RTX 3090. Right now batch_simulate_gpu uses one of them (whichever wgpu selects — typically GPU0). GPU1 sits at 0% compute util from our workload. p0-20 wall-time comparisons are therefore measured against a halved ceiling.
Acceptance
- ✗ New MultiGpuContext (or GpuContext::all() variant) that owns Vec<GpuDevice>, one per adapter returned from enumerate_adapters(backends), filtered to compute-capable adapters. Each GpuDevice wraps the existing device + queue + pipeline + bind_group_layout bundle.
- ✗ batch_simulate_gpu_multi(inputs, priors, seed, horizon) shards inputs across devices round-robin (batch_index % num_devices) and dispatches in parallel via rayon. Results are concatenated in original batch-index order.
- ✗ Determinism gate: for the same (inputs, priors, seed, horizon), multi-device output MUST equal single-device output bit-identically on integer fields and within the existing parity tolerance on scalar fields. New test: gpu_multi_adapter_parity in src/simulator/crates/mc-ai/tests/gpu_rollout_parity.rs that skips cleanly when enumerate_adapters().count() < 2.
- ✗ Fallback: if one adapter fails mid-batch (e.g., VRAM exhausted: apricot's GPU0 shares VRAM with the user's llama-server at ~13 GiB resident), degrade gracefully to the remaining adapter(s) or CPU. Partial failure must not panic.
- ✗ Tree::iterate_gpu_batched caller opts into multi-GPU via a new GpuContext variant selector; the default remains single-GPU for backward compat.
- ✗ p0-20 wall-time comparison re-run with multi-GPU enabled; cite measured throughput delta vs single-GPU. Expect ~2× for adequately-sized batches (probably ≥128 rollouts per dispatch given dispatch overhead).
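The sharding, ordered merge, and fallback items above can be sketched without any GPU code. This is only an illustration under stated assumptions, not the mc-ai implementation: MockDevice, RolloutResult, rollout, run_cpu, and batch_simulate_multi are hypothetical names, priors are omitted, std::thread::scope stands in for rayon so the sketch needs no external crates, and a failed shard falls back directly to CPU rather than to a remaining adapter.

```rust
use std::thread;

/// Hypothetical stand-in for one rollout result (the real type is not shown here).
#[derive(Debug, Clone, PartialEq)]
pub struct RolloutResult {
    pub winner: u32, // integer field: must match bit-for-bit across device counts
    pub score: f32,  // scalar field
}

/// Hypothetical stand-in for one adapter. `healthy == false` simulates a
/// device that fails mid-batch (for example, VRAM exhaustion).
pub struct MockDevice {
    pub healthy: bool,
}

/// Deterministic per-item rollout: the result depends only on
/// (input, seed, horizon), never on which device ran it.
fn rollout(input: u32, seed: u64, horizon: u32) -> RolloutResult {
    let mixed = (input as u64)
        .wrapping_mul(6364136223846793005)
        .wrapping_add(seed)
        .wrapping_add(horizon as u64);
    RolloutResult {
        winner: (mixed >> 33) as u32 % 4,
        score: (mixed % 1000) as f32 / 1000.0,
    }
}

impl MockDevice {
    /// Runs one shard; each item carries its original batch index.
    fn run(&self, shard: &[(usize, u32)], seed: u64, horizon: u32)
        -> Result<Vec<(usize, RolloutResult)>, String>
    {
        if !self.healthy {
            return Err("device lost".to_string());
        }
        Ok(shard.iter().map(|&(i, x)| (i, rollout(x, seed, horizon))).collect())
    }
}

/// CPU fallback path; computes the same per-item function.
fn run_cpu(shard: &[(usize, u32)], seed: u64, horizon: u32) -> Vec<(usize, RolloutResult)> {
    shard.iter().map(|&(i, x)| (i, rollout(x, seed, horizon))).collect()
}

pub fn batch_simulate_multi(
    devices: &[MockDevice],
    inputs: &[u32],
    seed: u64,
    horizon: u32,
) -> Vec<RolloutResult> {
    let n = devices.len().max(1);

    // Round-robin sharding: batch_index % num_devices.
    let mut shards: Vec<Vec<(usize, u32)>> = vec![Vec::new(); n];
    for (i, &x) in inputs.iter().enumerate() {
        shards[i % n].push((i, x));
    }

    // One scoped thread per shard. A failed or missing device degrades
    // to the CPU path instead of propagating the error.
    let partials: Vec<Vec<(usize, RolloutResult)>> = thread::scope(|s| {
        let handles: Vec<_> = shards
            .iter()
            .enumerate()
            .map(|(d, shard)| {
                s.spawn(move || match devices.get(d) {
                    Some(dev) => dev
                        .run(shard, seed, horizon)
                        .unwrap_or_else(|_| run_cpu(shard, seed, horizon)),
                    None => run_cpu(shard, seed, horizon),
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    });

    // Reassemble results in original batch-index order.
    let mut out: Vec<Option<RolloutResult>> = vec![None; inputs.len()];
    for (i, r) in partials.into_iter().flatten() {
        out[i] = Some(r);
    }
    out.into_iter()
        .map(|r| r.expect("every index belongs to exactly one shard"))
        .collect()
}
```

Because each item carries its batch index through the shard and the per-item function never sees the device, the output for two devices (or for one healthy and one failed device) should compare equal to the single-device output. That is the property the determinism gate is meant to check on the real GPU path.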
Depends on
p0-20: must be closed first. The single-GPU path has to be measured and verified BEFORE we start sharding across multiple devices (otherwise a regression is untraceable). p0-20 closure unblocks this objective.
Non-goals
- Cross-node GPU (multi-machine). Single-host, single-process only.
- Dynamic rebalancing if one device is slower than another. Round-robin is sufficient for homogeneous multi-GPU hosts (apricot's 2× 3090). Heterogeneous hosts (e.g., one 3090 + one 4090) would need weighted sharding — separate concern.
- Cross-GPU shared-memory tricks. Each device gets its own buffer set.
Why this exists
apricot is a 2× RTX 3090 box; using only one halves throughput on any compute-bound rollout workload. With MCTS rollout budgets likely to grow beyond today's 64-256 per dispatch as the AI matures (deeper lookahead, bigger abstract-state POD, or multi-clan self-play), single-GPU will become the bottleneck. Landing the multi-GPU shim now means we don't have to re-architect mc-ai when we hit that ceiling.
Also relevant: hosts with coresident GPU workloads (user's llama-server holding ~13 GiB on one 3090) can have per-device VRAM pressure. Sharding lets us use the quiet device even when the other is saturated by an unrelated workload.