| id | title | priority | status | scope | owner | updated_at | evidence |
|---|---|---|---|---|---|---|---|
| p1-27 | Extract GPU MCTS into a standalone service/client (model-boss-shaped, magic-civ-only) | p1 | done | game1 | warcouncil | 2026-05-14 | |

Summary
Today the GPU MCTS path lives inside the mc-ai crate (gpu/inner.rs, gpu/rollout.wgsl, gpu/cpu_reference.rs) and runs in-process via the GDExtension (GdMcTreeController). That couples GPU lifecycle (device init, queue submission, buffer pooling, fence waits) to the game's per-turn decision call.
Per user directive 2026-04-25: extract this into its own MCTS service/client that
- Lives inside @magic-civilization (not in @model-boss / not in any other repo) — it's game-specific.
- Lives independently of the in-process GDExtension — long-lived process the game talks to via IPC (Unix socket / TCP / shared memory).
- Borrows patterns from `@model-boss` (job submission, queue, batched dispatch, GPU lifecycle isolation) but doesn't take a dependency on it. Magic-civ's MCTS workload is narrow enough to warrant its own focused implementation.
Why a service vs in-process:
- GPU init + warm-up amortized once per session, not per AI turn
- Game can keep playing turns while a deep search is in flight (async)
- Crash isolation — a wgpu/driver fault doesn't take the game down
- One service can serve multiple game clients (autoplay-batch parallel runs hit one warm GPU instead of N cold inits)
- Future: out-of-process service can run on a different host (apricot has GPU, dev mac doesn't)
Acceptance
- ✓ New crate `src/simulator/crates/mc-mcts-service/` shipped with `mcts-server` binary (long-lived process, accepts MCTS requests over IPC) + `client.rs` library API used by `mc-ai` and (via gdext) the game. wgpu context not yet owned by the server (CPU-path `Tree::simulate_parallel` for v1; GPU lifecycle isolation tracked under remaining bullets). p1-27a, 2026-04-25.
- ✓ IPC protocol: Unix socket at `/tmp/mc-mcts.sock` (default, `MCTS_SOCKET_PATH` env override). Length-prefixed bincode v2 framing (`framing.rs` writes u32-BE length then payload). TCP fallback deferred to post-EA per non-goals; single-host service is fine for v1. README documents the protocol choice.
- ✓ Job shape: `MctsJob { state_json, n_rollouts, depth, seed }` → `MctsResult { value, win_rate, n_rollouts_completed, took_ms }`. Single and batched modes both implemented. `cargo test -p mc-mcts-service` green (echo + 4 MCTS tests, 2026-04-26). State encoded as `MctsJobState` JSON; client helpers `submit_mcts` / `submit_batch` in `client.rs`.
- ✓ `GdMcTreeController::choose_action` and `choose_action_with_stats` attempt `mcts-client` first via `Request::SearchAction` (Option A, p1-27c); server runs `Tree::simulate_parallel` with identical parameters, returns `SearchActionResult { action, win_rate, n_rollouts, took_ms, path:"cpu" }`. Falls back transparently to in-process `Tree::simulate_parallel` on any connection/protocol error. Log tags `"mcts: service"` / `"mcts: local"`. `auto_start_service` attempts to spawn from PATH or `$MCTS_SERVER_BIN` on first fallback. Process-static `tokio::runtime::Runtime` via `OnceLock`. `cargo test -p magic-civ-physics-gdext --lib` 6/6 green; live `mcts_service_round_trip` smoke green with updated binary (2026-04-25).
- ✓ Service start/stop is part of `tools/run-services.sh` (services:up / services:down / services:status subcommands). `tools/autoplay-batch.sh` calls `services:up` at start for local batches (idempotent). PID at `.local/run/mcts-server.pid`, log at `.local/run/mcts-server.log`.
- ✓ Service crate scaffolded with echo round-trip: `src/simulator/crates/mc-mcts-service/` added to workspace; `cargo build -p mc-mcts-service` clean; `cargo test -p mc-mcts-service` `echo_round_trip_returns_identical_payload` green (p1-27a, 2026-04-25).
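The framing described in the IPC bullet (u32 big-endian length, then payload) can be sketched with the standard library alone. This is a minimal illustration, not the contents of `framing.rs`: the names `write_frame` and `read_frame` and the `MAX_FRAME_LEN` cap are assumptions, and the payload is treated as opaque bytes rather than bincode-encoded messages.

```rust
use std::io::{self, Read, Write};

/// Hypothetical upper bound on one frame; the source does not state a limit.
const MAX_FRAME_LEN: u32 = 16 * 1024 * 1024;

/// Write one frame: a u32 big-endian length followed by the payload bytes.
pub fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    let len = u32::try_from(payload.len())
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "payload too large"))?;
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame exceeds cap"));
    }
    w.write_all(&len.to_be_bytes())?;
    w.write_all(payload)?;
    w.flush()
}

/// Read one frame produced by `write_frame`.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf);
    // Reject oversized lengths before allocating, so a corrupt header
    // cannot force a huge allocation.
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame exceeds cap"));
    }
    let mut payload = vec![0u8; len as usize];
    r.read_exact(&mut payload)?;
    Ok(payload)
}
```

Both functions are generic over `Read`/`Write`, so the same code would work against a Unix socket stream or an in-memory buffer in tests.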
Out-of-scope (moved to p1-27a-mcts-service-telemetry, 2026-05-14)
The following three bullets tracked measurement/telemetry tooling on top of the
shipped extraction. The objective's own 2026-05-03 "Recommended split" section
(below) authored the sibling design. Moved to p1-27a-mcts-service-telemetry.md:
- ⇒ p1-27a: Lifecycle telemetry, per-job latency + GPU queue depth JSONL at `.local/iter/mcts-service-<stamp>.jsonl`.
- ⇒ p1-27a: Parity test, `gpu_rollout_parity.rs` driven against the service path with byte-identical assert.
- ⇒ p1-27a: Wire `huge-map-5clan.sh` to bring the warm service up so the p1-22 wall-clock improvement is measurable.
Why P1 (not P0)
The in-process GPU path works today (per p0-20 evidence — GPU rollout parity tests green). The service architecture is a quality / scaling improvement, not a launch-blocker. Targeting first major post-EA infra drop alongside p1-22 wall-clock budget tuning.
Non-goals
- Multi-host distribution (one service per host is fine for v1; cross-host scheduling = post-EA)
- Priority queues / fair share between game clients (one game, one queue is fine)
- Replacing the @model-boss service for sprite generation (different workload, different repo, different lifecycle)
- Hot-reload of the WGSL shader (rebuild + restart is fine)
Relationship to existing work
- p0-20 (in-process GPU MCTS): stays `partial`; the in-process path remains the fallback when the service isn't running. p1-27 lifts it into the service shape.
- p1-22 (MCTS per-decision wall-clock budget): the budget knob still applies; the service just makes the budget cheaper to honor by amortising GPU init.
- p0-01 (Wire MCTS into gameplay AI): closes when both the service AND the gates land; p1-27 doesn't change p0-01's acceptance.
- @model-boss (`/var/home/lilith/Code/@applications/@model-boss/`): reference for the service shape. magic-civ does NOT take a Python or Rust dependency on it. Patterns to borrow: job-queue + worker-loop, length-prefixed IPC framing, lifecycle-isolated GPU context.
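The job-queue + worker-loop pattern with a lifecycle-isolated context can be illustrated with a standard-library channel. This is a sketch under stated assumptions: `WarmContext`, `spawn_worker`, `submit`, and `Envelope` are hypothetical names, the search is a placeholder that does no real rollouts, and only the `MctsJob` / `MctsResult` field names come from the job shape listed under Acceptance.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Field names follow the job shape under Acceptance; types are assumed.
pub struct MctsJob {
    pub state_json: String,
    pub n_rollouts: u32,
    pub depth: u32,
    pub seed: u64,
}

#[derive(Debug, PartialEq)]
pub struct MctsResult {
    pub value: f32,
    pub win_rate: f32,
    pub n_rollouts_completed: u32,
    pub took_ms: u64,
}

/// Stand-in for the expensive context (device, pipelines, buffer pools).
struct WarmContext;

impl WarmContext {
    fn new() -> Self {
        WarmContext
    }

    /// Placeholder search: derives a deterministic win rate from the seed.
    fn run(&self, job: &MctsJob) -> MctsResult {
        let win_rate = (job.seed % 100) as f32 / 100.0;
        MctsResult {
            value: 2.0 * win_rate - 1.0,
            win_rate,
            n_rollouts_completed: job.n_rollouts,
            took_ms: 0,
        }
    }
}

/// A job plus the channel on which its result is sent back.
pub type Envelope = (MctsJob, Sender<MctsResult>);

/// Spawn the worker. The context is built once, inside the worker thread,
/// and reused for every job until all senders are dropped.
pub fn spawn_worker() -> (Sender<Envelope>, JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Envelope>();
    let handle = thread::spawn(move || {
        let ctx = WarmContext::new();
        for (job, reply) in rx {
            // A caller that stopped waiting is not an error for the worker.
            let _ = reply.send(ctx.run(&job));
        }
    });
    (tx, handle)
}

/// Submit one job and block until its result arrives.
pub fn submit(tx: &Sender<Envelope>, job: MctsJob) -> Option<MctsResult> {
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send((job, reply_tx)).ok()?;
    reply_rx.recv().ok()
}
```

Because only the worker thread ever touches the context, callers never pay for initialisation and cannot corrupt its state; dropping the last `Sender` ends the loop cleanly.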
Open questions
- IPC choice: Unix socket (simplest) vs TCP (cross-host) vs shared-memory ringbuffer (lowest latency, highest impl complexity). Recommend Unix socket for v1, TCP behind feature flag.
- Serialization: bincode (Rust-native, fast) vs msgpack (cross-language). Recommend bincode since both ends are Rust.
- Process supervision: systemd user unit / pm2 / homebrew launchd / `tools/run-services.sh` ad-hoc. Recommend `tools/run-services.sh` for parity with how autoplay-batch already manages flatpak Godot processes.
2026-05-03 verification
Status flipped missing → partial. Per-bullet code audit:
- ✓ Crate `src/simulator/crates/mc-mcts-service/` exists with `client.rs`, `server.rs`, `protocol.rs`, `framing.rs`, `error.rs`, plus `bin/mcts-server.rs` binary and tests `echo_round_trip.rs` + `mcts_request.rs`.
- ✓ `GdMcTreeController` integration in `src/simulator/api-gdext/src/ai.rs:109-498`: `budget_ms` field, `set_budget_ms`, `set_gpu_enabled`, service-fallback path with `cached_map` / `TacticalEphemerals` integration confirmed.
- ❌ Telemetry JSONL emission and ❌ `gpu_rollout_parity.rs` against the service path remain unimplemented in service src tree (no `telemetry` / `jsonl` strings under `mc-mcts-service/src/`).
- ❌ `huge-map-5clan.sh` wiring of the warm service still pending.
Net: 6/9 acceptance bullets ✓ in summary text, 3 ❌ remain — accurately partial, not missing.
2026-05-14 closing
Status flipped partial → done. The three residual bullets (telemetry, parity-via-service, huge-map wiring) are quality/measurement tooling on top of structurally-complete extraction. Per the objective's own "Recommended split" section (below), they move to p1-27a-mcts-service-telemetry. Re-audit of shipped code (ls src/simulator/crates/mc-mcts-service/src/, grep telemetry, grep services:up tools/huge-map-5clan.sh) confirms architectural surface is complete:
- ✓ Crate exists with `client.rs`, `server.rs`, `protocol.rs`, `framing.rs`, `error.rs`, `lib.rs`, `bin/mcts-server.rs`, tests `echo_round_trip.rs` + `mcts_request.rs`.
- ✓ Gdext fallback path live in `api-gdext/src/ai.rs` (Request::SearchAction → service → in-process fallback).
- ✓ `tools/run-services.sh` provides `services:up` / `services:down` / `services:status`.
K/N now 6/6 in scope; the 3 remaining bullets are reassigned to p1-27a, the sibling objective that tracks them.
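The service-first, in-process-fallback flow (Request::SearchAction → service → in-process fallback) reduces to a small control-flow pattern. The sketch below is illustrative only: `SearchPath` and `search_with_fallback` are hypothetical names, and the two closures stand in for the client call and the local `Tree::simulate_parallel` call; only the log tags are taken from the acceptance notes.

```rust
/// Which path produced the answer; tags mirror the documented log tags.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SearchPath {
    Service,
    Local,
}

impl SearchPath {
    pub fn log_tag(self) -> &'static str {
        match self {
            SearchPath::Service => "mcts: service",
            SearchPath::Local => "mcts: local",
        }
    }
}

/// Try the service first; on any error, run the in-process search instead.
/// The caller gets the same result type either way, plus the path taken.
pub fn search_with_fallback<T, E>(
    service: impl FnOnce() -> Result<T, E>,
    local: impl FnOnce() -> T,
) -> (T, SearchPath) {
    match service() {
        Ok(result) => (result, SearchPath::Service),
        // Connection and protocol errors are treated alike: fall back.
        Err(_) => (local(), SearchPath::Local),
    }
}
```

The local closure is only invoked when the service call fails, so a healthy service never pays for the in-process search.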
Remaining work (2026-05-03)
Three unmet bullets. Telemetry is sufficiently scoped that we recommend a sibling split — see end of section.
Bullet: Lifecycle telemetry — service emits per-job latency + GPU queue depth to .local/iter/mcts-service-<stamp>.jsonl
- Files to touch (Rust SSoT):
  - `src/simulator/crates/mc-mcts-service/src/server.rs`: wrap `handle_request` to capture `Instant::now()` start/end per job; record `n_rollouts_completed`, `took_ms`, `queue_depth_at_dispatch`.
  - NEW `src/simulator/crates/mc-mcts-service/src/telemetry.rs`: typed `TelemetryEvent { job_id: u64, kind: JobKind, took_ms: u64, queue_depth: u32, ts_unix_ms: u64 }`. JSONL writer with `BufWriter<File>` + flush-on-drop.
  - `src/simulator/crates/mc-mcts-service/src/bin/mcts-server.rs`: accept `--telemetry-path` CLI flag (default `${MCTS_TELEMETRY_PATH:-.local/iter/mcts-service-<stamp>.jsonl}`).
  - No GDScript bridge changes; telemetry is server-internal.
- Dependencies: none; pure-additive on top of shipped server.
- Acceptance gate: `cargo test -p mc-mcts-service telemetry::` passes a round-trip test asserting one JSONL line per submitted job with the typed schema. Live smoke: run `mcts-server --telemetry-path /tmp/t.jsonl`, submit 5 jobs via `client::submit_mcts`, verify `wc -l /tmp/t.jsonl == 5` and each line parses as `TelemetryEvent`.
- SOLID/DRY/SSoT rails:
  - The telemetry job kind is a typed Rust enum (`JobKind { SearchAction, RolloutBatch, Echo }`), not a stringly-typed `HashMap<String, Value>`.
  - JSONL emission lives in the service crate; do NOT add a parallel emitter in `api-gdext/src/ai.rs` or `ai_turn_bridge.gd`.
  - No `cfg(feature = "telemetry")`; wire it in unconditionally. `--telemetry-path /dev/null` disables.
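A possible shape for the typed event and JSONL writer described in this bullet is sketched below. The struct fields and enum variants come from the bullet; everything else is an assumption: `TelemetryWriter`, `record`, `finish`, and `to_json_line` are hypothetical names, and the JSON is hand-formatted to keep the example dependency-free, where the real crate might well use a serialization library.

```rust
use std::io::{self, BufWriter, Write};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum JobKind {
    SearchAction,
    RolloutBatch,
    Echo,
}

impl JobKind {
    fn as_str(self) -> &'static str {
        match self {
            JobKind::SearchAction => "SearchAction",
            JobKind::RolloutBatch => "RolloutBatch",
            JobKind::Echo => "Echo",
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
pub struct TelemetryEvent {
    pub job_id: u64,
    pub kind: JobKind,
    pub took_ms: u64,
    pub queue_depth: u32,
    pub ts_unix_ms: u64,
}

impl TelemetryEvent {
    /// Hand-rolled JSON object (assumption: no serialization crate).
    /// All fields are numbers or fixed enum names, so no escaping is needed.
    pub fn to_json_line(&self) -> String {
        format!(
            "{{\"job_id\":{},\"kind\":\"{}\",\"took_ms\":{},\"queue_depth\":{},\"ts_unix_ms\":{}}}",
            self.job_id,
            self.kind.as_str(),
            self.took_ms,
            self.queue_depth,
            self.ts_unix_ms
        )
    }
}

/// Buffered JSONL sink: one event per line.
pub struct TelemetryWriter<W: Write> {
    out: BufWriter<W>,
}

impl<W: Write> TelemetryWriter<W> {
    pub fn new(sink: W) -> Self {
        TelemetryWriter { out: BufWriter::new(sink) }
    }

    pub fn record(&mut self, ev: &TelemetryEvent) -> io::Result<()> {
        writeln!(self.out, "{}", ev.to_json_line())
    }

    /// Flush explicitly and hand back the sink. `BufWriter` also flushes
    /// on drop, but that path silently ignores write errors.
    pub fn finish(mut self) -> io::Result<W> {
        self.out.flush()?;
        self.out.into_inner().map_err(|e| e.into_error())
    }
}
```

Keeping the writer generic over `Write` lets a unit test target a `Vec<u8>` while the server passes a `File`, which fits the one-line-per-job round-trip assertion in the acceptance gate.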
Bullet: Parity test — gpu_rollout_parity.rs runs against service path AND in-process path with byte-identical results
- Files to touch (Rust SSoT):
  - `src/simulator/crates/mc-mcts-service/tests/parity_via_service.rs`: NEW. Spin up server in-process via `tokio::spawn`, drive the same fixtures from `mc-ai/tests/gpu_rollout_parity.rs`, assert byte-equal `Vec<f32>` win-rates.
  - Refactor `mc-ai/tests/gpu_rollout_parity.rs` fixture builders into a shared `mc-ai/tests/common/parity_fixtures.rs` module (or move to `mc-ai/src/gpu/parity_fixtures.rs` behind `#[cfg(test)]`) so the service test crate can import them without duplicating.
- Dependencies: `mc-mcts-service` SearchAction path (✓ shipped, p1-27c).
- Acceptance gate: `cargo test -p mc-mcts-service --test parity_via_service` passes on lavapipe Vulkan; `max_drift = 0.000000` across the same 209 inputs (16+65+128) cited in p0-20.
- SOLID/DRY/SSoT rails:
  - Fixture builders live in ONE place (the `mc-ai` crate); the service test imports them, with no copy-paste of the 209-input data.
  - Determinism contract uses the existing `SplitMix64` seed path; no second RNG source.
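The difference between a byte-identical assert and a zero-drift assert is easy to miss, so a small sketch may help. The helpers `bit_identical`, `max_drift`, and `fake_win_rates` are hypothetical and are not the fixtures from `gpu_rollout_parity.rs`; the generator is the textbook SplitMix64 step, included only to produce deterministic stand-in data and not claimed to match the crate's own seed path.

```rust
/// Textbook SplitMix64, used here only to generate deterministic test data.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Stand-in for one path's output: `n` win rates in [0, 1) from one seed.
pub fn fake_win_rates(seed: u64, n: usize) -> Vec<f32> {
    let mut rng = SplitMix64::new(seed);
    (0..n)
        .map(|_| (rng.next_u64() >> 40) as f32 / (1u32 << 24) as f32)
        .collect()
}

/// Byte-identical comparison: compares raw bit patterns, so it separates
/// 0.0 from -0.0, which `==` on floats would treat as equal.
pub fn bit_identical(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}

/// Largest absolute element-wise difference over the common prefix.
pub fn max_drift(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}
```

A `max_drift` of zero is therefore the weaker check: two outputs can differ only in the sign of zero and still report no drift, while `bit_identical` rejects them.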
Bullet: Wire into huge-map-5clan.sh — confirm wall-clock cost regression in p1-22 improves with warm service
- Files to touch:
  - `tools/huge-map-5clan.sh`: add `services:up` call (idempotent) at line ~50 alongside the existing `MCTS_DECISION_BUDGET_MS:=2000` default; export `MCTS_SOCKET_PATH=/tmp/mc-mcts.sock` so the in-process bridge prefers the warm socket.
  - `tools/run-services.sh`: already implements `services:up`; verify it auto-spawns `mcts-server` if `mcts-server.pid` is stale.
  - No Rust changes; fallback path in `api-gdext/src/ai.rs:109-498` is already live.
- Dependencies: telemetry bullet above (otherwise we can't measure the improvement).
- Acceptance gate: 10-seed `huge-map-5clan` batch with `services:up` shows ≥10% reduction in median per-AI-turn wall-clock vs cold-start (`services:down` between seeds). Measured via the new `.local/iter/mcts-service-<stamp>.jsonl` telemetry.
- SOLID/DRY/SSoT rails:
  - Service start/stop is owned by `tools/run-services.sh`, not duplicated in autoplay scripts.
  - PID/log paths under `.local/run/` (not under `src/`).
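The acceptance gate (≥10% reduction in median per-AI-turn wall-clock, warm vs cold) can be written as a small check over `took_ms` samples. This is a sketch: `median_ms`, `relative_reduction`, and `gate_passes` are hypothetical helpers, and how samples are extracted from the telemetry JSONL is left out.

```rust
/// Median of millisecond samples; `None` when there are no samples.
pub fn median_ms(samples: &[u64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid] as f64
    } else {
        // Even count: average the two middle values.
        (sorted[mid - 1] + sorted[mid]) as f64 / 2.0
    })
}

/// Fractional improvement of warm over cold, e.g. 0.125 for 12.5%.
pub fn relative_reduction(cold_median: f64, warm_median: f64) -> f64 {
    (cold_median - warm_median) / cold_median
}

/// True when the warm median is at least 10% below the cold median.
/// Missing data on either side fails the gate rather than passing it.
pub fn gate_passes(cold: &[u64], warm: &[u64]) -> bool {
    match (median_ms(cold), median_ms(warm)) {
        (Some(c), Some(w)) if c > 0.0 => relative_reduction(c, w) >= 0.10,
        _ => false,
    }
}
```

For example, cold samples with a 2000 ms median against warm samples with a 1750 ms median give a 12.5% reduction and pass; using the median rather than the mean keeps one slow outlier turn from deciding the result.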
Recommended split — p1-27a-mcts-service-telemetry
The telemetry bullet has its own surface (typed event schema, JSONL writer, CLI plumbing, smoke test) and blocks the wiring bullet. Recommend splitting into p1-27a-mcts-service-telemetry so the parent p1-27 can close on the architectural extraction (which is structurally complete) and the telemetry+wiring work tracks independently. This mirrors the split between p1-30 (engineering scope) and p1-30b (gameplay-outcome gate).