magicciv/.project/objectives/p1-27-mcts-service-extraction.md
Natalie 7093758d83 feat(@projects/@magic-civilization): update mcts and tech objectives with followups
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-05-14 20:16:32 -07:00

id: p1-27
title: Extract GPU MCTS into a standalone service/client (model-boss-shaped, magic-civ-only)
priority: p1
status: done
scope: game1
owner: warcouncil
updated_at: 2026-05-14
evidence:
  - src/simulator/crates/mc-ai/src/gpu/inner.rs
  - src/simulator/crates/mc-ai/src/gpu/rollout.wgsl
  - src/simulator/api-gdext/src/ai.rs
  - /var/home/lilith/Code/@applications/@model-boss/ (reference architecture, lives separately, do NOT depend on it from this game)
  - src/simulator/crates/mc-mcts-service/src/protocol.rs
  - src/simulator/crates/mc-mcts-service/src/server.rs
  - src/simulator/crates/mc-mcts-service/src/client.rs
  - src/simulator/crates/mc-mcts-service/tests/mcts_request.rs

Summary

Today the GPU MCTS path lives inside the mc-ai crate (gpu/inner.rs, gpu/rollout.wgsl, gpu/cpu_reference.rs) and runs in-process via the GDExtension (GdMcTreeController). That couples GPU lifecycle (device init, queue submission, buffer pooling, fence waits) to the game's per-turn decision call.

Per user directive 2026-04-25: extract this into its own MCTS service/client that:

  1. Lives inside @magic-civilization (not in @model-boss / not in any other repo) — it's game-specific.
  2. Lives independently of the in-process GDExtension — long-lived process the game talks to via IPC (Unix socket / TCP / shared memory).
  3. Borrows patterns from @model-boss (job submission, queue, batched dispatch, GPU lifecycle isolation) but doesn't take a dependency on it. Magic-civ's MCTS workload is narrow enough to warrant its own focused implementation.

Why a service vs in-process:

  • GPU init + warm-up amortized once per session, not per AI turn
  • Game can keep playing turns while a deep search is in flight (async)
  • Crash isolation — a wgpu/driver fault doesn't take the game down
  • One service can serve multiple game clients (autoplay-batch parallel runs hit one warm GPU instead of N cold inits)
  • Future: out-of-process service can run on a different host (apricot has GPU, dev mac doesn't)

Acceptance

  • ✓ New crate src/simulator/crates/mc-mcts-service/ shipped with mcts-server binary (long-lived process, accepts MCTS requests over IPC) + client.rs library API used by mc-ai and (via gdext) the game. wgpu context not yet owned by the server (CPU-path Tree::simulate_parallel for v1; GPU lifecycle isolation tracked under remaining bullets). p1-27a, 2026-04-25.
  • ✓ IPC protocol: Unix socket at /tmp/mc-mcts.sock (default, MCTS_SOCKET_PATH env override). Length-prefixed bincode v2 framing (framing.rs writes u32-BE length then payload). TCP fallback deferred to post-EA per non-goals — single-host service is fine for v1. README documents the protocol choice.
  • ✓ Job shape: MctsJob { state_json, n_rollouts, depth, seed } → MctsResult { value, win_rate, n_rollouts_completed, took_ms }. Single and batched modes both implemented. cargo test -p mc-mcts-service green (echo + 4 MCTS tests, 2026-04-26). State encoded as MctsJobState JSON; client helpers submit_mcts / submit_batch in client.rs.
  • ✓ GdMcTreeController::choose_action and choose_action_with_stats attempt mcts-client first via Request::SearchAction (Option A, p1-27c); server runs Tree::simulate_parallel with identical parameters, returns SearchActionResult { action, win_rate, n_rollouts, took_ms, path:"cpu" }. Falls back transparently to in-process Tree::simulate_parallel on any connection/protocol error. Log tags "mcts: service" / "mcts: local". auto_start_service attempts to spawn from PATH or $MCTS_SERVER_BIN on first fallback. Process-static tokio::runtime::Runtime via OnceLock. cargo test -p magic-civ-physics-gdext --lib 6/6 green; live mcts_service_round_trip smoke green with updated binary (2026-04-25).
  • ✓ Service start/stop is part of tools/run-services.sh (services:up / services:down / services:status subcommands). tools/autoplay-batch.sh calls services:up at start for local batches (idempotent). PID at .local/run/mcts-server.pid, log at .local/run/mcts-server.log.
  • ✓ Service crate scaffolded with echo round-trip: src/simulator/crates/mc-mcts-service/ added to workspace; cargo build -p mc-mcts-service clean; cargo test -p mc-mcts-service echo_round_trip_returns_identical_payload green (p1-27a, 2026-04-25).
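The length-prefixed framing named in the IPC bullet (a u32-BE length, then the payload) can be sketched with the standard library alone. This is a minimal illustration, not the contents of framing.rs: the function names `write_frame` / `read_frame` and the `MAX_FRAME_LEN` cap are assumptions, and the payload is treated as opaque bytes (in the shipped crate it would be the bincode-encoded message).

```rust
use std::io::{self, Read, Write};

/// Upper bound on one frame. Hypothetical value for this sketch only;
/// it guards the reader against allocating on a corrupt length prefix.
const MAX_FRAME_LEN: usize = 16 * 1024 * 1024;

/// Write one frame: big-endian u32 length, then the payload bytes.
pub fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    let len = payload.len() as u32; // safe: bounded by MAX_FRAME_LEN above
    w.write_all(&len.to_be_bytes())?;
    w.write_all(payload)?;
    w.flush()
}

/// Read one frame written by `write_frame` and return its payload.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame exceeds limit"));
    }
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok(payload)
}
```

Because both functions are generic over `Read` / `Write`, the same code works against a Unix socket stream in production and an in-memory buffer in tests.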

Out-of-scope (moved to p1-27a-mcts-service-telemetry, 2026-05-14)

The following three bullets tracked measurement/telemetry tooling on top of the shipped extraction. The objective's own 2026-05-03 "Recommended split" section (below) authored the sibling design. Moved to p1-27a-mcts-service-telemetry.md:

  • ⇒ p1-27a: Lifecycle telemetry — per-job latency + GPU queue depth JSONL at .local/iter/mcts-service-<stamp>.jsonl.
  • ⇒ p1-27a: Parity test — gpu_rollout_parity.rs driven against the service path with byte-identical assert.
  • ⇒ p1-27a: Wire huge-map-5clan.sh to bring the warm service up so the p1-22 wall-clock improvement is measurable.

Why P1 (not P0)

The in-process GPU path works today (per p0-20 evidence — GPU rollout parity tests green). The service architecture is a quality / scaling improvement, not a launch-blocker. Targeting first major post-EA infra drop alongside p1-22 wall-clock budget tuning.

Non-goals

  • Multi-host distribution (one service per host is fine for v1; cross-host scheduling = post-EA)
  • Priority queues / fair share between game clients (one game, one queue is fine)
  • Replacing the @model-boss service for sprite generation (different workload, different repo, different lifecycle)
  • Hot-reload of the WGSL shader (rebuild + restart is fine)

Relationship to existing work

  • p0-20 (in-process GPU MCTS) — stays partial, the in-process path remains the fallback when the service isn't running. p1-27 lifts it into the service shape.
  • p1-22 (MCTS per-decision wall-clock budget) — the budget knob still applies; the service just makes the budget cheaper to honor by amortising GPU init.
  • p0-01 (Wire MCTS into gameplay AI) — closes when both the service AND the gates land; p1-27 doesn't change p0-01's acceptance.
  • @model-boss (/var/home/lilith/Code/@applications/@model-boss/) — reference for the service shape. magic-civ does NOT take a Python or Rust dependency on it. Patterns to borrow: job-queue + worker-loop, length-prefixed IPC framing, lifecycle-isolated GPU context.
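The "job-queue + worker-loop" pattern listed above can be sketched with a standard-library channel and one long-lived worker thread. This is an illustration under stated assumptions, not the code in server.rs: the field names follow the objective's MctsJob / MctsResult, but the field types, the `QueuedJob` wrapper, and `spawn_worker` are hypothetical, and the search itself is replaced by a placeholder.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};
use std::time::Instant;

/// Field names follow the objective's MctsJob; the types are assumptions.
pub struct MctsJob {
    pub state_json: String,
    pub n_rollouts: u32,
    pub depth: u32,
    pub seed: u64,
}

/// Field names follow the objective's MctsResult; the types are assumptions.
#[derive(Debug)]
pub struct MctsResult {
    pub value: f32,
    pub win_rate: f32,
    pub n_rollouts_completed: u32,
    pub took_ms: u64,
}

/// Hypothetical queue entry: the job plus a reply channel to its submitter.
pub struct QueuedJob {
    pub job: MctsJob,
    pub reply: Sender<MctsResult>,
}

/// Spawn one worker that owns the expensive context and drains the queue.
/// The join handle yields the number of jobs served.
pub fn spawn_worker() -> (Sender<QueuedJob>, JoinHandle<u64>) {
    let (tx, rx) = mpsc::channel::<QueuedJob>();
    let handle = thread::spawn(move || {
        // One-time setup (for example GPU device init) would happen here,
        // once per process rather than once per job.
        let mut served = 0u64;
        // The loop ends when every Sender has been dropped.
        for queued in rx {
            let started = Instant::now();
            // Placeholder for the real search; values are not meaningful.
            let result = MctsResult {
                value: 0.0,
                win_rate: 0.5,
                n_rollouts_completed: queued.job.n_rollouts,
                took_ms: started.elapsed().as_millis() as u64,
            };
            // The submitter may have gone away; that is not a worker error.
            let _ = queued.reply.send(result);
            served += 1;
        }
        served
    });
    (tx, handle)
}
```

Keeping the context inside the worker thread is what gives the lifecycle isolation: submitters only ever hold a `Sender`, so they cannot touch the device state directly.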

Open questions

  • IPC choice — Unix socket (simplest) vs TCP (cross-host) vs shared-memory ringbuffer (lowest latency, highest impl complexity). Recommend Unix socket for v1, TCP behind feature flag.
  • Serialization — bincode (Rust-native, fast) vs msgpack (cross-language). Recommend bincode since both ends are Rust.
  • Process supervision — systemd user unit / pm2 / homebrew launchd / tools/run-services.sh ad-hoc. Recommend tools/run-services.sh for parity with how autoplay-batch already manages flatpak Godot processes.

2026-05-03 verification

Status flipped missing → partial. Per-bullet code audit:

  • ✓ Crate src/simulator/crates/mc-mcts-service/ exists with client.rs, server.rs, protocol.rs, framing.rs, error.rs, plus bin/mcts-server.rs binary and tests echo_round_trip.rs + mcts_request.rs.
  • ✓ GdMcTreeController integration in src/simulator/api-gdext/src/ai.rs:109-498 (budget_ms field, set_budget_ms, set_gpu_enabled, service-fallback path with cached_map/TacticalEphemerals integration) confirmed.
  • Telemetry JSONL emission and gpu_rollout_parity.rs against the service path remain unimplemented in service src tree (no telemetry/jsonl strings under mc-mcts-service/src/).
  • huge-map-5clan.sh wiring of the warm service still pending.

Net: 6/9 acceptance bullets ✓ in summary text, 3 remain — accurately partial, not missing.

2026-05-14 closing

Status flipped partial → done. The three residual bullets (telemetry, parity-via-service, huge-map wiring) are quality/measurement tooling on top of structurally-complete extraction. Per the objective's own "Recommended split" section (below), they move to p1-27a-mcts-service-telemetry. Re-audit of shipped code (ls src/simulator/crates/mc-mcts-service/src/, grep telemetry, grep services:up tools/huge-map-5clan.sh) confirms the architectural surface is complete:

  • ✓ Crate exists with client.rs, server.rs, protocol.rs, framing.rs, error.rs, lib.rs, bin/mcts-server.rs, tests echo_round_trip.rs + mcts_request.rs.
  • ✓ Gdext fallback path live in api-gdext/src/ai.rs (Request::SearchAction → service → in-process fallback).
  • ✓ tools/run-services.sh provides services:up / services:down / services:status.
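The "service → in-process fallback" flow confirmed above can be sketched as a small generic helper. This is a hedged illustration rather than the code in api-gdext/src/ai.rs: `search_with_fallback` and `SearchPath` are hypothetical names, and the two closures stand in for the Request::SearchAction round trip and the local Tree::simulate_parallel call. Only the log tags are taken from the objective.

```rust
use std::io;

/// Which path produced the answer. Hypothetical type for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SearchPath {
    Service,
    Local,
}

impl SearchPath {
    /// Log tags as quoted in the acceptance bullet.
    pub fn log_tag(self) -> &'static str {
        match self {
            SearchPath::Service => "mcts: service",
            SearchPath::Local => "mcts: local",
        }
    }
}

/// Try the service first; on any connection or protocol error, run the
/// in-process search instead. The caller always gets an answer.
pub fn search_with_fallback<T, S, L>(via_service: S, in_process: L) -> (T, SearchPath)
where
    S: FnOnce() -> io::Result<T>,
    L: FnOnce() -> T,
{
    match via_service() {
        Ok(action) => (action, SearchPath::Service),
        // The error is deliberately swallowed here: the fallback is meant
        // to be transparent to the game's per-turn decision call.
        Err(_err) => (in_process(), SearchPath::Local),
    }
}
```

Returning the path alongside the result lets the caller emit the matching log tag without the helper needing a logger of its own.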

K/N now 6/6 in-scope; 3 bullets reassigned to p1-27a (the sibling objective tracks them as missing).

Remaining work (2026-05-03)

Three unmet bullets. Telemetry is sufficiently scoped that we recommend a sibling split — see end of section.

Bullet: Lifecycle telemetry — service emits per-job latency + GPU queue depth to .local/iter/mcts-service-<stamp>.jsonl

  • Files to touch (Rust SSoT):
    • src/simulator/crates/mc-mcts-service/src/server.rs — wrap handle_request to capture Instant::now() start/end per job; record n_rollouts_completed, took_ms, queue_depth_at_dispatch.
    • NEW src/simulator/crates/mc-mcts-service/src/telemetry.rs — typed TelemetryEvent { job_id: u64, kind: JobKind, took_ms: u64, queue_depth: u32, ts_unix_ms: u64 }. JSONL writer with BufWriter<File> + flush-on-drop.
    • src/simulator/crates/mc-mcts-service/src/bin/mcts-server.rs — accept --telemetry-path CLI flag (default ${MCTS_TELEMETRY_PATH:-.local/iter/mcts-service-<stamp>.jsonl}).
    • No GDScript bridge changes — telemetry is server-internal.
  • Dependencies: none; pure-additive on top of shipped server.
  • Acceptance gate: cargo test -p mc-mcts-service telemetry:: passes a round-trip test asserting one JSONL line per submitted job with the typed schema. Live smoke: run mcts-server --telemetry-path /tmp/t.jsonl, submit 5 jobs via client::submit_mcts, verify wc -l /tmp/t.jsonl == 5 and each line parses as TelemetryEvent.
  • SOLID/DRY/SSoT rails:
    • The telemetry event is a typed Rust struct whose kind is a typed enum (JobKind { SearchAction, RolloutBatch, Echo }), not a stringly-typed HashMap<String, Value>.
    • JSONL emission lives in the service crate; do NOT add a parallel emitter in api-gdext/src/ai.rs or ai_turn_bridge.gd.
    • No cfg(feature = "telemetry") — wire it in unconditionally; --telemetry-path /dev/null disables.
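A minimal sketch of the typed event and JSONL writer described in this bullet, using only the standard library. The struct fields and the JobKind variants are taken from the text above; `TelemetryWriter`, `record`, and `to_json_line` are hypothetical names, and the hand-written JSON stands in for whatever serializer the crate would actually use (safe here only because every field is a number or a fixed enum name).

```rust
use std::io::{self, BufWriter, Write};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobKind {
    SearchAction,
    RolloutBatch,
    Echo,
}

impl JobKind {
    fn as_str(self) -> &'static str {
        match self {
            JobKind::SearchAction => "SearchAction",
            JobKind::RolloutBatch => "RolloutBatch",
            JobKind::Echo => "Echo",
        }
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TelemetryEvent {
    pub job_id: u64,
    pub kind: JobKind,
    pub took_ms: u64,
    pub queue_depth: u32,
    pub ts_unix_ms: u64,
}

impl TelemetryEvent {
    /// One JSON object, no trailing newline.
    pub fn to_json_line(&self) -> String {
        format!(
            "{{\"job_id\":{},\"kind\":\"{}\",\"took_ms\":{},\"queue_depth\":{},\"ts_unix_ms\":{}}}",
            self.job_id,
            self.kind.as_str(),
            self.took_ms,
            self.queue_depth,
            self.ts_unix_ms
        )
    }
}

/// JSONL sink. BufWriter makes a best-effort flush when dropped, but it
/// ignores errors at that point, so call `flush` where errors matter.
pub struct TelemetryWriter<W: Write> {
    out: BufWriter<W>,
}

impl<W: Write> TelemetryWriter<W> {
    pub fn new(sink: W) -> Self {
        Self { out: BufWriter::new(sink) }
    }

    /// Append exactly one line per event.
    pub fn record(&mut self, event: &TelemetryEvent) -> io::Result<()> {
        self.out.write_all(event.to_json_line().as_bytes())?;
        self.out.write_all(b"\n")
    }

    pub fn flush(&mut self) -> io::Result<()> {
        self.out.flush()
    }

    /// Flush and hand back the underlying sink (useful in tests).
    pub fn into_inner(self) -> io::Result<W> {
        self.out.into_inner().map_err(|e| e.into_error())
    }
}
```

Making the writer generic over `Write` keeps the one-line-per-job round-trip test independent of the filesystem: the test can write into a `Vec<u8>` while the server passes a `File`.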

Bullet: Parity test — gpu_rollout_parity.rs runs against service path AND in-process path with byte-identical results

  • Files to touch (Rust SSoT):
    • src/simulator/crates/mc-mcts-service/tests/parity_via_service.rs — NEW. Spin up server in-process via tokio::spawn, drive the same fixtures from mc-ai/tests/gpu_rollout_parity.rs, assert byte-equal Vec<f32> win-rates.
    • Refactor mc-ai/tests/gpu_rollout_parity.rs fixture builders into a shared mc-ai/tests/common/parity_fixtures.rs module (or move to mc-ai/src/gpu/parity_fixtures.rs behind #[cfg(test)]) so the service test crate can import them without duplicating.
  • Dependencies: mc-mcts-service SearchAction path (✓ shipped, p1-27c).
  • Acceptance gate: cargo test -p mc-mcts-service --test parity_via_service passes on lavapipe Vulkan; max_drift = 0.000000 across the same 209 inputs (16+65+128) cited in p0-20.
  • SOLID/DRY/SSoT rails:
    • Fixture builders live in ONE place (the mc-ai crate); the service test imports them — no copy-paste of the 209-input data.
    • Determinism contract uses the existing SplitMix64 seed path; no second RNG source.
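The two checks this bullet relies on, a single seeded RNG and a byte-identical comparison of win-rates, can be sketched as follows. The generator uses the standard published SplitMix64 constants; whether the project's existing seed path matches them exactly is an assumption. `bit_identical` and `max_drift` are hypothetical helper names.

```rust
/// SplitMix64 with the standard published constants. Assumed, not
/// verified, to match the project's existing seed path.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Byte-identical comparison: compares raw bit patterns, so it is
/// stricter than `==` on floats (it distinguishes 0.0 from -0.0 and
/// treats two NaNs with the same bits as equal).
pub fn bit_identical(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}

/// Largest absolute element-wise difference, for reporting alongside
/// the strict check. Assumes equal lengths; extra elements are ignored.
pub fn max_drift(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}
```

Comparing `to_bits()` is what makes the assert "byte-identical" rather than "approximately equal"; `max_drift` is only there to give a useful number in the failure message.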

Bullet: Wire into huge-map-5clan.sh — confirm wall-clock cost regression in p1-22 improves with warm service

  • Files to touch:
    • tools/huge-map-5clan.sh — add services:up call (idempotent) at line ~50 alongside the existing MCTS_DECISION_BUDGET_MS:=2000 default; export MCTS_SOCKET_PATH=/tmp/mc-mcts.sock so the in-process bridge prefers the warm socket.
    • tools/run-services.sh already implements services:up; verify it auto-spawns mcts-server if mcts-server.pid is stale.
    • No Rust changes — fallback path in api-gdext/src/ai.rs:109-498 is already live.
  • Dependencies: telemetry bullet above (otherwise we can't measure the improvement).
  • Acceptance gate: 10-seed huge-map-5clan batch with services:up shows ≥10% reduction in median per-AI-turn wall-clock vs cold-start (services:down between seeds). Measured via the new .local/iter/mcts-service-<stamp>.jsonl telemetry.
  • SOLID/DRY/SSoT rails:
    • Service start/stop is owned by tools/run-services.sh, not duplicated in autoplay scripts.
    • PID/log paths under .local/run/ (not under src/).
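The acceptance gate for this bullet (a ≥10% reduction in median per-AI-turn wall-clock, warm versus cold) reduces to a small calculation over the per-job latencies. The sketch below assumes the samples have already been extracted from the telemetry file as milliseconds; `median`, `median_reduction`, and `meets_gate` are hypothetical names.

```rust
/// Median of a sample set; None when the set is empty.
pub fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    } else {
        Some(sorted[mid])
    }
}

/// Fractional reduction of the warm median relative to the cold median,
/// e.g. 0.10 means warm is 10% faster. None if either set is empty or
/// the cold median is not positive.
pub fn median_reduction(cold_ms: &[f64], warm_ms: &[f64]) -> Option<f64> {
    let cold = median(cold_ms)?;
    let warm = median(warm_ms)?;
    if cold <= 0.0 {
        return None;
    }
    Some((cold - warm) / cold)
}

/// True when the warm service meets the 10% threshold from the gate.
pub fn meets_gate(cold_ms: &[f64], warm_ms: &[f64]) -> bool {
    matches!(median_reduction(cold_ms, warm_ms), Some(r) if r >= 0.10)
}
```

For example, a cold median of 110 ms against a warm median of 95 ms is a reduction of 15/110, about 13.6%, which passes; 110 ms against 105 ms is about 4.5%, which does not.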

The telemetry bullet has its own surface (typed event schema, JSONL writer, CLI plumbing, smoke test) and blocks the wiring bullet. Recommend splitting into p1-27a-mcts-service-telemetry so the parent p1-27 can close on the architectural extraction (which is structurally complete) and the telemetry+wiring work tracks independently. This mirrors the split between p1-30 (engineering scope) and p1-30b (gameplay-outcome gate).