magicciv/.project/objectives/p1-27-mcts-service-extraction.md at 0349a4e8fd9ec0b036f0862e324b76bb20a26e5a

Natalie 7093758d83 feat(@projects/@magic-civilization): ✨ update mcts and tech objectives with followups

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>

2026-05-14 20:16:32 -07:00


title

priority

status

scope

owner

updated_at

evidence

p1-27

Extract GPU MCTS into a standalone service/client (model-boss-shaped, magic-civ-only)

done

game1

warcouncil

2026-05-14

src/simulator/crates/mc-ai/src/gpu/inner.rs

src/simulator/crates/mc-ai/src/gpu/rollout.wgsl

src/simulator/api-gdext/src/ai.rs

/var/home/lilith/Code/@applications/@model-boss/ (reference architecture, lives separately, do NOT depend on it from this game)

src/simulator/crates/mc-mcts-service/src/protocol.rs

src/simulator/crates/mc-mcts-service/src/server.rs

src/simulator/crates/mc-mcts-service/src/client.rs

src/simulator/crates/mc-mcts-service/tests/mcts_request.rs

Summary

Today the GPU MCTS path lives inside the mc-ai crate (gpu/inner.rs, gpu/rollout.wgsl, gpu/cpu_reference.rs) and runs in-process via the GDExtension (GdMcTreeController). That couples GPU lifecycle (device init, queue submission, buffer pooling, fence waits) to the game's per-turn decision call.

Per user directive 2026-04-25: extract this into its own MCTS service/client that

Lives inside @magic-civilization (not in @model-boss / not in any other repo) — it's game-specific.
Lives independently of the in-process GDExtension — long-lived process the game talks to via IPC (Unix socket / TCP / shared memory).
Borrows patterns from @model-boss (job submission, queue, batched dispatch, GPU lifecycle isolation) but doesn't take a dependency on it. Magic-civ's MCTS workload is narrow enough to warrant its own focused implementation.

Why a service vs in-process:

GPU init + warm-up amortized once per session, not per AI turn
Game can keep playing turns while a deep search is in flight (async)
Crash isolation — a wgpu/driver fault doesn't take the game down
One service can serve multiple game clients (autoplay-batch parallel runs hit one warm GPU instead of N cold inits)
Future: out-of-process service can run on a different host (apricot has GPU, dev mac doesn't)

Acceptance

✓ New crate src/simulator/crates/mc-mcts-service/ shipped with mcts-server binary (long-lived process, accepts MCTS requests over IPC) + client.rs library API used by mc-ai and (via gdext) the game. wgpu context not yet owned by the server (CPU-path Tree::simulate_parallel for v1; GPU lifecycle isolation tracked under remaining bullets). p1-27a, 2026-04-25.
✓ IPC protocol: Unix socket at /tmp/mc-mcts.sock (default, MCTS_SOCKET_PATH env override). Length-prefixed bincode v2 framing (framing.rs writes u32-BE length then payload). TCP fallback deferred to post-EA per non-goals — single-host service is fine for v1. README documents the protocol choice.
✓ Job shape: MctsJob { state_json, n_rollouts, depth, seed } → MctsResult { value, win_rate, n_rollouts_completed, took_ms }. Single and batched modes both implemented. cargo test -p mc-mcts-service green (echo + 4 MCTS tests, 2026-04-26). State encoded as MctsJobState JSON; client helpers submit_mcts / submit_batch in client.rs.
✓ GdMcTreeController::choose_action and choose_action_with_stats attempt mcts-client first via Request::SearchAction (Option A, p1-27c); server runs Tree::simulate_parallel with identical parameters, returns SearchActionResult { action, win_rate, n_rollouts, took_ms, path:"cpu" }. Falls back transparently to in-process Tree::simulate_parallel on any connection/protocol error. Log tags "mcts: service" / "mcts: local". auto_start_service attempts to spawn from PATH or $MCTS_SERVER_BIN on first fallback. Process-static tokio::runtime::Runtime via OnceLock. cargo test -p magic-civ-physics-gdext --lib 6/6 green; live mcts_service_round_trip smoke green with updated binary (2026-04-25).
✓ Service start/stop is part of tools/run-services.sh (services:up / services:down / services:status subcommands). tools/autoplay-batch.sh calls services:up at start for local batches (idempotent). PID at .local/run/mcts-server.pid, log at .local/run/mcts-server.log.
✓ Service crate scaffolded with echo round-trip: src/simulator/crates/mc-mcts-service/ added to workspace; cargo build -p mc-mcts-service clean; cargo test -p mc-mcts-service echo_round_trip_returns_identical_payload green (p1-27a, 2026-04-25).
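
The framing described above (u32-BE length, then payload) can be sketched with the standard library alone. This is a minimal illustration, not the contents of framing.rs: the function names, the frame-size cap, and the use of blocking `std::io` traits are assumptions, and the payload is treated as opaque bytes rather than bincode output.

```rust
use std::io::{self, Read, Write};

/// Illustrative upper bound on one frame; this value is an assumption,
/// not taken from the crate.
const MAX_FRAME_LEN: usize = 16 * 1024 * 1024;

/// Write one frame: a big-endian u32 payload length, then the payload bytes.
fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    let len = payload.len() as u32; // safe: bounded by MAX_FRAME_LEN above
    w.write_all(&len.to_be_bytes())?;
    w.write_all(payload)?;
    w.flush()
}

/// Read one frame produced by `write_frame`.
/// A short read (peer closed mid-frame) surfaces as `UnexpectedEof`.
fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > MAX_FRAME_LEN {
        // Reject before allocating so a corrupt header cannot exhaust memory.
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok(payload)
}
```

Checking the declared length before allocating is the main design point: the reader never trusts a header it has not bounded.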

Out-of-scope (moved to p1-27a-mcts-service-telemetry, 2026-05-14)

The following three bullets track measurement and telemetry tooling layered on top of the shipped extraction. The objective's own 2026-05-03 "Recommended split" section (below) proposed the sibling objective; the bullets now live in p1-27a-mcts-service-telemetry.md:

⇒ p1-27a: Lifecycle telemetry — per-job latency + GPU queue depth JSONL at .local/iter/mcts-service-<stamp>.jsonl.
⇒ p1-27a: Parity test — gpu_rollout_parity.rs driven against the service path with byte-identical assert.
⇒ p1-27a: Wire huge-map-5clan.sh to bring the warm service up so the p1-22 wall-clock improvement is measurable.

Why P1 (not P0)

The in-process GPU path works today (per p0-20 evidence — GPU rollout parity tests green). The service architecture is a quality / scaling improvement, not a launch-blocker. Targeting first major post-EA infra drop alongside p1-22 wall-clock budget tuning.

Non-goals

Multi-host distribution (one service per host is fine for v1; cross-host scheduling = post-EA)
Priority queues / fair share between game clients (one game, one queue is fine)
Replacing the @model-boss service for sprite generation (different workload, different repo, different lifecycle)
Hot-reload of the WGSL shader (rebuild + restart is fine)

Relationship to existing work

p0-20 (in-process GPU MCTS) — stays partial, the in-process path remains the fallback when the service isn't running. p1-27 lifts it into the service shape.
p1-22 (MCTS per-decision wall-clock budget) — the budget knob still applies; the service just makes the budget cheaper to honor by amortising GPU init.
p0-01 (Wire MCTS into gameplay AI) — closes when both the service AND the gates land; p1-27 doesn't change p0-01's acceptance.
@model-boss (/var/home/lilith/Code/@applications/@model-boss/) — reference for the service shape. magic-civ does NOT take a Python or Rust dependency on it. Patterns to borrow: job-queue + worker-loop, length-prefixed IPC framing, lifecycle-isolated GPU context.
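
The "job-queue + worker-loop" and "lifecycle-isolated GPU context" patterns listed above could look roughly like the sketch below. It assumes a single worker thread and std channels; `SearchContext`, `spawn_worker`, and `submit` are hypothetical names, the field types on `MctsJob` / `MctsResult` are guesses (only the field names come from this objective), and the "search" is a placeholder, not a real MCTS.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;
use std::time::Instant;

/// Field names follow the objective; the types are assumptions.
struct MctsJob { state_json: String, n_rollouts: u32, depth: u32, seed: u64 }

#[derive(Debug)]
struct MctsResult { value: f32, win_rate: f32, n_rollouts_completed: u32, took_ms: u64 }

/// Hypothetical stand-in for the expensive, long-lived search/GPU context.
struct SearchContext { jobs_served: u64 }

impl SearchContext {
    fn init() -> Self { SearchContext { jobs_served: 0 } }

    /// Placeholder "search": a deterministic function of the job fields.
    fn run(&mut self, job: &MctsJob) -> MctsResult {
        let start = Instant::now();
        self.jobs_served += 1;
        let mix = job.seed
            .wrapping_add(job.depth as u64)
            .wrapping_add(job.state_json.len() as u64);
        let win_rate = (mix % 101) as f32 / 100.0;
        MctsResult {
            value: win_rate * 2.0 - 1.0,
            win_rate,
            n_rollouts_completed: job.n_rollouts,
            took_ms: start.elapsed().as_millis() as u64,
        }
    }
}

/// A queued job plus the channel on which its result is returned.
type Envelope = (MctsJob, Sender<MctsResult>);

/// Spawn the worker. The context is created once, inside the worker thread,
/// and reused for every job; the thread returns how many jobs it served.
fn spawn_worker() -> (Sender<Envelope>, thread::JoinHandle<u64>) {
    let (tx, rx) = channel::<Envelope>();
    let handle = thread::spawn(move || {
        let mut ctx = SearchContext::init(); // paid once per service lifetime
        for (job, reply) in rx {
            // A failed send means the client went away; drop the result.
            let _ = reply.send(ctx.run(&job));
        }
        ctx.jobs_served
    });
    (tx, handle)
}

/// Submit one job and block until its result arrives.
/// Returns None if the worker has shut down.
fn submit(queue: &Sender<Envelope>, job: MctsJob) -> Option<MctsResult> {
    let (reply_tx, reply_rx) = channel();
    queue.send((job, reply_tx)).ok()?;
    reply_rx.recv().ok()
}
```

Because only the worker thread ever touches `SearchContext`, clients never observe its lifecycle; dropping the last `Sender` ends the loop and lets the context tear down in one place.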

Open questions

IPC choice — Unix socket (simplest) vs TCP (cross-host) vs shared-memory ringbuffer (lowest latency, highest impl complexity). Recommend Unix socket for v1, TCP behind feature flag.
Serialization — bincode (Rust-native, fast) vs msgpack (cross-language). Recommend bincode since both ends are Rust.
Process supervision — systemd user unit / pm2 / homebrew launchd / tools/run-services.sh ad-hoc. Recommend tools/run-services.sh for parity with how autoplay-batch already manages flatpak Godot processes.
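
For the Unix-socket option recommended above, the endpoint documented under Acceptance (default /tmp/mc-mcts.sock, MCTS_SOCKET_PATH override) might be resolved as sketched here. The function names are hypothetical, and treating an empty or whitespace-only override as "unset" is an assumption rather than documented behaviour.

```rust
use std::env;
use std::path::PathBuf;

const DEFAULT_SOCKET_PATH: &str = "/tmp/mc-mcts.sock";
const SOCKET_PATH_ENV: &str = "MCTS_SOCKET_PATH";

/// Pure resolution step, split out so it can be tested without
/// mutating the process environment.
/// Assumption: an empty or whitespace-only override falls back to the default.
fn resolve_socket_path(override_value: Option<&str>) -> PathBuf {
    match override_value.map(str::trim) {
        Some(v) if !v.is_empty() => PathBuf::from(v),
        _ => PathBuf::from(DEFAULT_SOCKET_PATH),
    }
}

/// Read the override from the environment and resolve it.
fn socket_path_from_env() -> PathBuf {
    let value = env::var(SOCKET_PATH_ENV).ok();
    resolve_socket_path(value.as_deref())
}
```

Client and server would both call `socket_path_from_env()` so the two ends cannot disagree about where the socket lives.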

2026-05-03 verification

Status flipped missing → partial. Per-bullet code audit:

✓ Crate src/simulator/crates/mc-mcts-service/ exists with client.rs, server.rs, protocol.rs, framing.rs, error.rs, plus bin/mcts-server.rs binary and tests echo_round_trip.rs + mcts_request.rs.
✓ GdMcTreeController integration in src/simulator/api-gdext/src/ai.rs:109-498 — budget_ms field, set_budget_ms, set_gpu_enabled, service-fallback path with cached_map/TacticalEphemerals integration confirmed.
❌ Telemetry JSONL emission and ❌ gpu_rollout_parity.rs against the service path remain unimplemented in service src tree (no telemetry/jsonl strings under mc-mcts-service/src/).
❌ huge-map-5clan.sh wiring of the warm service still pending.

Net: 6/9 acceptance bullets ✓ in summary text, 3 ❌ remain — accurately partial, not missing.

2026-05-14 closing

Status flipped partial → done. The three residual bullets (telemetry, parity-via-service, huge-map wiring) are quality/measurement tooling on top of structurally-complete extraction. Per the objective's own "Recommended split" section (below), they move to p1-27a-mcts-service-telemetry. Re-audit of shipped code (ls src/simulator/crates/mc-mcts-service/src/, grep telemetry, grep services:up tools/huge-map-5clan.sh) confirms architectural surface is complete:

✓ Crate exists with client.rs, server.rs, protocol.rs, framing.rs, error.rs, lib.rs, bin/mcts-server.rs, tests echo_round_trip.rs + mcts_request.rs.
✓ Gdext fallback path live in api-gdext/src/ai.rs (Request::SearchAction → service → in-process fallback).
✓ tools/run-services.sh provides services:up / services:down / services:status.
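
The service-first, in-process-fallback flow confirmed above can be illustrated in a few lines. This is a simplified sketch, not the code in api-gdext/src/ai.rs: closures stand in for the real client call and for Tree::simulate_parallel, `ServiceError` and `Answered` are hypothetical types, and the field types on `SearchActionResult` are assumptions (only the field names and the "cpu" path value come from this objective).

```rust
/// Field names follow the objective; the types are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct SearchActionResult {
    action: u32,
    win_rate: f32,
    n_rollouts: u32,
    took_ms: u64,
    path: String,
}

/// Hypothetical error type covering connection and protocol failures.
#[derive(Debug)]
enum ServiceError {
    Unavailable,
    Protocol(String),
}

/// Which side produced the answer; mirrors the two log tags.
#[derive(Debug, PartialEq)]
enum Answered {
    Service,
    Local,
}

impl Answered {
    fn log_tag(&self) -> &'static str {
        match self {
            Answered::Service => "mcts: service",
            Answered::Local => "mcts: local",
        }
    }
}

/// Try the service first; on any error, run the local search instead.
/// The local closure is only invoked when the service attempt fails.
fn choose_with_fallback<S, L>(service: S, local: L) -> (SearchActionResult, Answered)
where
    S: FnOnce() -> Result<SearchActionResult, ServiceError>,
    L: FnOnce() -> SearchActionResult,
{
    match service() {
        Ok(result) => (result, Answered::Service),
        Err(err) => {
            eprintln!("{} (service error: {:?})", Answered::Local.log_tag(), err);
            (local(), Answered::Local)
        }
    }
}
```

Returning the `Answered` tag alongside the result keeps the "which path ran" signal testable instead of leaving it only in the log.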

K/N now 6/6 in-scope; the 3 remaining bullets are reassigned to p1-27a (the sibling objective tracks them as missing).

Remaining work (2026-05-03)

Three unmet bullets. Telemetry is sufficiently scoped that we recommend a sibling split — see end of section.

Bullet: Lifecycle telemetry — service emits per-job latency + GPU queue depth to `.local/iter/mcts-service-<stamp>.jsonl`

Files to touch (Rust SSoT):
- src/simulator/crates/mc-mcts-service/src/server.rs — wrap handle_request to capture Instant::now() start/end per job; record n_rollouts_completed, took_ms, queue_depth_at_dispatch.
- NEW src/simulator/crates/mc-mcts-service/src/telemetry.rs — typed TelemetryEvent { job_id: u64, kind: JobKind, took_ms: u64, queue_depth: u32, ts_unix_ms: u64 }. JSONL writer with BufWriter<File> + flush-on-drop.
- src/simulator/crates/mc-mcts-service/src/bin/mcts-server.rs — accept --telemetry-path CLI flag (default ${MCTS_TELEMETRY_PATH:-.local/iter/mcts-service-<stamp>.jsonl}).
- No GDScript bridge changes — telemetry is server-internal.
Dependencies: none; pure-additive on top of shipped server.
Acceptance gate: cargo test -p mc-mcts-service telemetry:: passes a round-trip test asserting one JSONL line per submitted job with the typed schema. Live smoke: run mcts-server --telemetry-path /tmp/t.jsonl, submit 5 jobs via client::submit_mcts, verify wc -l /tmp/t.jsonl == 5 and each line parses as TelemetryEvent.
SOLID/DRY/SSoT rails:
- Telemetry struct is a typed Rust enum (JobKind { SearchAction, RolloutBatch, Echo }), not stringly-typed HashMap<String, Value>.
- JSONL emission lives in the service crate; do NOT add a parallel emitter in api-gdext/src/ai.rs or ai_turn_bridge.gd.
- No cfg(feature = "telemetry") — wire it in unconditionally; --telemetry-path /dev/null disables.
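
A dependency-free sketch of the typed event and JSONL writer proposed above. The struct and enum fields follow the bullet; everything else is an assumption: the snake_case `kind` strings, the `TelemetryWriter` name and its `record` / `finish` methods, and the hand-rolled JSON (a real implementation would more plausibly derive serde traits).

```rust
use std::io::{self, BufWriter, Write};

#[derive(Debug, Clone, Copy, PartialEq)]
enum JobKind {
    SearchAction,
    RolloutBatch,
    Echo,
}

impl JobKind {
    /// Wire name for the JSONL `kind` field (naming is an assumption).
    fn as_str(self) -> &'static str {
        match self {
            JobKind::SearchAction => "search_action",
            JobKind::RolloutBatch => "rollout_batch",
            JobKind::Echo => "echo",
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
struct TelemetryEvent {
    job_id: u64,
    kind: JobKind,
    took_ms: u64,
    queue_depth: u32,
    ts_unix_ms: u64,
}

impl TelemetryEvent {
    /// Every field is a number or a fixed identifier, so no string
    /// escaping is needed in this hand-written encoder.
    fn to_json_line(&self) -> String {
        format!(
            "{{\"job_id\":{},\"kind\":\"{}\",\"took_ms\":{},\"queue_depth\":{},\"ts_unix_ms\":{}}}",
            self.job_id, self.kind.as_str(), self.took_ms, self.queue_depth, self.ts_unix_ms
        )
    }
}

/// One JSON object per line. BufWriter also flushes when dropped,
/// but `finish` surfaces flush errors that a drop would swallow.
struct TelemetryWriter<W: Write> {
    out: BufWriter<W>,
}

impl<W: Write> TelemetryWriter<W> {
    fn new(sink: W) -> Self {
        TelemetryWriter { out: BufWriter::new(sink) }
    }

    fn record(&mut self, event: &TelemetryEvent) -> io::Result<()> {
        writeln!(self.out, "{}", event.to_json_line())
    }

    /// Flush and hand back the underlying sink.
    fn finish(mut self) -> io::Result<W> {
        self.out.flush()?;
        self.out.into_inner().map_err(io::Error::from)
    }
}
```

Writing to any `W: Write` means the round-trip test can target a `Vec<u8>` while the server passes a `File`.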

Bullet: Parity test — `gpu_rollout_parity.rs` runs against service path AND in-process path with byte-identical results

Files to touch (Rust SSoT):
- src/simulator/crates/mc-mcts-service/tests/parity_via_service.rs — NEW. Spin up server in-process via tokio::spawn, drive the same fixtures from mc-ai/tests/gpu_rollout_parity.rs, assert byte-equal Vec<f32> win-rates.
- Refactor mc-ai/tests/gpu_rollout_parity.rs fixture builders into a shared mc-ai/tests/common/parity_fixtures.rs module (or move to mc-ai/src/gpu/parity_fixtures.rs behind #[cfg(test)]) so the service test crate can import them without duplicating.
Dependencies: mc-mcts-service SearchAction path (✓ shipped, p1-27c).
Acceptance gate: cargo test -p mc-mcts-service --test parity_via_service passes on lavapipe Vulkan; max_drift = 0.000000 across the same 209 inputs (16+65+128) cited in p0-20.
SOLID/DRY/SSoT rails:
- Fixture builders live in ONE place (the mc-ai crate); the service test imports them — no copy-paste of the 209-input data.
- Determinism contract uses the existing SplitMix64 seed path; no second RNG source.
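
What "byte-equal Vec<f32>" means in practice is sketched below. The comparison helpers are hypothetical names; the SplitMix64 step uses the standard published constants but is only a stand-in for the project's existing seed path, and `simulated_win_rates` is a placeholder for the two rollout paths under test.

```rust
/// Standard SplitMix64 step; a stand-in for the project's seed path.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f32 in [0, 1) built from the top 24 bits.
    fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u64 << 24) as f32
    }
}

/// Placeholder for "run the rollouts": same seed must give the same output.
fn simulated_win_rates(seed: u64, n: usize) -> Vec<f32> {
    let mut rng = SplitMix64::new(seed);
    (0..n).map(|_| rng.next_f32()).collect()
}

/// Index of the first element whose bit pattern differs, or None if the
/// slices are bit-identical. A length mismatch is reported at the shorter
/// slice's length.
fn first_bit_mismatch(a: &[f32], b: &[f32]) -> Option<usize> {
    if a.len() != b.len() {
        return Some(a.len().min(b.len()));
    }
    a.iter().zip(b).position(|(x, y)| x.to_bits() != y.to_bits())
}

/// Largest absolute difference over the common prefix of both slices.
fn max_drift(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0f32, f32::max)
}
```

Comparing `to_bits()` is stricter than `==`: `0.0` and `-0.0` compare equal and show zero drift, yet differ at the byte level, so a byte-identical assert should use the bitwise check and keep `max_drift` as the human-readable summary.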

Bullet: Wire into `huge-map-5clan.sh` — confirm wall-clock cost regression in p1-22 improves with warm service

Files to touch:
- tools/huge-map-5clan.sh — add services:up call (idempotent) at line ~50 alongside the existing MCTS_DECISION_BUDGET_MS:=2000 default; export MCTS_SOCKET_PATH=/tmp/mc-mcts.sock so the in-process bridge prefers the warm socket.
- tools/run-services.sh — already implements services:up; verify that it auto-spawns mcts-server when mcts-server.pid is stale.
- No Rust changes — fallback path in api-gdext/src/ai.rs:109-498 is already live.
Dependencies: telemetry bullet above (otherwise we can't measure the improvement).
Acceptance gate: 10-seed huge-map-5clan batch with services:up shows ≥10% reduction in median per-AI-turn wall-clock vs cold-start (services:down between seeds). Measured via the new .local/iter/mcts-service-<stamp>.jsonl telemetry.
SOLID/DRY/SSoT rails:
- Service start/stop is owned by tools/run-services.sh, not duplicated in autoplay scripts.
- PID/log paths under .local/run/ (not under src/).
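
The ≥10% acceptance gate above reduces to a median comparison between the cold-start and warm-service samples. A minimal sketch, assuming the per-AI-turn wall-clock values have already been extracted from the telemetry JSONL as milliseconds; the function names are hypothetical and the threshold is passed in rather than hard-coded.

```rust
/// Median of a sample; None for an empty slice.
/// Panics if a NaN is present, since ordering is then undefined.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut v = samples.to_vec();
    v.sort_by(|a, b| a.partial_cmp(b).expect("NaN in samples"));
    let mid = v.len() / 2;
    Some(if v.len() % 2 == 0 {
        (v[mid - 1] + v[mid]) / 2.0
    } else {
        v[mid]
    })
}

/// Fractional reduction of the warm median relative to the cold median,
/// e.g. 0.125 for a 12.5% improvement. None if either sample is empty
/// or the cold median is not positive.
fn median_reduction(cold_ms: &[f64], warm_ms: &[f64]) -> Option<f64> {
    let cold = median(cold_ms)?;
    let warm = median(warm_ms)?;
    if cold <= 0.0 {
        return None;
    }
    Some((cold - warm) / cold)
}

/// True when the warm service meets the required reduction (0.10 for 10%).
fn gate_passes(cold_ms: &[f64], warm_ms: &[f64], min_reduction: f64) -> bool {
    median_reduction(cold_ms, warm_ms).map_or(false, |r| r >= min_reduction)
}
```

With made-up numbers (not measurements), a cold median of 2000 ms against a warm median of 1750 ms gives (2000 - 1750) / 2000 = 0.125, which clears a 0.10 gate; a warm median of 1900 ms gives 0.05 and does not.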

Recommended split — `p1-27a-mcts-service-telemetry`

The telemetry bullet has its own surface (typed event schema, JSONL writer, CLI plumbing, smoke test) and blocks the wiring bullet. Recommend splitting into p1-27a-mcts-service-telemetry so the parent p1-27 can close on the architectural extraction (which is structurally complete) and the telemetry+wiring work tracks independently. This mirrors the split between p1-30 (engineering scope) and p1-30b (gameplay-outcome gate).
