feat(@projects): ✨ add async batch protocol docs

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-05-05 14:06:40 -04:00 · 2026-05-05 14:06:40 -04:00 · 2e331d2b07
commit 2e331d2b07
parent fbaca9de95
1 changed files with 49 additions and 0 deletions
--- a/.project/objectives/p2-64-apricot-async-batch-protocol.md
+++ b/.project/objectives/p2-64-apricot-async-batch-protocol.md
@@ -0,0 +1,49 @@
+---
+id: p2-64
+title: Apricot async batch protocol — launch / status / fetch decoupling
+priority: p2
+status: stub
+scope: game1
+category: infra
+owner: simulator-infra
+created: 2026-05-05
+updated_at: 2026-05-05
+blocked_by: []
+follow_ups: []
+---
+
+## Context
+
+Today `scripts/apricot-run.sh` runs a single synchronous flow: fetch → worktree → build → batch → fetch verdict → cleanup. The orchestration runs on plum and SSHes to apricot multiple times. When apricot connectivity drops mid-run (intermittent network, sshd channel saturation, sleep/wake), the local script aborts before fetching results, even though the apricot-side Godot processes keep running and write their results to `~/.cache/mc-batches/<stamp>/` regardless.
+
+This couples the job lifecycle to a live SSH session and forces every wake to run expensive SSH probes. With intermittent connectivity, the orchestration is lost even when no work was lost.
+
+## Acceptance
+
+- ❌ `scripts/apricot-run.sh launch <mode> <args>` — fires the orchestration entirely on apricot via `systemd-run --user --unit=mc-batch-<stamp> --collect`. Returns immediately with `STAMP=<value>` on stdout (one line, scriptable). The systemd unit owns build + batch lifecycle; survives SSH disconnects.
+- ❌ `scripts/apricot-run.sh status <stamp>` — single short SSH probe (`ConnectTimeout=5`), structured stdout: `{"state":"running|complete|failed|unreachable","seeds_done":N,"seeds_total":M,"completion_marker":bool}`. Tolerates SSH timeouts (returns `unreachable` on probe failure).
+- ❌ `scripts/apricot-run.sh fetch <stamp>` — `rsync -a --partial` pulls `~/.cache/mc-batches/<stamp>/` to `.local/iter/<stamp>/`. Resumable. Exits 1 if batch isn't complete yet (so callers can retry).
+- ❌ Existing synchronous modes (`smoke`, `huge-map-5clan`, `ai-quality-baseline-pre-c`, etc.) keep working — `launch` is a new sub-mode that wraps them, not a replacement. Backwards-compat for callers that DO want to block.
+- ❌ Documentation in `scripts/apricot-run.sh` header + a short example snippet in `.claude/instructions/canonical-commands.md` showing the launch/status/fetch loop.
+- ❌ `mc-batch-<stamp>.service` systemd-unit template lives at `scripts/dev-setup/mc-batch.service.in` (or inline in apricot-run.sh) — instantiated per-stamp via `systemd-run --user --unit=...`, with `KillMode=mixed` + `TimeoutStopSec=10s` for clean shutdown.
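The caller side of the launch/status/fetch loop described above could look roughly like the following. This is a minimal sketch, not the script's actual implementation: `parse_state`, `wait_for_batch`, the `APRICOT_RUN` variable, and the polling defaults are hypothetical names and values chosen for illustration. It assumes only the one-line status JSON and the `STAMP=<value>` output listed in the acceptance items.

```shell
#!/usr/bin/env bash
# Caller-side polling sketch. All helper names here are hypothetical.

# Path to the protocol script; overridable so the loop can be tested with a stub.
APRICOT_RUN="${APRICOT_RUN:-scripts/apricot-run.sh}"

# Extract the "state" field from the one-line status JSON.
# sed keeps the sketch dependency-free; a real script might prefer jq.
parse_state() {
  printf '%s\n' "$1" | sed -n 's/.*"state":"\([a-z]*\)".*/\1/p'
}

# Poll `status` until the batch completes or fails.
# "running", "unreachable", and empty output all mean "try again later":
# the batch keeps going on the remote host whether or not we can reach it.
# Args: <stamp> [interval_seconds] [max_polls]
# Returns: 0 = complete, 1 = failed, 2 = gave up polling.
wait_for_batch() {
  local stamp="$1" interval="${2:-60}" max_polls="${3:-120}"
  local polls=0 state
  while [ "$polls" -lt "$max_polls" ]; do
    state="$(parse_state "$("$APRICOT_RUN" status "$stamp")")"
    case "$state" in
      complete) return 0 ;;
      failed)   return 1 ;;
    esac
    sleep "$interval"
    polls=$((polls + 1))
  done
  return 2
}

# Intended usage (not executed here):
#   stamp="$("$APRICOT_RUN" launch smoke | sed -n 's/^STAMP=//p')"
#   wait_for_batch "$stamp" && "$APRICOT_RUN" fetch "$stamp"
```

Treating `unreachable` the same as `running` is the point of the decoupling: a failed probe says nothing about the batch itself, so the caller just waits and asks again.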
+
+## Source-of-truth rails
+
+- **Bash crate**: `scripts/apricot-run.sh` owns the protocol. No GDScript dependency.
+- **systemd unit**: `--user` scope (per-user lifecycle), `--collect` (unloads the unit once it finishes, even if it failed, so no `reset-failed` cleanup is needed), `--unit=mc-batch-<stamp>` (named for status query).
+- **Result location**: existing `~/.cache/mc-batches/<stamp>/` — no new path.
+- **Status output**: JSON, machine-readable. No prose.
+- **No backwards-compat shim** for the old single-call pattern: it keeps working unchanged alongside `launch`/`status`/`fetch`, and the caller picks.
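The rails above can be illustrated with a small apricot-side sketch: one helper that assembles the `systemd-run` invocation and one that prints the status JSON. The function names, the `DRY_RUN` switch, and the example stamp and runner path are hypothetical. The sketch also assumes the user manager on apricot outlives the SSH session (for example, lingering enabled via `loginctl enable-linger`), which the objective does not state.

```shell
#!/usr/bin/env bash
# Launch/status helper sketch. Helper names and DRY_RUN are hypothetical.

# Start a batch as a transient user unit named mc-batch-<stamp>.
# With DRY_RUN=1 the command line is printed instead of executed.
# Args: <stamp> <command> [args...]
launch_batch() {
  local stamp="$1"
  shift
  # Rebuild the positional parameters as the full systemd-run command line.
  set -- systemd-run --user --collect \
    --unit="mc-batch-${stamp}" \
    --property=KillMode=mixed \
    --property=TimeoutStopSec=10s \
    "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    # Send systemd-run's own output to stderr so stdout carries only the
    # single scriptable STAMP line.
    "$@" >&2 && printf 'STAMP=%s\n' "$stamp"
  fi
}

# Print the one-line, machine-readable status JSON.
# Args: <state> <seeds_done> <seeds_total> <completion_marker: true|false>
emit_status() {
  printf '{"state":"%s","seeds_done":%d,"seeds_total":%d,"completion_marker":%s}\n' \
    "$1" "$2" "$3" "$4"
}

# Example (dry run, hypothetical stamp and runner):
#   DRY_RUN=1 launch_batch 20260505-140640 ./run-batch.sh smoke
#   emit_status running 3 10 false
```

Because `--collect` unloads the unit even after a failure, a status probe built this way would likely need to read the completion marker and seed counts from the result directory rather than rely on the unit still being loaded.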
+
+## Out of scope
+
+- Cross-host queueing (only one batch can run at a time on apricot today; that's fine).
+- Retry/resume of the build phase (if apricot reboots mid-build, the unit fails and a new `launch` is required).
+- Result collection across multiple stamps in one `fetch` call (fetching stays one batch per call).
+
+## References
+
+- `scripts/apricot-run.sh` — current synchronous flow.
+- `.claude/instructions/canonical-commands.md` — apricot batch invocation patterns.
+- `.claude/instructions/two-host-workflow.md` — EDIT vs RUN host discipline.
+- p1-22, p2-44, p1-38 — recent objectives that hit batch-orchestration friction.