feat(@projects): add async batch protocol docs

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
Natalie 2026-05-05 14:06:40 -04:00
parent fbaca9de95
commit 2e331d2b07


@@ -0,0 +1,49 @@
---
id: p2-64
title: Apricot async batch protocol — launch / status / fetch decoupling
priority: p2
status: stub
scope: game1
category: infra
owner: simulator-infra
created: 2026-05-05
updated_at: 2026-05-05
blocked_by: []
follow_ups: []
---
## Context
Today `scripts/apricot-run.sh` runs a single synchronous flow: fetch → worktree → build → batch → fetch verdict → cleanup. The orchestration runs on plum and SSHes to apricot multiple times. When apricot connectivity drops mid-run (intermittent network, sshd channel saturation, sleep/wake), the local script aborts before fetching results — even though the apricot-side godot processes continue and write results to `.cache/mc-batches/<stamp>/` regardless.
This couples the job lifecycle to a live SSH session and forces every wake to run expensive SSH probes. With intermittent connectivity, the orchestration is lost even when no work was.
## Acceptance
- ❌ `scripts/apricot-run.sh launch <mode> <args>` — fires the orchestration entirely on apricot via `systemd-run --user --unit=mc-batch-<stamp> --collect`. Returns immediately with `STAMP=<value>` on stdout (one line, scriptable). The systemd unit owns build + batch lifecycle; survives SSH disconnects.
- ❌ `scripts/apricot-run.sh status <stamp>` — single short SSH probe (`ConnectTimeout=5`), structured stdout: `{"state":"running|complete|failed|unreachable","seeds_done":N,"seeds_total":M,"completion_marker":bool}`. Tolerates SSH timeouts (returns `unreachable` on probe failure).
- ❌ `scripts/apricot-run.sh fetch <stamp>`: `rsync -a --partial` pulls `~/.cache/mc-batches/<stamp>/` to `.local/iter/<stamp>/`. Resumable. Exits 1 if the batch isn't complete yet (so callers can retry).
- ❌ Existing synchronous modes (`smoke`, `huge-map-5clan`, `ai-quality-baseline-pre-c`, etc.) keep working — `launch` is a new sub-mode that wraps them, not a replacement. Backwards-compat for callers that DO want to block.
- ❌ Documentation in `scripts/apricot-run.sh` header + a short example snippet in `.claude/instructions/canonical-commands.md` showing the launch/status/fetch loop.
- ❌ `mc-batch-<stamp>.service` systemd-unit template lives at `scripts/dev-setup/mc-batch.service.in` (or inline in apricot-run.sh) — instantiated per-stamp via `systemd-run --user --unit=...`, with `KillMode=mixed` + `TimeoutStopSec=10s` for clean shutdown.
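The launch/status/fetch loop described in the list above could look roughly like this from the caller's side. This is a minimal sketch, assuming the three sub-commands behave exactly as specified; none of them exist yet. The helper names (`status_state`, `wait_for_batch`, `run_batch`), the `APRICOT_RUN` and `POLL_INTERVAL` variables, and the poll limits are hypothetical and chosen only for illustration.

```shell
# Caller-side sketch of the launch / status / fetch loop (assumed behaviour).
# APRICOT_RUN is overridable so the loop can be exercised against a stub.
APRICOT_RUN="${APRICOT_RUN:-scripts/apricot-run.sh}"
POLL_INTERVAL="${POLL_INTERVAL:-60}"   # seconds between probes (hypothetical default)

# Read one-line status JSON on stdin and print the value of "state".
# Uses sed so the sketch has no jq dependency.
status_state() {
  sed -n 's/.*"state":"\([a-z]*\)".*/\1/p'
}

# Poll until the batch reaches a terminal state.
# Returns 0 on complete, 1 on failed, 2 once max_polls probes are used up.
# "running" and "unreachable" both mean: wait and probe again.
wait_for_batch() {
  local stamp="$1" max_polls="${2:-120}" i=0 state
  while [ "$i" -lt "$max_polls" ]; do
    state="$("$APRICOT_RUN" status "$stamp" | status_state)"
    case "$state" in
      complete) return 0 ;;
      failed)   return 1 ;;
    esac
    sleep "$POLL_INTERVAL"
    i=$((i + 1))
  done
  return 2
}

# Launch a batch, wait for it, then fetch the results.
# launch is expected to print a single STAMP=<value> line on stdout.
run_batch() {
  local out stamp
  out="$("$APRICOT_RUN" launch "$@")" || return 1
  case "$out" in
    STAMP=?*) stamp="${out#STAMP=}" ;;
    *)        return 1 ;;            # unexpected launch output
  esac
  wait_for_batch "$stamp" || return $?
  "$APRICOT_RUN" fetch "$stamp"
}
```

A caller that wants to block would run `run_batch smoke`. A caller that wakes intermittently would instead store the stamp from `launch` and call `status` once per wake. Treating `unreachable` the same as `running` is the point of the decoupling: a dropped probe costs one poll interval, not the batch.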
## Source-of-truth rails
- **Bash crate**: `scripts/apricot-run.sh` owns the protocol. No GDScript dependency.
- **systemd unit**: `--user` scope (per-user lifecycle), `--collect` (unit is unloaded once it finishes, even if it failed), `--unit=mc-batch-<stamp>` (named for status query).
- **Result location**: existing `~/.cache/mc-batches/<stamp>/` — no new path.
- **Status output**: JSON, machine-readable. No prose.
- **No backwards-compat shim** is needed for the old single-call pattern: both flows work, and the caller picks.
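As a rough illustration of these rails, the sketch below assembles the `systemd-run` argument list and emits the one-line status JSON. It assumes the `KillMode=mixed` and `TimeoutStopSec=10s` settings from the Acceptance list are passed as `--property=` flags. The function names (`launch_argv`, `status_json`, `probe_or_unreachable`) are hypothetical, and reporting zero seed counts for `unreachable` is an assumption, since the spec does not say what those fields hold when the probe fails.

```shell
# Print the systemd-run argv for one stamp, one token per line.
# Everything after the stamp is the payload command; what that command is
# (build + batch driver) is not specified here and is left to the caller.
launch_argv() {
  local stamp="$1"
  shift
  printf '%s\n' systemd-run --user --collect \
    "--unit=mc-batch-${stamp}" \
    --property=KillMode=mixed \
    --property=TimeoutStopSec=10s \
    -- "$@"
}

# Emit the status line. Arguments:
#   $1 state (running|complete|failed|unreachable)
#   $2 seeds_done  $3 seeds_total  $4 completion_marker (true|false)
status_json() {
  printf '{"state":"%s","seeds_done":%d,"seeds_total":%d,"completion_marker":%s}\n' \
    "$1" "$2" "$3" "$4"
}

# Run a probe command and pass its output through on success.
# On any failure (for example an SSH timeout) report "unreachable" instead.
# Zero seed counts here are an assumption, not part of the spec.
probe_or_unreachable() {
  local out
  if out="$("$@" 2>/dev/null)"; then
    printf '%s\n' "$out"
  else
    status_json unreachable 0 0 false
  fi
}
```

In the real script the probe passed to `probe_or_unreachable` would presumably be something like `ssh -o ConnectTimeout=5 -o BatchMode=yes apricot <remote status command>`, where the remote command is still to be defined. Because `--collect` unloads the unit even after a failure, a `failed` state may need to be derived from the result directory and not from `systemctl --user` alone.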
## Out of scope
- Cross-host queueing (only one batch can run at a time on apricot today; that's fine).
- Retry/resume of the build phase (if apricot reboots mid-build, the unit fails and a new `launch` is required).
- Result-collection across multiple stamps in one fetch call (batch by batch).
## References
- `scripts/apricot-run.sh` — current synchronous flow.
- `.claude/instructions/canonical-commands.md` — apricot batch invocation patterns.
- `.claude/instructions/two-host-workflow.md` — EDIT vs RUN host discipline.
- p1-22, p2-44, p1-38 — recent objectives that hit batch-orchestration friction.