feat(dist): build-artifact Space — publish/fetch/sync fetch-or-build + RL model sharing

Build the linux .so/wasm once on a worker and let sim/test/AI runners fetch the
prebuilt artifact (keyed by git sha) instead of recompiling, so N workers share
one build. Adds the magicciv-artifacts DO Space, rclone in the golden image, and:

- dist:publish  build + upload builds/<sha>/{.so,wasm}
- dist:fetch    download the prebuilt .so for HEAD's sha
- dist:sync     git pull -> fetch prebuilt if published, else build
- dist:models   share RL .onnx via the Space (push/pull/ls)

Complements sccache (compile cache) by caching final outputs. Creds via
RCLONE_S3_* env over ssh, never on worker disk/argv; degrades to build-on-worker
when creds/cache absent.

Also hardens the dispatch layer (pre-existing, affected test/build/render too):

- pass -i ~/.ssh/id_mc_fleet on dispatch ssh (don't rely on agent-loaded key)
- guard _dist_first_host against an empty / "fleet down" inventory
- drop ssh -n on heredoc-stdin verbs (it redirected stdin from /dev/null)

Proven end-to-end on DO: publish built a 43.9MB .so + wasm; dist:sync fetched it
in 2.8s (no rebuild).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

parent b3c80b677d
commit 88bdc4210a
3 changed files with 124 additions and 18 deletions

@@ -23,7 +23,7 @@ cloud-init status --wait >/dev/null 2>&1 || true
 apt-get -o DPkg::Lock::Timeout=600 update -y
 apt-get -o DPkg::Lock::Timeout=600 install -y --no-install-recommends \
   git curl ca-certificates build-essential pkg-config libssl-dev \
-  unzip sudo python3-pip flatpak rsync \
+  unzip sudo python3-pip flatpak rsync rclone \
   weston libgl1-mesa-dri libegl1 libgles2 libwayland-egl1 \
   mesa-vulkan-drivers vulkan-tools
 # So every worker can render proof scenes (opengl3/gl_compatibility) under a

@@ -58,7 +58,10 @@ Distributed test/train fleet (DigitalOcean). Set TF_VAR_do_token first.
   ./run dist:train <total_steps> [--destroy-after]
   ./run dist:test  cargo test --workspace on a worker
   ./run dist:build  cargo build + wasm on a worker (wasm rsync'd back)
-  ./run dist:sync [ref]  git pull + rebuild gdext on live workers
+  ./run dist:publish  build once → upload .so/wasm to the artifact Space (keyed by sha)
+  ./run dist:fetch  download the prebuilt .so for HEAD's sha (skip recompile)
+  ./run dist:sync [ref]  git pull → fetch prebuilt .so if published, else build
+  ./run dist:models {push <src> <name>|pull <name> <dest>|ls}  share RL models via the Space
   ./run dist:render <res://scene.tscn> <out.png>  render a proof scene (software weston, no GPU) → png
   ./run dist:down
 EOF
@@ -257,29 +260,41 @@ cmd_dist_train() {
 # already carry the toolchain (golden image) + repo (cloud-init git pull).
 
 _dist_first_host() {
-  local inv
+  local inv h
   inv="$(_dist_repo_root)/.local/fleet/inventory"
   [ -f "$inv" ] || return 1
-  _dist_read_hosts "$inv" | head -1
+  h="$(_dist_read_hosts "$inv" | head -1)"
+  [ -n "$h" ] || return 1 # inventory present but no live host (e.g. "fleet is down")
+  printf '%s\n' "$h"
 }
 
 cmd_dist_sync() {
-  # Pull the given ref on every live worker + rebuild the GDExtension, so a
-  # mid-session code change reaches the fleet without an image rebuild.
+  # Pull the given ref on every live worker, then make the GDExtension current:
+  # fetch the prebuilt .so for that sha from the artifact Space if it exists
+  # (seconds), else build it. So a mid-session code change reaches the fleet
+  # without an image rebuild, and N workers share one published build.
   local ref="${1:-main}"
-  local root inv host
+  local root inv host senv
   root="$(_dist_repo_root)"
   inv="$root/.local/fleet/inventory"
   [ -f "$inv" ] || { echo "no fleet — run ./run dist:up <N> first" >&2; return 1; }
+  senv="$(_dist_spaces_env 2>/dev/null || true)" # empty → workers just build
   local pids=() p fail=0
   while IFS= read -r host; do
-    echo "[$host] sync → $ref"
-    ssh -n -o BatchMode=yes -o StrictHostKeyChecking=accept-new "$host" "
-      set -e
-      cd ~/Code/@projects/@magic-civilization
-      git fetch --depth=1 origin '$ref' && git reset --hard FETCH_HEAD
-      cd src/simulator && . ~/.cargo/env && bash build-gdext.sh
-    " &
+    echo "[$host] sync → $ref (fetch prebuilt .so, else build)"
+    ssh -o BatchMode=yes -o StrictHostKeyChecking=accept-new -i "$HOME/.ssh/id_mc_fleet" "$host" \
+      "$senv SPACE='$_DIST_SPACE' SO_PATH='$_DIST_SO_PATH' REF='$ref' bash -s" <<'REMOTE' &
+set -e
+cd ~/Code/@projects/@magic-civilization
+git fetch --depth=1 origin "$REF" && git reset --hard FETCH_HEAD
+SHA=$(git rev-parse HEAD)
+. ~/.cargo/env
+if [ -n "${RCLONE_S3_ACCESS_KEY_ID:-}" ] && rclone copyto ":s3:$SPACE/builds/$SHA/libmagic_civ_physics.x86_64.so" "$SO_PATH" 2>/dev/null; then
+  echo " [$SHA] fetched prebuilt .so (no rebuild)"
+else
+  ( cd src/simulator && bash build-gdext.sh ) && echo " [$SHA] built .so (cache miss)"
+fi
+REMOTE
     pids+=($!)
   done < <(_dist_read_hosts "$inv")
   for p in "${pids[@]}"; do wait "$p" || fail=$(( fail + 1 )); done
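
The fan-out at the end of this hunk (one background job per host, each PID recorded, then a `wait` per PID that counts failures) can be sketched in isolation. This is a minimal illustration rather than the project's code: `fake_worker` is a hypothetical stand-in for the per-host `ssh` call, the host names are made up, and a plain string replaces the bash array so the sketch also runs under POSIX `sh`.

```shell
# Hypothetical stand-in for the per-host ssh call: fails for hosts named "bad*".
fake_worker() {
  case "$1" in
    bad*) return 1 ;;
    *)    return 0 ;;
  esac
}

pids=""
fail=0
for host in worker-1 bad-worker worker-3; do
  fake_worker "$host" &   # run each host's job in the background
  pids="$pids $!"         # remember its PID so its exit status can be collected
done

# `wait <pid>` returns that job's exit status, so failures are counted per host.
for p in $pids; do
  wait "$p" || fail=$(( fail + 1 ))
done

echo "failed hosts: $fail"
```

Waiting on each PID individually, instead of a bare `wait`, is what lets the caller report how many hosts failed rather than only whether the last one did.
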
@@ -292,7 +307,7 @@ cmd_dist_test() {
   host="$(_dist_first_host)" || { echo "no fleet — run ./run dist:up 1 c-8 first" >&2; return 1; }
   repo="Code/@projects/@magic-civilization"
   echo "running cargo tests on $host ..."
-  ssh -n -o BatchMode=yes -o StrictHostKeyChecking=accept-new "$host" "
+  ssh -n -o BatchMode=yes -o StrictHostKeyChecking=accept-new -i "$HOME/.ssh/id_mc_fleet" "$host" "
     set -e
     cd ~/$repo/src/simulator && . ~/.cargo/env
     if command -v cargo-nextest >/dev/null 2>&1; then cargo nextest run --workspace; else cargo test --workspace; fi
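
Callers such as `cmd_dist_test` rely on `_dist_first_host` returning nonzero when there is no host. The sketch below shows why the unguarded `... | head -1` form cannot signal that: a pipeline's exit status is that of its last command, and `head` succeeds on empty input. Everything here is hypothetical scaffolding: `read_hosts` merely stands in for `_dist_read_hosts` (whose real parsing is not shown in this diff), and the inventory content is invented.

```shell
# Hypothetical stand-in for _dist_read_hosts: print non-comment, non-blank lines.
read_hosts() {
  sed -e '/^#/d' -e '/^$/d' "$1"
}

# Unguarded: exit status comes from head, which is 0 even when nothing is printed.
first_host_unguarded() {
  read_hosts "$1" | head -n 1
}

# Guarded: capture the output and fail explicitly when it is empty.
first_host_guarded() {
  h="$(read_hosts "$1" | head -n 1)"
  [ -n "$h" ] || return 1
  printf '%s\n' "$h"
}

inv="$(mktemp)"
printf '# fleet is down\n' > "$inv"   # inventory file exists but lists no hosts

first_host_unguarded "$inv" >/dev/null && unguarded_rc=0 || unguarded_rc=$?
first_host_guarded "$inv" >/dev/null && guarded_rc=0 || guarded_rc=$?
echo "unguarded rc=$unguarded_rc, guarded rc=$guarded_rc"
rm -f "$inv"
```

With the unguarded form, `host="$(...)" || { ...; return 1; }` would continue with an empty `$host`; the guarded form makes the `||` branch fire.
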
@@ -308,7 +323,7 @@ cmd_dist_build() {
   root="$(_dist_repo_root)"
   repo="Code/@projects/@magic-civilization"
   echo "building workspace + wasm on $host ..."
-  ssh -n -o BatchMode=yes -o StrictHostKeyChecking=accept-new "$host" "
+  ssh -n -o BatchMode=yes -o StrictHostKeyChecking=accept-new -i "$HOME/.ssh/id_mc_fleet" "$host" "
     set -e
     cd ~/$repo/src/simulator && . ~/.cargo/env
     cargo build --workspace
@@ -332,3 +347,74 @@ cmd_dist_render() {
   PROJECT_ROOT_REMOTE="/home/${user}/Code/@projects/@magic-civilization" \
     bash "$(_dist_repo_root)/tools/capture-proof.sh" "$scene" "$out" "${3:-180}"
 }
+
+# ── build-artifact Space (magicciv-artifacts on DO Spaces) ───────────────────
+# Build once, publish the linux .so/wasm keyed by git sha; sim/test/AI runners
+# fetch the prebuilt artifact instead of recompiling. Creds: ~/.vault/do-spaces-uvlava.*
+_DIST_SPACE="magicciv-artifacts"
+_DIST_SO_PATH="src/game/engine/addons/magic_civ_physics/libmagic_civ_physics.x86_64.so"
+
+# Emit an `RCLONE_S3_* ...` env-prefix string (DO Spaces creds from the vault) for
+# embedding in a remote ssh command. Empty (rc 1) if the keys are missing.
+_dist_spaces_env() {
+  local ak sk
+  ak="$(cat ~/.vault/do-spaces-uvlava.access 2>/dev/null)"
+  sk="$(cat ~/.vault/do-spaces-uvlava.secret 2>/dev/null)"
+  [ -n "$ak" ] && [ -n "$sk" ] || return 1
+  printf "RCLONE_S3_PROVIDER=DigitalOcean RCLONE_S3_ENDPOINT=nyc3.digitaloceanspaces.com RCLONE_S3_ACCESS_KEY_ID='%s' RCLONE_S3_SECRET_ACCESS_KEY='%s'" "$ak" "$sk"
+}
+
+cmd_dist_publish() {
+  # On a worker: build gdext + wasm, upload to magicciv-artifacts/builds/<sha>/.
+  local host senv
+  host="$(_dist_first_host)" || { echo "no fleet — ./run dist:up 1 first" >&2; return 1; }
+  senv="$(_dist_spaces_env)" || { echo "no DO Spaces creds in ~/.vault/do-spaces-uvlava.*" >&2; return 1; }
+  echo "building + publishing artifacts on $host ..."
+  ssh -o BatchMode=yes -o StrictHostKeyChecking=accept-new -i "$HOME/.ssh/id_mc_fleet" "$host" \
+    "$senv SO_PATH='$_DIST_SO_PATH' SPACE='$_DIST_SPACE' bash -s" <<'REMOTE'
+set -e
+cd ~/Code/@projects/@magic-civilization
+SHA=$(git rev-parse HEAD)
+. ~/.cargo/env
+( cd src/simulator && bash build-gdext.sh && bash build-wasm.sh )
+rclone copyto "$SO_PATH" ":s3:$SPACE/builds/$SHA/libmagic_civ_physics.x86_64.so"
+[ -d .local/build/wasm ] && rclone copy .local/build/wasm ":s3:$SPACE/builds/$SHA/wasm/" || true
+printf 'sha=%s\nbuilt=%s\n' "$SHA" "$(date -u +%FT%TZ)" | rclone rcat ":s3:$SPACE/builds/$SHA/meta.txt"
+echo "published builds/$SHA/ (.so + wasm)"
+REMOTE
+}
+
+cmd_dist_fetch() {
+  # On a worker: fetch the prebuilt .so for the worker's HEAD sha into the addon
+  # path instead of recompiling. Nonzero on a cache miss.
+  local host senv
+  host="$(_dist_first_host)" || { echo "no fleet — ./run dist:up 1 first" >&2; return 1; }
+  senv="$(_dist_spaces_env)" || { echo "no DO Spaces creds" >&2; return 1; }
+  ssh -o BatchMode=yes -o StrictHostKeyChecking=accept-new -i "$HOME/.ssh/id_mc_fleet" "$host" \
+    "$senv SO_PATH='$_DIST_SO_PATH' SPACE='$_DIST_SPACE' bash -s" <<'REMOTE'
+set -e
+cd ~/Code/@projects/@magic-civilization
+SHA=$(git rev-parse HEAD)
+if rclone copyto ":s3:$SPACE/builds/$SHA/libmagic_civ_physics.x86_64.so" "$SO_PATH" 2>/dev/null; then
+  echo "FETCHED prebuilt .so for $SHA"
+else
+  echo "MISS: no prebuilt .so for $SHA — run ./run dist:publish"; exit 3
+fi
+REMOTE
+}
+
+cmd_dist_models() {
+  # Share RL model artifacts via the Space (runs on plum; models are platform-independent).
+  # ./run dist:models push <src-dir-or-file> <name>   ./run dist:models pull <name> <dest>   ./run dist:models ls
+  local sub="${1:-}" ak sk
+  ak="$(cat ~/.vault/do-spaces-uvlava.access 2>/dev/null)"; sk="$(cat ~/.vault/do-spaces-uvlava.secret 2>/dev/null)"
+  [ -n "$ak" ] && [ -n "$sk" ] || { echo "no DO Spaces creds in ~/.vault/do-spaces-uvlava.*" >&2; return 1; }
+  export RCLONE_S3_PROVIDER=DigitalOcean RCLONE_S3_ENDPOINT=nyc3.digitaloceanspaces.com
+  export RCLONE_S3_ACCESS_KEY_ID="$ak" RCLONE_S3_SECRET_ACCESS_KEY="$sk"
+  case "$sub" in
+    push) [ -n "${2:-}" ] && [ -n "${3:-}" ] || { echo "usage: ./run dist:models push <src> <name>" >&2; return 1; }; rclone copy "$2" ":s3:$_DIST_SPACE/models/$3/" -P ;;
+    pull) [ -n "${2:-}" ] && [ -n "${3:-}" ] || { echo "usage: ./run dist:models pull <name> <dest>" >&2; return 1; }; rclone copy ":s3:$_DIST_SPACE/models/$2/" "$3" -P ;;
+    ls) rclone ls ":s3:$_DIST_SPACE/models/" 2>/dev/null || echo "(empty)" ;;
+    *) echo "usage: ./run dist:models {push <src> <name>|pull <name> <dest>|ls}" >&2; return 1 ;;
+  esac
+}
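
The publish and fetch verbs above send their script to the worker on stdin (`bash -s` fed by a heredoc), which is why `ssh -n` had to go for them: `-n` redirects stdin from /dev/null, so the remote shell sees end-of-file immediately and runs nothing. The effect can be reproduced locally without ssh. In this sketch `remote_sh` and `remote_sh_n` are hypothetical stand-ins for `ssh host 'bash -s'` and `ssh -n host 'bash -s'` respectively.

```shell
# Hypothetical stand-ins for the two ssh invocations:
#   remote_sh    ~ ssh host 'bash -s'      (script arrives on stdin)
#   remote_sh_n  ~ ssh -n host 'bash -s'   (stdin redirected from /dev/null)
remote_sh()   { sh -s; }
remote_sh_n() { sh -s </dev/null; }

with_stdin="$(remote_sh <<'REMOTE'
echo "ran on worker"
REMOTE
)"

with_n="$(remote_sh_n <<'REMOTE'
echo "ran on worker"
REMOTE
)"

echo "stdin passed: '$with_stdin'"
echo "with -n:      '$with_n'"
```

The second call exits successfully with no output, which is what makes this failure mode easy to miss. Verbs that pass the script as a quoted command argument (as `dist:test` and `dist:build` do) are unaffected and can keep `-n`.
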
@@ -10,9 +10,12 @@
 | `./run dist:up <N> [size] [region]` | boot N workers from the golden image; **waits for cloud-init readiness** before returning |
 | `./run dist:test` | `cargo test --workspace` (nextest) on a worker |
 | `./run dist:build` | `cargo build` + WASM on a worker; rsync the WASM back (native `.so` is linux-only, stays on the worker) |
+| `./run dist:publish` | **build once → upload the linux `.so` + wasm to the `magicciv-artifacts` Space, keyed by git sha**. The producer side of build-once-load-many. |
+| `./run dist:fetch` | download the prebuilt `.so` for the worker's HEAD sha into the addon path — skip recompiling. Nonzero on cache miss. |
 | `./run dist:sim <games> [turns] [--destroy-after]` | fan seeded sims across workers via `autoplay-batch.sh` `AUTOPLAY_HOST`+`SEED_OFFSET`; results merge in `.local/iter/<stamp>/` |
 | `./run dist:render <res://scene.tscn> <out.png>` | render a proof scene (software weston + Mesa, **no GPU**) and pull the PNG back — replaces the dead apricot `$SCREENSHOT_HOST` |
-| `./run dist:sync [ref]` | `git pull` + rebuild gdext on **live** workers (mid-session code change, no image rebuild) |
+| `./run dist:sync [ref]` | `git pull` on **live** workers, then **fetch the prebuilt `.so` from the Space if published for that sha, else build** — N workers share one build instead of N recompiles |
+| `./run dist:models {push <src> <name>\|pull <name> <dest>\|ls}` | share RL model artifacts (`.onnx`) via the Space; runs locally on plum (models are platform-independent) |
 | `./run dist:image [--cold]` | **(re)build the golden image — incremental by default** (layers on the last snapshot, ~8 min; provision.sh is idempotent so only the delta rebuilds). `--cold` = from stock Ubuntu (~20 min), reset cruft |
 | `./run dist:prune [keep=2]` | delete superseded golden snapshots (~$0.40/mo each); keeps the newest N |
 | `./run dist:down` | tear the fleet down → **$0** |
@@ -24,7 +27,24 @@
 - **Forge**: `mc-forge` droplet running Forgejo; repo `mcadmin/magicciv`; IP + admin creds in `~/.vault/mc_forge_creds`.
 - **Golden image**: Packer `infra/packer/`, auto-discovered by the fleet (snapshot name prefix `mc-golden`). Bakes: toolchain (via `scripts/dev-setup/linux.sh`) + prebuilt GDExtension `.so` + warm Godot import + **weston/Mesa render stack** + **mold + sccache** build accelerators + the fleet ssh key in `mc`'s `authorized_keys`.
 - **Fleet TF**: `infra/terraform/test-fleet/` — DO provider, golden-image data-source discovery, grouped under the `mc:dev` DO project, mocked-provider test suite.
-- **Secrets**: `~/.vault/{do_pat_mc, mc_forge_creds}` (600). Key `~/.ssh/id_mc_fleet` (DO key `mc-fleet`).
+- **Secrets**: `~/.vault/{do_pat_mc, mc_forge_creds, do-spaces-uvlava.access, do-spaces-uvlava.secret}` (600). Key `~/.ssh/id_mc_fleet` (DO key `mc-fleet`).
+- **Artifact Space**: `magicciv-artifacts` (DO Spaces, nyc3) — `builds/<sha>/` holds the prebuilt linux `.so`+wasm; `models/<run>/` holds shared RL `.onnx`. Account already pays the Spaces subscription (for `lilith-quinn-media`), so this Space adds ~$0 base. Workers carry `rclone` (baked by `provision.sh`); the dispatch passes the Spaces creds as `RCLONE_S3_*` env over ssh (never stored on the worker, never on argv).
+
+## Build once, load many (the artifact Space)
+
+Fan-out used to mean N workers each recompiling the gdext. Now: **one** `dist:publish` builds + uploads the `.so` keyed by sha; every consumer (`dist:sync`, sim/test/render workers) **fetches** it. This *complements* sccache — sccache caches crate *compilation*, the Space caches the *final `.so`/wasm/models*.
+
+```
+./run dist:up 3          # 3 workers from the golden image
+./run dist:publish       # worker 1 builds + uploads builds/<sha>/ (once)
+./run dist:sync          # all 3 workers fetch the prebuilt .so (no recompile)
+./run dist:sim 300 200   # fan sims; teardown when done
+./run dist:down
+```
+
+- Keyed by **git sha** — a different sha is a cache miss → `dist:sync` falls back to building. (Toolchain changes ride the golden-image rebuild, which re-publishes.)
+- The `.so` is **linux x86_64 only** — this Space serves DO/linux runners; plum builds its own macOS `.dylib`.
+- If the Spaces creds are absent, `dist:sync` silently degrades to build-on-each-worker (no breakage).
 
 ## Gotchas every agent must respect
 
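
The fetch-or-build rule that `dist:sync` applies per worker (try the artifact keyed by the checked-out sha, fall back to a local build on a miss) can be sketched without rclone or a fleet. The names here are assumptions for illustration only: a temporary directory stands in for the `magicciv-artifacts` Space, `cp` stands in for `rclone copyto`, `fake_build` stands in for `build-gdext.sh`, and the shas and `lib.so` file name are invented.

```shell
# Assumption: a local directory stands in for the artifact Space, and
# fake_build stands in for build-gdext.sh. Neither is the project's real code.
space="$(mktemp -d)"
work="$(mktemp -d)"

fake_build() { printf 'built locally\n' > "$1"; }

# fetch_or_build <sha> <dest>: copy the published artifact if present, else build.
fetch_or_build() {
  if cp "$space/builds/$1/lib.so" "$2" 2>/dev/null; then
    echo "[$1] fetched prebuilt .so (no rebuild)"
  else
    fake_build "$2" && echo "[$1] built .so (cache miss)"
  fi
}

# Publish an artifact for one sha only.
mkdir -p "$space/builds/aaa111"
printf 'prebuilt\n' > "$space/builds/aaa111/lib.so"

hit="$(fetch_or_build aaa111 "$work/hit.so")"    # sha was published: fetched
miss="$(fetch_or_build bbb222 "$work/miss.so")"  # sha was not published: built
echo "$hit"
echo "$miss"
rm -rf "$space" "$work"
```

Because the key is the exact sha, any new commit is a miss until someone publishes it, which matches the fallback behaviour described in the bullets of the documentation hunk.
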