Ephemeral CPU Droplet fleet that horizontally scales the iteration loop:
- infra/terraform/test-fleet: cattle Droplets from a golden image (auto-discovered
by name via digitalocean_images), grouped under the mc:dev DO project, with a
mocked-provider test suite (no token/spend).
- infra/packer: golden-image builder reusing scripts/dev-setup/linux.sh.
- scripts/run/dist.sh: ./run dist:{check,up,sim,train,down} — shard sim/test
batches across workers via autoplay-batch AUTOPLAY_HOST+SEED_OFFSET.
GPU intentionally absent (workload is CPU-bound per docs/ai-production.md).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
test-fleet — distributed test/train infra (DigitalOcean)
Horizontally scales the iteration loop onto cheap ephemeral DigitalOcean Droplets. One local command fans seeded sim batches (or RL training) across N disposable workers, collects results locally, and tears the fleet down. Idle cost ≈ €0 (fleet defaults to 0 workers; only the golden image bills, ~$0.40/mo).
Layers
| Layer | Where | What |
|---|---|---|
| Golden image | ../../packer/ | Packer bakes toolchain + warm clone + prebuilt .so → custom image |
| Fleet | here | workers = N Droplets from the image, auto-discovered by name |
| Dispatch | scripts/run/dist.sh | shard → fan out over ssh → collect → merge → teardown |
Offline verification (no token, no spend)
./run dist:check # terraform fmt + validate (schema typecheck) + mocked-provider test
Run it anytime — before you even have a DO account. It uses a mocked provider.
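The three stages named in the comment above map onto standard Terraform subcommands. The following is only a sketch of what such a check could look like: the real scripts/run/dist.sh is not shown here, so the function name `dist_check_sketch`, the `TF` override variable, and the exact flags are assumptions.

```shell
# Hypothetical sketch of an offline check; not the real dist.sh.
# TF lets a caller substitute another binary (or a stub) for terraform.
TF="${TF:-terraform}"

dist_check_sketch() {
  dir="${1:-infra/terraform/test-fleet}"
  # fmt -check reports unformatted files instead of rewriting them.
  "$TF" -chdir="$dir" fmt -check -recursive &&
  # -backend=false skips state backend setup, so no DO token is involved.
  "$TF" -chdir="$dir" init -backend=false -input=false &&
  # validate typechecks the configuration against the provider schema.
  "$TF" -chdir="$dir" validate &&
  # test runs the *.tftest.hcl files; provider mocking needs Terraform 1.7+.
  "$TF" -chdir="$dir" test
}
```

The `&&` chain stops at the first failing stage, so a formatting error is reported before any provider plugins are downloaded.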
One-time setup
- DigitalOcean: in the Control Panel → API → Tokens, generate a personal access token with read+write scope.
- GitLab: push the repo; note the clone URL (the workers' origin).
- Build the golden image once (see ../../packer/golden-image.pkr.hcl):

      export DIGITALOCEAN_TOKEN=<token>
      packer init ../../packer/golden-image.pkr.hcl
      packer build -var git_remote=<gitlab-url> ../../packer/golden-image.pkr.hcl

- Auth env for Terraform/dispatch:

      export TF_VAR_do_token=<token>
      cp terraform.tfvars.example terraform.tfvars # set git_remote
Daily use
./run dist:up 10 # 10 Droplets boot from the golden image (~30s)
./run dist:sim 200 300 # 200 games / turn-limit 300, sharded 20/worker
./run dist:down # destroy the fleet → back to ~€0
# or fold teardown into the run:
./run dist:sim 200 300 --destroy-after
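Whether `--destroy-after` also tears the fleet down when the sim step itself fails is not stated here. If a guaranteed teardown is wanted, one option is a small wrapper with an EXIT trap. This is a sketch under assumptions: `fleet_sim` and the `RUN` override are hypothetical names, and only the `dist:up`, `dist:sim`, and `dist:down` subcommands come from this README.

```shell
# Hypothetical wrapper: tear the fleet down even if the sim step fails.
# RUN defaults to the repo's ./run entry point; override it to dry-run.
RUN="${RUN:-./run}"

fleet_sim() (
  # The body runs in a subshell, so this EXIT trap fires when the function
  # ends, whether the chain below succeeded or not.
  trap '"$RUN" dist:down' EXIT
  "$RUN" dist:up "$1" && "$RUN" dist:sim "$2" "$3"
)
```

Usage would be `fleet_sim 10 200 300`. The function returns the status of the up/sim chain, so a failed run still reports failure after the teardown has been attempted.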
Results land merged under .local/iter/<stamp>/ (disjoint seed numbers per
worker via SEED_OFFSET, so no collisions). RL sweeps: ./run dist:train <steps>.
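To illustrate how disjoint seed ranges can fall out of a per-worker offset, here is a minimal sketch. Only `SEED_OFFSET` is named in this README; the helper `shard_plan`, the assumption that seeds are consecutive integers starting at 0, and the way a remainder is distributed are all hypothetical, and the real dist.sh may split batches differently.

```shell
# Hypothetical sharding helper: print "worker games seed_offset" per worker.
# Assumes consecutive integer seeds starting at 0; any remainder is spread
# over the first workers.
shard_plan() {
  games=$1
  workers=$2
  base=$((games / workers))
  extra=$((games % workers))
  offset=0
  i=0
  while [ "$i" -lt "$workers" ]; do
    n=$base
    if [ "$i" -lt "$extra" ]; then
      n=$((n + 1))
    fi
    echo "$i $n $offset"
    offset=$((offset + n))   # the next worker starts where this one ends
    i=$((i + 1))
  done
}
```

For the example above, `shard_plan 200 10` gives every worker 20 games with offsets 0, 20, ..., 180, so no two workers ever play the same seed.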
Cost
Pure pay-as-you-go, billed hourly only while workers > 0 (⚠️ approximate — confirm in the DO console):
| | size | rough cost |
|---|---|---|
| dist:sim fan-out (bursty) | Basic s-8vcpu-16gb | |
| dist:train (sustained, hours @100%) | CPU-Optimized c-8 (./run dist:up N c-8) | ~$0.25/hr |
| idle (fleet down) | image storage only | ~$0.40/mo ($0.06/GB/mo) |
DigitalOcean runs ~2–3× Hetzner's per-core price, but the cattle model keeps each
run to cents-to-a-dollar since you only pay hourly while a fleet is up. Use a
CPU-Optimized c-* for long training runs, Basic s-* for short test fan-out.
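As a back-of-the-envelope check on the "cents-to-a-dollar" claim, the cost of a run is roughly workers × hours × hourly rate. The helper below is a sketch: `run_cost` is a hypothetical name, the ~$0.25/hr figure is the approximate c-8 rate quoted above, and rounding to the provider's billing increment is ignored.

```shell
# Rough run cost in USD = workers x hours x hourly rate.
# Ignores billing-increment rounding and image storage.
run_cost() {
  awk -v w="$1" -v h="$2" -v r="$3" 'BEGIN { printf "%.2f\n", w * h * r }'
}

# Ten c-8 workers for half an hour at ~$0.25/hr.
run_cost 10 0.5 0.25   # prints 1.25
```

The same arithmetic shows why teardown matters: ten such workers left up for a full day would come to about $60 at that rate.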
Design notes / caveats
- No persistent volume. Workers are stateless; the golden image carries the
  warm clone + toolchain + prebuilt GDExtension. Results leave via scp/rsync.
- Image auto-discovery. data.digitalocean_images.golden selects the newest
  custom image whose name contains mc-golden (filter match_by = "substring",
  sort created desc); rebuild with Packer and the fleet picks it up, with no ID
  edits. Set -var base_image=ubuntu-24-04-x64 only to test terraform plan
  before any image exists.
- Coordinator needs GNU coreutils. tools/autoplay-batch.sh uses realpath -m; on
  macOS install coreutils or run the dispatch from a Linux host.
- State holds the token: *.tfstate and terraform.tfvars are gitignored.
- GPU is intentionally absent: the workload is CPU-bound
  (docs/ai-production.md); rent a DO GPU Droplet only if a profiler ever shows
  the GPU saturated.
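The coreutils caveat can be checked before dispatching. The preflight below is a sketch, not part of the repo: `pick_realpath` is a hypothetical helper that probes for a `realpath` accepting `-m`, falling back to `grealpath`, the name under which Homebrew's coreutils installs the GNU version on macOS.

```shell
# Hypothetical preflight: print the name of a realpath that supports -m.
# GNU realpath accepts -m (resolve even when path components are missing).
pick_realpath() {
  for cmd in realpath grealpath; do
    if "$cmd" -m /nonexistent/path >/dev/null 2>&1; then
      echo "$cmd"
      return 0
    fi
  done
  echo "no GNU realpath found; install coreutils or use a Linux host" >&2
  return 1
}
```

If this prints `grealpath`, the batch script would still call plain `realpath`, so the GNU tools would also need to come first on `PATH` for the dispatch to work unmodified.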