Natalie f5c5d1a410 feat(infra): distributed test/train fleet on DigitalOcean (Terraform + Packer + dispatch)

Ephemeral CPU Droplet fleet that horizontally scales the iteration loop:
- infra/terraform/test-fleet: cattle Droplets from a golden image (auto-discovered
  by name via digitalocean_images), grouped under the mc:dev DO project, with a
  mocked-provider test suite (no token/spend).
- infra/packer: golden-image builder reusing scripts/dev-setup/linux.sh.
- scripts/run/dist.sh: ./run dist:{check,up,sim,train,down} — shard sim/test
  batches across workers via autoplay-batch AUTOPLAY_HOST+SEED_OFFSET.
GPU intentionally absent (workload is CPU-bound per docs/ai-production.md).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-27 08:51:09 -04:00


test-fleet — distributed test/train infra (DigitalOcean)

Horizontally scales the iteration loop onto cheap ephemeral DigitalOcean Droplets. One local command fans seeded sim batches (or RL training) across N disposable workers, collects results locally, and tears the fleet down. Idle cost ≈ $0 (the fleet defaults to 0 workers; only the golden image bills, ~$0.40/mo).

Layers

| Layer | Where | What |
| --- | --- | --- |
| Golden image | `../../packer/` | Packer bakes toolchain + warm clone + prebuilt `.so` → custom image |
| Fleet | here | `workers = N` Droplets from the image, auto-discovered by name |
| Dispatch | `scripts/run/dist.sh` | shard → fan out over ssh → collect → merge → teardown |

Offline verification (no token, no spend)

./run dist:check    # terraform fmt + validate (schema typecheck) + mocked-provider test

Run it anytime, even before you have a DO account: it uses a mocked provider, so it needs no token and incurs no spend.

One-time setup

DigitalOcean: in the Control Panel → API → Tokens, generate a personal access token with read+write scope.
GitLab: push the repo; note the clone URL (the workers' origin).

Build the golden image once (see ../../packer/golden-image.pkr.hcl):

export DIGITALOCEAN_TOKEN=<token>
packer init   ../../packer/golden-image.pkr.hcl
packer build  -var git_remote=<gitlab-url> ../../packer/golden-image.pkr.hcl

Auth env for Terraform/dispatch:

export TF_VAR_do_token=<token>
cp terraform.tfvars.example terraform.tfvars   # set git_remote

Daily use

./run dist:up 10                 # 10 Droplets boot from the golden image (~30s)
./run dist:sim 200 300           # 200 games / turn-limit 300, sharded 20/worker
./run dist:down                  # destroy the fleet → back to ~$0
# or fold teardown into the run:
./run dist:sim 200 300 --destroy-after

Results are merged under .local/iter/<stamp>/. Each worker gets a disjoint range of seed numbers via SEED_OFFSET, so shards never collide. For RL sweeps, use ./run dist:train <steps>.
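
The sharding described above can be sketched as follows. This is a minimal illustration, not the actual logic in scripts/run/dist.sh: `plan_shards` is a hypothetical helper, and the assumption is that each worker's SEED_OFFSET is simply the number of games assigned to the workers before it.

```shell
# Sketch: split a batch of games into per-worker shards with disjoint seed ranges.
# plan_shards is a hypothetical name; how dist.sh really derives SEED_OFFSET is assumed.
plan_shards() {
  games="$1"; workers="$2"
  base=$((games / workers))     # games every worker gets
  extra=$((games % workers))    # the first `extra` workers take one more
  offset=0
  i=0
  while [ "$i" -lt "$workers" ]; do
    count=$base
    if [ "$i" -lt "$extra" ]; then
      count=$((count + 1))
    fi
    echo "worker=$i SEED_OFFSET=$offset games=$count"
    offset=$((offset + count))  # next shard starts where this one ends
    i=$((i + 1))
  done
}

plan_shards 200 10
# first line: worker=0 SEED_OFFSET=0 games=20
# last line:  worker=9 SEED_OFFSET=180 games=20
```

Because each offset is the running total of earlier shard sizes, the seed ranges are contiguous and never overlap, even when the batch does not divide evenly across workers.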

Cost

Pure pay-as-you-go, billed hourly only while workers > 0 (⚠️ approximate — confirm in the DO console):

| Workload | Size | Rough cost |
| --- | --- | --- |
| `dist:sim` fan-out (bursty) | Basic `s-8vcpu-16gb` | $0.12/hr per worker; 10 workers × 30 min ≈ $0.60 |
| `dist:train` (sustained, hours @100%) | CPU-Optimized `c-8` (`./run dist:up N c-8`) | ~$0.25/hr per worker |
| idle (fleet down) | image storage only | ~$0.40/mo ($0.06/GB/mo) |
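
The fan-out estimate in the table is workers × hours × hourly rate. A small sketch of that arithmetic, using a hypothetical `fleet_cost` helper and ignoring any rounding the provider applies to partial hours:

```shell
# fleet_cost WORKERS MINUTES HOURLY_RATE -> estimated cost in USD
# Hypothetical helper; rates are the approximate figures from the table.
fleet_cost() {
  LC_ALL=C awk -v w="$1" -v m="$2" -v r="$3" \
    'BEGIN { printf "%.2f\n", w * (m / 60) * r }'
}

fleet_cost 10 30 0.12   # 10 workers x 0.5 h x $0.12/h -> 0.60
```

Treat the result as a lower bound and confirm actual charges in the DO console.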

DigitalOcean runs ~2–3× Hetzner's per-core price, but the cattle model keeps each run to cents-to-a-dollar since you only pay hourly while a fleet is up. Use a CPU-Optimized c-* for long training runs, Basic s-* for short test fan-out.
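
Since cost accrues only while a fleet is up, a forgotten fleet is the main way to overspend. One possible guard is a wrapper that always runs the teardown, even when the batch fails. The sketch below is an assumption, not part of dist.sh: `with_fleet` and the `RUN` override are hypothetical names, and it is unknown whether `--destroy-after` already covers the failure case.

```shell
# Sketch: bring a fleet up, run one sim batch, and always tear it down.
# RUN and with_fleet are hypothetical; RUN defaults to the ./run entry
# point used throughout this README.
RUN="${RUN:-./run}"

with_fleet() {
  workers="$1"; games="$2"; turn_limit="$3"
  (
    # The EXIT trap fires when this subshell ends, whether the batch
    # succeeded or failed.
    trap '"$RUN" dist:down' EXIT
    "$RUN" dist:up "$workers" && "$RUN" dist:sim "$games" "$turn_limit"
  )
}
```

A call such as `with_fleet 10 200 300` would then mirror the daily-use commands, with `dist:down` guaranteed to run last.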

Design notes / caveats

No persistent volume. Workers are stateless; the golden image carries the warm clone + toolchain + prebuilt GDExtension. Results leave via scp/rsync.
Image auto-discovery. data.digitalocean_images.golden selects the newest custom image whose name contains mc-golden (filter match_by = "substring", sort created desc); rebuild with Packer and the fleet picks it up — no ID edits. Set -var base_image=ubuntu-24-04-x64 only to test terraform plan before any image exists.
Coordinator needs GNU coreutils. tools/autoplay-batch.sh uses realpath -m; on macOS install coreutils or run the dispatch from a Linux host.
State holds the token — *.tfstate and terraform.tfvars are gitignored.
GPU is intentionally absent: the workload is CPU-bound (docs/ai-production.md); rent a DO GPU Droplet only if profiling ever shows the workload has become GPU-bound.
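
For the coreutils caveat above, a coordinator could check up front that a `realpath` supporting `-m` is available. This is a minimal sketch with a hypothetical `pick_realpath` helper; it assumes Homebrew's coreutils, which installs the GNU tool as `grealpath` on macOS. It only detects a usable binary and does not change what tools/autoplay-batch.sh calls.

```shell
# Sketch: find a realpath that supports GNU's -m flag
# (resolve paths whose components may not exist).
pick_realpath() {
  for cmd in realpath grealpath; do
    if command -v "$cmd" >/dev/null 2>&1 &&
       "$cmd" -m /nonexistent/dir/file >/dev/null 2>&1; then
      echo "$cmd"
      return 0
    fi
  done
  echo "no realpath with -m found; install GNU coreutils" >&2
  return 1
}
```

Running `pick_realpath` before dispatch turns a mid-run failure into an early, readable error.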
