magicciv/infra/terraform/test-fleet/README.md
Natalie f5c5d1a410 feat(infra): distributed test/train fleet on DigitalOcean (Terraform + Packer + dispatch)
Ephemeral CPU Droplet fleet that horizontally scales the iteration loop:
- infra/terraform/test-fleet: cattle Droplets from a golden image (auto-discovered
  by name via digitalocean_images), grouped under the mc:dev DO project, with a
  mocked-provider test suite (no token/spend).
- infra/packer: golden-image builder reusing scripts/dev-setup/linux.sh.
- scripts/run/dist.sh: ./run dist:{check,up,sim,train,down} — shard sim/test
  batches across workers via autoplay-batch AUTOPLAY_HOST+SEED_OFFSET.
GPU intentionally absent (workload is CPU-bound per docs/ai-production.md).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:51:09 -04:00


test-fleet — distributed test/train infra (DigitalOcean)

Horizontally scales the iteration loop onto cheap ephemeral DigitalOcean Droplets. One local command fans seeded sim batches (or RL training) across N disposable workers, collects results locally, and tears the fleet down. Idle cost ≈ $0: the fleet defaults to 0 workers, so only the golden image bills (~$0.40/mo).

Layers

| Layer | Where | What |
|---|---|---|
| Golden image | ../../packer/ | Packer bakes toolchain + warm clone + prebuilt .so → custom image |
| Fleet | here | workers = N Droplets from the image, auto-discovered by name |
| Dispatch | scripts/run/dist.sh | shard → fan out over ssh → collect → merge → teardown |

Offline verification (no token, no spend)

./run dist:check    # terraform fmt + validate (schema typecheck) + mocked-provider test

Run it anytime — before you even have a DO account. It uses a mocked provider.

One-time setup

  1. DigitalOcean: in the Control Panel → API → Tokens, generate a personal access token with read+write scope.
  2. GitLab: push the repo; note the clone URL (the workers' origin).
  3. Build the golden image once (see ../../packer/golden-image.pkr.hcl):
    export DIGITALOCEAN_TOKEN=<token>
    packer init   ../../packer/golden-image.pkr.hcl
    packer build  -var git_remote=<gitlab-url> ../../packer/golden-image.pkr.hcl
    
  4. Auth env for Terraform/dispatch:
    export TF_VAR_do_token=<token>
    cp terraform.tfvars.example terraform.tfvars   # set git_remote
    

Daily use

./run dist:up 10                 # 10 Droplets boot from the golden image (~30s)
./run dist:sim 200 300           # 200 games / turn-limit 300, sharded 20/worker
./run dist:down                  # destroy the fleet → back to ~$0
# or fold teardown into the run:
./run dist:sim 200 300 --destroy-after

Results land merged under .local/iter/<stamp>/ (disjoint seed numbers per worker via SEED_OFFSET, so no collisions). RL sweeps: ./run dist:train <steps>.
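The seed-disjointness idea can be sketched as follows. This is a minimal illustration, not the logic in scripts/run/dist.sh: the `shard_plan` helper and its output format are hypothetical, and it assumes each worker plays a contiguous block of seeds starting at its SEED_OFFSET.

```bash
#!/usr/bin/env bash
# Hypothetical helper, NOT taken from scripts/run/dist.sh.
# Prints one line per worker: "<worker-index> <games> <seed-offset>".
shard_plan() {
  local total=$1 workers=$2
  (( workers > 0 )) || { echo "workers must be > 0" >&2; return 1; }
  local base=$(( total / workers )) extra=$(( total % workers ))
  local offset=0 i count
  for (( i = 0; i < workers; i++ )); do
    count=$base
    # The first `extra` workers absorb the remainder, one game each.
    if (( i < extra )); then count=$(( base + 1 )); fi
    echo "$i $count $offset"
    # The next worker starts where this one ends, so seed ranges never overlap.
    offset=$(( offset + count ))
  done
}

# 200 games over 10 workers: 20 games each, offsets 0, 20, ..., 180.
shard_plan 200 10
```

Because every offset is the running sum of the earlier shard sizes, the per-worker results can be merged without renumbering, even when the game count does not divide evenly.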

Cost

Pure pay-as-you-go, billed hourly only while workers > 0 (⚠️ approximate — confirm in the DO console):

| workload | size | rough cost |
|---|---|---|
| dist:sim fan-out (bursty) | Basic s-8vcpu-16gb | $0.12/hr; a 10×30-min run ≈ **$0.60** |
| dist:train (sustained, hours @100%) | CPU-Optimized c-8 (./run dist:up N c-8) | ~$0.25/hr |
| idle (fleet down) | image storage only | ~$0.40/mo ($0.06/GB/mo) |
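The run-cost figure in the table is plain arithmetic: workers × hours × hourly rate. A small sketch of that calculation follows; `fleet_cost` is a hypothetical helper, not part of the repo, and the rates are the approximate ones quoted above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimated run cost in dollars.
# Usage: fleet_cost <workers> <hours> <hourly-rate>
fleet_cost() {
  awk -v n="$1" -v h="$2" -v r="$3" 'BEGIN { printf "%.2f\n", n * h * r }'
}

# 10 workers for 30 minutes at the quoted $0.12/hr Basic rate.
fleet_cost 10 0.5 0.12   # prints 0.60
```

The same call with the quoted c-8 rate gives a rough figure for a training sweep, for example `fleet_cost 4 6 0.25` for four workers over six hours.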

DigitalOcean runs ~23× Hetzner's per-core price, but the cattle model keeps each run to cents-to-a-dollar since you only pay hourly while a fleet is up. Use a CPU-Optimized c-* for long training runs, Basic s-* for short test fan-out.

Design notes / caveats

  • No persistent volume. Workers are stateless; the golden image carries the warm clone + toolchain + prebuilt GDExtension. Results leave via scp/rsync.
  • Image auto-discovery. data.digitalocean_images.golden selects the newest custom image whose name contains mc-golden (filter match_by = "substring", sort created desc); rebuild with Packer and the fleet picks it up — no ID edits. Set -var base_image=ubuntu-24-04-x64 only to test terraform plan before any image exists.
  • Coordinator needs GNU coreutils. tools/autoplay-batch.sh uses realpath -m; on macOS install coreutils or run the dispatch from a Linux host.
  • State holds the token. *.tfstate and terraform.tfvars are gitignored.
  • GPU is intentionally absent: the workload is CPU-bound (docs/ai-production.md); rent a DO GPU Droplet only if a profiler ever shows the GPU saturated.
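For the GNU coreutils caveat above, a coordinator can probe for `realpath -m` before dispatching. The following is a minimal sketch under the assumption that a failed probe means the flag is unsupported; `supports_realpath_m` is a hypothetical helper and is not part of the repo.

```bash
#!/usr/bin/env bash
# Hypothetical preflight, NOT part of the repo.
# Succeeds if the given realpath command (default: realpath) accepts -m,
# the GNU flag that resolves paths whose components may not exist.
supports_realpath_m() {
  local cmd=${1:-realpath}
  "$cmd" -m /nonexistent/dir/file >/dev/null 2>&1
}

if supports_realpath_m; then
  echo "realpath -m: ok"
else
  echo "realpath -m: unsupported; install GNU coreutils or dispatch from a Linux host" >&2
fi
```

The optional argument lets you probe an alternative binary, for example `supports_realpath_m grealpath` if your coreutils install exposes g-prefixed names.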