Natalie de5fbd42c4 feat(tooling): ✨ add apricot gpu device guidance

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>

2026-05-17 04:02:09 -07:00

Magic Civilization RL self-play

An open-source RL alternative to the cloud-LLM "Claude plays the game" loop. It wraps scripts/player-api-server.sh (the generic JSON-Lines player-API harness) as a Gymnasium environment, then trains a MaskablePPO policy against the harness's built-in AI as a frozen opponent. It reports win rate against that baseline, so we can see when the trained policy starts beating our shipping MCTS.

Why this stack

OpenSpiel was the obvious first choice for multi-agent general-sum games with built-in AlphaZero/MuZero, but the action space requires a custom Game C++ wrapper or an awkward Python-side adapter, which is too much boilerplate for the iteration speed we want.
alpha-zero-general is 2-player-only and doesn't compose with Magic Civilization's diplomacy actions cleanly.
stable-baselines3 + sb3-contrib MaskablePPO with the harness as a Gym env gets us a working loop in three files of Python with action-masking out of the box. We give up MuZero-style planning, but the harness already calls into our own MCTS for opponent slots — the RL policy doesn't need to plan ahead, it needs a good policy net.

See literature pointers at the bottom of this README for why this is the right shape.

Files

File	Role
`harness_client.py`	Subprocess wrapper around `player-api-server.sh`. JSON-Lines pump with typed `view`/`act`/`end_turn`/`shutdown`. Raises `HarnessError` on protocol violations.
`encoders.py`	`PlayerView` → fixed-shape `np.float32` observation; `legal_actions` → fixed-size discrete action index + boolean mask.
`magic_civ_env.py`	`gymnasium.Env` subclass exposing the harness as one episode = one game. Implements `action_masks()` for MaskablePPO.
`train.py`	CLI entry. Builds K parallel envs (each its own harness), runs MaskablePPO, periodically evaluates against the same baseline, saves best model.
`evaluate.py`	Standalone eval — load a saved model, run N games, print `{episodes, wins, losses, draws, win_rate, mean_turns}` JSON.
`smoke.py`	Stdlib-only CI gate. Drives the harness + encoders through a random-policy loop without importing `gymnasium`/`sb3`/`torch`. Prints a one-line JSON verdict; exit 0 on `passed: true`. Run before any training session to confirm the protocol layer is intact.
`requirements.txt`	Pinned versions; `pip install -r requirements.txt` is the one-time setup.
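
The `legal_actions` → index + mask step that `encoders.py` is described as performing can be sketched as follows. This is a minimal illustration, not the actual encoder: the action names, the `ACTION_VOCAB` table, and both function names are hypothetical and chosen only to show the shape of the mapping.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical fixed action vocabulary (assumption); the real table lives in encoders.py.
ACTION_VOCAB: Tuple[str, ...] = ("end_turn", "move_n", "move_s", "found_city", "attack")
ACTION_INDEX: Dict[str, int] = {name: i for i, name in enumerate(ACTION_VOCAB)}


def encode_mask(legal_actions: Sequence[str]) -> List[bool]:
    """Return a fixed-size boolean mask: True where the action index is legal."""
    mask = [False] * len(ACTION_VOCAB)
    for name in legal_actions:
        idx = ACTION_INDEX.get(name)
        if idx is not None:  # actions outside the vocabulary are dropped (lossy)
            mask[idx] = True
    return mask


def decode_action(index: int, legal_actions: Sequence[str]) -> str:
    """Map a sampled discrete index back to an action name, rejecting illegal picks."""
    name = ACTION_VOCAB[index]
    if name not in legal_actions:
        raise ValueError(f"action index {index} ({name}) is not legal this step")
    return name


if __name__ == "__main__":
    legal = ["end_turn", "attack"]
    print(encode_mask(legal))
    print(decode_action(4, legal))
```

Keeping the vocabulary fixed is what lets the policy head have a constant output size; the cost is that any harness action with no slot in the table is silently unreachable.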

Methodology

Frozen opponent: the harness runs our shipping MCTS as the AI for slots 1..N. The RL policy controls slot 0. The opponent's strength stays constant while the policy trains, so improvement is directly measurable.
Iterate until beat: training runs until the eval win rate against the frozen opponent crosses --target-win-rate (default 0.55). Once it crosses, save the model as a "graduated" snapshot and raise the target for the next run; eventually the graduated snapshot becomes the new frozen opponent and training continues against it (the classic AlphaZero-style curriculum).
Action mask is load-bearing: MaskablePPO zeros the sampling distribution at masked positions. Without it, the policy spends half its time learning that 95% of action indices are illegal.
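
The effect of the mask on sampling can be illustrated with a small masked softmax. This is a stdlib-only sketch of the idea, not sb3-contrib's implementation; `masked_probs` is a hypothetical helper name.

```python
import math
from typing import List, Sequence


def masked_probs(logits: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Softmax over legal actions only; illegal actions get probability exactly 0."""
    if len(logits) != len(mask):
        raise ValueError("logits and mask must have the same length")
    if not any(mask):
        raise ValueError("at least one action must be legal")
    # Subtract the largest legal logit for numerical stability.
    peak = max(l for l, ok in zip(logits, mask) if ok)
    weights = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Index 1 has the highest logit but is illegal, so it can never be sampled.
    print(masked_probs([1.0, 5.0, 1.0], [True, False, True]))
```

Because illegal indices never receive probability mass, the policy gradient is spent only on choosing among legal actions rather than on discovering which indices are illegal.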

Run it

Smoke test the protocol layer first (no heavy deps required):

cd /Users/natalie/Code/@projects/@magic-civilization
python3 -m tooling.rl_self_play.smoke --turns 30
# → {"steps": 332, "turns_reached": 30, "mask_violations": 0,
#    "harness_errors": 0, "passed": true}
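
If the smoke verdict needs to be checked from another script rather than by exit code, a sketch along these lines could parse it. The field names are taken from the sample output above; `smoke_passed` is a hypothetical helper, and the extra checks on `mask_violations` and `harness_errors` are a stricter assumption than the documented `passed: true` rule.

```python
import json


def smoke_passed(verdict_line: str) -> bool:
    """Return True only if the one-line JSON verdict reports a clean pass."""
    verdict = json.loads(verdict_line)
    if not isinstance(verdict, dict):
        return False
    return (
        verdict.get("passed") is True
        # Assumption: treat any violation or harness error as a failure too.
        and verdict.get("mask_violations", 0) == 0
        and verdict.get("harness_errors", 0) == 0
    )


if __name__ == "__main__":
    sample = (
        '{"steps": 332, "turns_reached": 30, "mask_violations": 0, '
        '"harness_errors": 0, "passed": true}'
    )
    print(smoke_passed(sample))
```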

Then install RL deps and train:

pip install -r tooling/rl_self_play/requirements.txt
python -m tooling.rl_self_play.train --total-steps 1_000_000 --num-envs 4
# In a second terminal:
tensorboard --logdir tooling/rl_self_play/runs/

Apricot GPU layout

Apricot has 2× NVIDIA RTX 3090 (24 GB each). The typical division:

cuda:0 — model-boss inference / commit-message daemon (frequently busy).
cuda:1 — free; use this for RL training to avoid contention.

ssh apricot
cd ~/Code/project-buildspace/magic-civilization   # or wherever the canonical checkout lives
pip install -r tooling/rl_self_play/requirements.txt   # one-time
python -m tooling.rl_self_play.train --device cuda:1 --num-envs 8 --total-steps 5_000_000

--device auto is the safe default for a single-GPU box or local Mac (mps on Apple Silicon). The MlpPolicy this scaffold uses fits in well under 1 GB VRAM, so the bottleneck is the harness CPU subprocesses rather than the GPU. Raise --num-envs (one harness each) to keep the GPU fed.
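
One way `--device auto` could resolve is sketched below. This is an assumption about the behavior, not a copy of what `train.py` does; `resolve_device` is a hypothetical name, and the availability flags are passed in explicitly so the sketch runs without `torch`. In real code they could come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def resolve_device(requested: str, cuda_available: bool, mps_available: bool) -> str:
    """Resolve a --device value to a concrete device string."""
    if requested != "auto":
        # Explicit choices such as "cuda:1" or "cpu" are passed through untouched.
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(resolve_device("auto", cuda_available=False, mps_available=True))
    print(resolve_device("cuda:1", cuda_available=True, mps_available=False))
```

On apricot the explicit `cuda:1` form is still preferable, since an automatic choice has no way to know that `cuda:0` is usually busy.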

For evaluation only (no training):

python -m tooling.rl_self_play.evaluate \
  --model-path tooling/rl_self_play/models/duel-v1/best_model.zip \
  --episodes 50
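
The summary that `evaluate.py` prints, and the graduation check against `--target-win-rate`, can be sketched as follows. This is an illustration under stated assumptions: `summarize` and `graduated` are hypothetical names, win rate is taken as wins divided by all episodes (draws count against it), and "crosses" is read as greater than or equal to the target.

```python
from typing import Dict, List, Tuple, Union

Summary = Dict[str, Union[int, float]]


def summarize(results: List[Tuple[str, int]]) -> Summary:
    """Aggregate (outcome, turns) pairs; outcome is "win", "loss", or "draw"."""
    if not results:
        raise ValueError("need at least one episode")
    counts = {"win": 0, "loss": 0, "draw": 0}
    for outcome, _turns in results:
        if outcome not in counts:
            raise ValueError(f"unknown outcome: {outcome!r}")
        counts[outcome] += 1
    episodes = len(results)
    return {
        "episodes": episodes,
        "wins": counts["win"],
        "losses": counts["loss"],
        "draws": counts["draw"],
        "win_rate": counts["win"] / episodes,  # assumption: draws are not wins
        "mean_turns": sum(turns for _o, turns in results) / episodes,
    }


def graduated(summary: Summary, target_win_rate: float = 0.55) -> bool:
    """True when the eval win rate meets or exceeds the target."""
    return summary["win_rate"] >= target_win_rate


if __name__ == "__main__":
    games = [("win", 100), ("loss", 200), ("win", 150), ("draw", 200)]
    report = summarize(games)
    print(report, graduated(report))
```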

Honest caveats

The first successful win against the baseline will take hours to days of training on apricot's 64-core box. Magic Civilization has a large action space and 200-turn horizons; PPO is sample-inefficient compared to AlphaZero, and our action encoding is lossy (see the TODO list in encoders.py).
Action encoding discards information: per-tile detail isn't in the observation; the action head can only target adjacent hexes, not arbitrary positions. Upgrade to a CNN-on-tile-grid observation plus a hierarchical action head once the basic loop is winning at least occasionally.
Single-slot only: this is 1v1 duel-mode for now. The 5-player huge-map setup that p1-22a validated needs MAX_PLAYERS=5 in the env config and 4 frozen opponents — straightforward extension once the duel loop trains.

References

OpenSpiel: A Framework for RL in Games — the canonical multi-agent RL framework; the right move once we need MuZero-style planning.
stable-baselines3 + sb3-contrib MaskablePPO — what we use today.
CivRealm (BIGAI, ICLR 2024) — closest analog; RL + LLM baselines for full-game Civilization.
Simulation-Driven Balancing with RL (arXiv 2503.18748) — the broader methodology this loop sits inside.
