magicciv/tooling/rl_self_play/README.md
Natalie de5fbd42c4 feat(tooling): add apricot gpu device guidance
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-05-17 04:02:09 -07:00


Magic Civilization RL self-play

Open-source-RL alternative to the cloud-LLM "Claude plays the game" loop. Wraps scripts/player-api-server.sh (the generic JSON-Lines player-API harness) as a Gymnasium environment, then trains a MaskablePPO policy against the harness's built-in AI as a frozen opponent. Reports win-rate against the baseline so we can see exactly when the trained policy beats our shipping MCTS.

Why this stack

  • OpenSpiel was the obvious first choice for multi-agent general-sum with built-in AlphaZero/MuZero, but the action space requires a custom Game C++ wrapper or an awkward Python-side adapter — too much boilerplate for the iteration we want.
  • alpha-zero-general is 2-player-only and doesn't compose with Magic Civilization's diplomacy actions cleanly.
  • stable-baselines3 + sb3-contrib MaskablePPO with the harness as a Gym env gets us a working loop in three files of Python with action-masking out of the box. We give up MuZero-style planning, but the harness already calls into our own MCTS for opponent slots — the RL policy doesn't need to plan ahead, it needs a good policy net.

See literature pointers at the bottom of this README for why this is the right shape.

Files

| File | Role |
| --- | --- |
| harness_client.py | Subprocess wrapper around player-api-server.sh. JSON-Lines pump with typed view/act/end_turn/shutdown. Raises HarnessError on protocol violations. |
| encoders.py | PlayerView → fixed-shape np.float32 observation; legal_actions → fixed-size discrete action index + boolean mask. |
| magic_civ_env.py | gymnasium.Env subclass exposing the harness as one episode = one game. Implements action_masks() for MaskablePPO. |
| train.py | CLI entry. Builds K parallel envs (each its own harness), runs MaskablePPO, periodically evaluates against the same baseline, saves best model. |
| evaluate.py | Standalone eval: load a saved model, run N games, print {episodes, wins, losses, draws, win_rate, mean_turns} JSON. |
| smoke.py | Stdlib-only CI gate. Drives the harness + encoders through a random-policy loop without importing gymnasium/sb3/torch. Prints a one-line JSON verdict; exit 0 on passed: true. Run before any training session to confirm the protocol layer is intact. |
| requirements.txt | Pinned versions; pip install -r requirements.txt is the one-time setup. |
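
To make the "JSON-Lines pump" idea concrete, here is a minimal sketch of a request/reply client over a child process's stdin and stdout. It is not the code in harness_client.py: the class name JsonLinesClient, the {"cmd": ...} message shape, and the echo child used in place of player-api-server.sh are all assumptions made for illustration. Only the HarnessError name comes from the table above.

```python
import json
import subprocess
import sys


class HarnessError(RuntimeError):
    """Raised when the child process breaks the JSON-Lines protocol."""


class JsonLinesClient:
    """Hypothetical client: one JSON object per line out, one per line back."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered pipes in text mode
        )

    def request(self, message):
        # Write exactly one line, then block until one reply line arrives.
        self.proc.stdin.write(json.dumps(message) + "\n")
        self.proc.stdin.flush()
        line = self.proc.stdout.readline()
        if not line:
            raise HarnessError("child closed stdout before replying")
        try:
            reply = json.loads(line)
        except json.JSONDecodeError as exc:
            raise HarnessError(f"non-JSON reply: {line!r}") from exc
        if not isinstance(reply, dict):
            raise HarnessError(f"reply is not a JSON object: {reply!r}")
        return reply

    def close(self):
        # Closing stdin lets the child see EOF and exit on its own.
        self.proc.stdin.close()
        self.proc.wait(timeout=10)
        self.proc.stdout.close()


# Stand-in child that echoes each request; the real harness is NOT modeled here.
ECHO_SERVER = (
    "import sys, json\n"
    "for line in sys.stdin:\n"
    "    msg = json.loads(line)\n"
    "    print(json.dumps({'ok': True, 'echo': msg}), flush=True)\n"
)


def demo():
    client = JsonLinesClient([sys.executable, "-c", ECHO_SERVER])
    try:
        return client.request({"cmd": "view"})  # "cmd" is an assumed field name
    finally:
        client.close()


if __name__ == "__main__":
    print(demo())
```

Keeping the exchange strictly one line out, one line in makes protocol violations easy to detect: an empty read or an unparseable line is raised immediately instead of desynchronizing later requests.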

Methodology

  1. Frozen opponent: the harness ships our shipping MCTS as the slot-1..N AI. The RL policy controls slot 0. The opponent's strength is constant while the policy trains, so improvement is directly measurable.
  2. Iterate until it wins: training runs until the eval win-rate against the frozen opponent crosses --target-win-rate (default 0.55). Once it does, save the model as a "graduated" snapshot and raise the target for the next run. Eventually, use the graduated snapshot as the new frozen opponent and retrain against it (the classic AlphaZero-style curriculum).
  3. Action mask is load-bearing: MaskablePPO zeros the sampling distribution at masked positions. Without it, the policy spends half its time learning that 95% of action indices are illegal.
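
The effect of the action mask in point 3 can be illustrated with a small stdlib-only sketch: a softmax restricted to legal actions, so that illegal indices receive probability exactly zero and are never sampled. This is not sb3-contrib's implementation; the function names and the example logits and mask are assumptions chosen for illustration.

```python
import math
import random


def masked_probabilities(logits, mask):
    """Softmax over legal actions only; illegal actions get probability 0."""
    if len(logits) != len(mask):
        raise ValueError("logits and mask must have the same length")
    if not any(mask):
        raise ValueError("at least one action must be legal")
    # Subtract the largest legal logit for numerical stability.
    peak = max(l for l, ok in zip(logits, mask) if ok)
    weights = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(logits, mask, rng):
    """Draw one action index from the masked distribution."""
    probs = masked_probabilities(logits, mask)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical 4-action example: indices 1 and 3 are illegal this step,
# even though index 3 has the highest raw logit.
logits = [2.0, 0.5, -1.0, 3.0]
mask = [True, False, True, False]
probs = masked_probabilities(logits, mask)

rng = random.Random(0)
samples = [sample_action(logits, mask, rng) for _ in range(1000)]
```

Because the illegal entries carry zero weight, every index in samples is either 0 or 2, so no training step is spent discovering that index 3 is unavailable.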

Run it

Smoke test the protocol layer first (no heavy deps required):

cd /Users/natalie/Code/@projects/@magic-civilization
python3 -m tooling.rl_self_play.smoke --turns 30
# → {"steps": 332, "turns_reached": 30, "mask_violations": 0,
#    "harness_errors": 0, "passed": true}

Then install RL deps and train:

pip install -r tooling/rl_self_play/requirements.txt
python -m tooling.rl_self_play.train --total-steps 1_000_000 --num-envs 4
# In a second terminal:
tensorboard --logdir tooling/rl_self_play/runs/

Apricot GPU layout

Apricot has 2× NVIDIA RTX 3090 (24 GB each). The typical division:

  • cuda:0 is used by model-boss inference and the commit-message daemon (frequently busy).
  • cuda:1 is free; use it for RL training to avoid contention.

To train on the second GPU:

ssh apricot
cd ~/Code/project-buildspace/magic-civilization   # or wherever the canonical checkout lives
pip install -r tooling/rl_self_play/requirements.txt   # one-time
python -m tooling.rl_self_play.train --device cuda:1 --num-envs 8 --total-steps 5_000_000

--device auto is the safe default for a single-GPU box or local Mac (mps on Apple Silicon). The MlpPolicy this scaffold uses fits in well under 1 GB VRAM, so the bottleneck is the harness CPU subprocesses rather than the GPU. Raise --num-envs (one harness each) to keep the GPU fed.

For evaluation only (no training):

python -m tooling.rl_self_play.evaluate \
  --model-path tooling/rl_self_play/models/duel-v1/best_model.zip \
  --episodes 50

Honest caveats

  • The first win against the baseline will likely take hours to days of training on apricot's 64-core box. Magic Civilization has a large action space and 200-turn horizons; PPO is sample-inefficient compared to AlphaZero, and our action encoding is lossy (see TODO list in encoders.py).
  • Action encoding discards information: per-tile detail isn't in the observation; the action head can only target adjacent hexes, not arbitrary positions. Upgrade to a CNN-on-tile-grid observation plus a hierarchical action head once the basic loop is winning at least occasionally.
  • Single-slot only: this is 1v1 duel-mode for now. The 5-player huge-map setup that p1-22a validated needs MAX_PLAYERS=5 in the env config and 4 frozen opponents — straightforward extension once the duel loop trains.

References