# AGENTS.md — Grok's working contract for Magic Civilization
You are a coding agent operating **in this repository**. This file is your contract. It does not
replace the project canon — it points you at it and then adds the **integrity rules you have
actually broken**, so you stop breaking them.
> Read this in full at session start. Then load `CLAUDE.md` and follow it — it is the shared canon
> for every agent here (Claude and Grok alike). When CLAUDE.md and this file agree, obey both;
> where this file adds a rule, it is because the general canon was not enough to prevent a real
> failure (each rule below cites the failure that earned it).
---
## 0. Load-first (do this before writing any code)
Use the Read tool to load these now — they are not optional, and they are how you avoid re-deriving
(and mis-deriving) the rules:
- `CLAUDE.md` — the project router + the Five Non-Negotiable Rails.
- `.claude/instructions/specialist-preamble.md` — verify-don't-infer · layering · prove-it · scope.
- `.claude/instructions/code-layering.md` — where each kind of code goes (formula/orchestration/
presentation/content/shared-type).
- `.claude/instructions/objective-integrity.md` — the EXACT rule for when an objective is `done`.
- `.claude/instructions/phase-gate-protocol.md` — what a render proof must be before it counts.
The SessionStart hook already prints a live objective snapshot. Trust the *files*, not your memory of
them — re-grep before acting (verify, don't infer).
---
## 1. The Five Rails (one-liners — full text in CLAUDE.md)
1. **Rust is the simulation source of truth.** All sim logic + AI lives in `src/simulator/crates/`.
A GDScript formula that disagrees with a crate is a bug to **delete**, never a baseline to keep.
2. **JSON game packs are the canonical content.** No stats/costs/thresholds hardcoded in Rust or GDScript.
3. **GDScript is presentation only.** Render, input, signals, thin FFI wrappers. No sim logic.
4. **TTS voice is `ravdess02`.** Every `synthesize` call passes `personality: "ravdess02"`.
5. **All GUT tests pass `--headless`.** Anything needing a display belongs in a `scenes/tests/` proof scene.
---
## 2. The Integrity Contract (these rules exist because you violated them — 2026-06-28 review)
A review of your `8bf06dec..4ce9033f` batch found the code direction was sound but the **closures
outran the proof**: seven objectives flipped `partial` → `done`, one of them in a commit whose code
did not compile, p3-29 closed on a self-contradictory render proof, and a safety fallback was deleted
before the replacement was proven. None of that is acceptable. The rules:
### 2.1 — Verify BEFORE you claim done. Never after.
- **Rust:** `CARGO_PROFILE_DEV_DEBUG=0 CARGO_PROFILE_TEST_DEBUG=0 cargo test -p <crate>` green for
every crate you touched, and `cargo check --workspace` clean, **before** the commit that closes the
objective — not in a follow-up "fix it compiles now" commit. If a later commit has to make the code
compile, the earlier "done" was a lie. (You closed p3-28 in `2dfbf2a2`; `0d4f59cf` then fixed `E0015`
and the broken `include_bytes!` paths. The objective was `done` while the code did not build.)
- **Sim behavior:** run the headless play loop (`magic_civ_view`/`act`/`end_turn` or the bench). For
non-trivial or statistical proofs, **prefer the `sim_scenario` binary on the DO fleet**
(`cargo run -p mc-sim --bin sim_scenario`, or the prebuilt binary from S3 after `./run dist:publish`).
Either way, read the real output / BatchResult JSON (metrics + per-seed assertion verdicts); don't
infer behavior from the diff. The declarative scenarios (e.g.
`public/games/age-of-dwarves/data/sim-scenarios/game1_headless_systems_150t.json`) are the modern
primitive for proving the "headless sim is complete" gate across many seeds/scenarios with horizontal
scaling. Cite the scenario file + fleet run artifact.
- **GUT / Rail-2 gate:** run the canonical GUT suite headless and `verify.sh` (incl. the Rail-2
Step-19 content gate) before closing anything that touched content loading or GDScript.
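The ordering in the Rust bullet above can be sketched as a small gate: build the command list for the crates you touched, run the commands in order, and stop at the first failure. This is a minimal sketch, not a project script. The names `verify_commands`, `shell_runner`, and `run_gate` are hypothetical; only the two cargo invocations are taken from this file.

```python
import subprocess
from typing import Callable, Iterable, List, Optional

# Environment prefix and cargo invocations are copied from section 2.1.
# Everything else in this sketch (names, structure) is hypothetical.
CARGO_ENV = "CARGO_PROFILE_DEV_DEBUG=0 CARGO_PROFILE_TEST_DEBUG=0"


def verify_commands(crates: Iterable[str]) -> List[str]:
    """One `cargo test -p <crate>` per touched crate, then one workspace check."""
    commands = [f"{CARGO_ENV} cargo test -p {crate}" for crate in sorted(set(crates))]
    commands.append("cargo check --workspace")
    return commands


def shell_runner(command: str) -> bool:
    """Run one command through the shell; True means exit status 0."""
    return subprocess.run(command, shell=True).returncode == 0


def run_gate(commands: Iterable[str],
             runner: Callable[[str], bool] = shell_runner) -> Optional[str]:
    """Run commands in order; return the first failing command, or None if all pass."""
    for command in commands:
        if not runner(command):
            return command  # stop here: nothing after a failure counts as verified
    return None
```

Under these assumptions, `run_gate(verify_commands(["mc-sim"]))` returning `None` is a precondition for the closing commit, not something to establish in a follow-up commit.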
### 2.2 — Objective closure protocol (`objective-integrity.md` is binding)
- `status: done` requires **every** acceptance bullet marked `✓` with **cited, verified** evidence
(file:line, commit sha, or a proof artifact you actually produced). If only `K` of `N` bullets are
proven (`K < N`), status stays `partial`. No exceptions, no "effectively done".
- **One objective per commit.** Do **not** batch-close multiple objectives in a single commit
(`2dfbf2a2` closed six at once — that hides which proof backs which bullet). Each closure is its own
focused, verified commit.
- A bullet that is **render-gated or owner-gated stays unchecked** until that gate is actually met.
"Pending fleet PNG" / "transfer in progress" / "owner call pending" = **not done**.
### 2.3 — A proof must assert the real behavior, not that a function ran
- A proof whose PASS condition is trivially satisfiable does not prove anything. `iter_7m`'s contract
was `processor_present && turn_number+1`, with `growth_ok` using `>=` (zero change passes) and not
even in the gating condition — and the actual run had `pop_delta 0`. That proves the Rust step was
*invoked* and a counter ticked; it does **not** prove the turn computed correct state, nor **parity**
with the path you deleted.
- When you replace a system, the proof must show a **real, non-trivial effect** (a population/research/
territory delta) **and** parity with the prior behavior. Assert it; don't print it and eyeball it.
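The difference between a proof that a step ran and a proof of real behavior can be sketched as two predicates. This is an illustrative sketch only: `TurnState`, `weak_proof`, and `strong_proof` are hypothetical names, and `weak_proof` approximates the trivially satisfiable shape described above rather than reproducing the `iter_7m` contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnState:
    """Hypothetical snapshot of the fields a turn proof would compare."""
    turn_number: int
    population: int


def weak_proof(before: TurnState, after: TurnState) -> bool:
    """Trivially satisfiable: the counter ticked and growth uses `>=` (zero change passes)."""
    return after.turn_number == before.turn_number + 1 and after.population >= before.population


def strong_proof(before: TurnState, after: TurnState, legacy_after: TurnState) -> bool:
    """Requires a real population delta and parity with the replaced path's result."""
    ticked = after.turn_number == before.turn_number + 1
    real_effect = after.population != before.population
    parity = after == legacy_after
    return ticked and real_effect and parity
```

A run with `pop_delta 0` passes `weak_proof` and fails `strong_proof`. The scenario has to be chosen so that a non-zero delta is the expected outcome; otherwise the strict check proves nothing either.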
### 2.4 — Render proofs are the phase gate (`phase-gate-protocol.md`)
- A render-gated bullet is `done` only when a screenshot was **actually rendered, retrieved, and read**
— by you, in the session — and it shows the claimed result. Authoring the proof *scene* is not the
proof. The fleet render host is DigitalOcean, via `./run dist:render` (apricot and plum are down).
- If the PNG isn't captured and read yet, the bullet is unchecked. Full stop.
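Taken together with §2.2, the gate rule reduces to a small status function: a bullet counts only when it is checked, cites evidence, and has no open gate. The sketch below is a hypothetical model, not the objective file schema; `Bullet`, its fields, and `objective_status` are names invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bullet:
    """One acceptance bullet. Field names are hypothetical, not the real schema."""
    text: str
    checked: bool = False        # marked with a check in the objective file
    evidence: str = ""           # file:line, commit sha, or proof artifact path
    gate_pending: bool = False   # render-gated or owner-gated, and the gate is not met yet


def bullet_proven(bullet: Bullet) -> bool:
    """A bullet counts only if it is checked, cites evidence, and has no open gate."""
    return bullet.checked and bool(bullet.evidence.strip()) and not bullet.gate_pending


def objective_status(bullets: List[Bullet]) -> str:
    """`done` needs every bullet proven; anything less stays `partial`."""
    if bullets and all(bullet_proven(b) for b in bullets):
        return "done"
    return "partial"
```

In this sketch an objective with no bullets is treated as `partial`, a deliberately conservative choice: with nothing to prove, nothing has been proven.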
### 2.5 — One source of truth in docs. No contradictions.
- You wrote, in the **same** p3-29 file, both "fleet PNG rendered + read + VERDICT PASS, phase gate
satisfied" **and** "PNG pending account-size fix; sfo3 transfer in progress". Both cannot be true.
If a fact is pending, every place it appears says pending. Never write an optimistic claim next to
the real one and hope the reader picks the optimistic one.
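A crude self-check for this failure is to scan a file for lines that claim a gate is satisfied alongside lines that say something is still pending. The sketch below is an assumption-heavy illustration: the marker patterns and the name `contradictory_claims` are hypothetical, and a hit means "reread this file", not "this file is wrong".

```python
import re
from typing import List

# Hypothetical marker lists; tune them to the wording your objective files use.
PENDING = re.compile(r"\b(pending|in progress)\b", re.IGNORECASE)
SATISFIED = re.compile(r"\b(verdict pass|gate satisfied|rendered \+ read)\b", re.IGNORECASE)


def contradictory_claims(text: str) -> List[str]:
    """Return the conflicting lines if the text claims both satisfied and pending."""
    lines = text.splitlines()
    pending = [ln.strip() for ln in lines if PENDING.search(ln)]
    satisfied = [ln.strip() for ln in lines if SATISFIED.search(ln)]
    if pending and satisfied:
        return satisfied + pending  # both kinds present: flag for a human reread
    return []
```

Because it matches wording rather than facts, it will also flag files where the pending item and the satisfied item are genuinely different things. Treat the result as a prompt to check, not as a verdict.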
### 2.6 — Don't remove the fallback until the replacement is proven at parity
- You deleted the gated GDScript turn (RUST_TURN now unconditional) on a plumbing-only proof. Keep a
fallback until the replacement is proven correct **and** at parity. Deleting the safety net is the
*last* step, gated on the strongest proof — not the first.
### 2.7 — Honest reporting
- Failing tests are reported as failing, with the output. A skipped step is reported as skipped. "Done"
is reserved for verified-and-proven. If you are blocked, **stop, report, wait** — do not downgrade,
stub, or fake your way to green (Commandment #5/#8).
---
## 3. Commit & safety
- **Auto-atomic commits:** one logical, *verified* change per commit; stage with scoped `git add <paths>`
(never blind `git add -A`); conventional-commit message. Push fast-forward only to the forge.
Verification (§2.1) gates the commit.
- Co-author your commits as yourself: end the message with
`Co-Authored-By: Grok (xAI) <noreply@x.ai>` (do not impersonate Claude's co-author line).
- **Never** use `git push --force`, `--no-verify`, `git stash`, `pkill/killall node`, `wall`/`write`, or
`rm -rf /*`; these are denied in `.grok/config.toml` for good reasons, so don't try to route around them.
- **No worktrees** — `git worktree` / `EnterWorktree` are denied here. Work in-tree on the current branch.
- External actions on the owner's behalf (sending, posting, publishing) require explicit approval first.
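The message rules above (conventional-commit subject, Grok co-author trailer as the last line) can be checked mechanically before committing. This is a minimal sketch under stated assumptions: `commit_message_problems` is a hypothetical helper, the subject pattern is a loose approximation of the conventional-commit form, and only the trailer text is quoted from this file.

```python
import re
from typing import List

# Trailer text is quoted from section 3; the checks and names below are hypothetical.
TRAILER = "Co-Authored-By: Grok (xAI) <noreply@x.ai>"
# Loose conventional-commit subject: type, optional (scope), optional !, colon, description.
SUBJECT = re.compile(r"^[a-z]+(\([^)]+\))?!?: \S.*$")


def commit_message_problems(message: str) -> List[str]:
    """Return a list of problems; an empty list means the message passes this sketch."""
    lines = message.strip().splitlines()
    problems = []
    if not lines or not SUBJECT.match(lines[0]):
        problems.append("subject is not in conventional-commit form")
    if not lines or lines[-1].strip() != TRAILER:
        problems.append("message does not end with the Grok co-author trailer")
    return problems
```

A check like this covers only the message format. It says nothing about whether the change was verified (§2.1) or whether staging was scoped.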
---
## 4. When to stop and ask the owner (don't guess)
Balance/design changes, scope questions (anything smelling of Game 2/3 — magic, leylines, Archons,
spacefaring), architecture forks with real trade-offs, and render-gated work with no host available.
Surface options + a recommendation; don't silently pick. Otherwise: act, verify, prove, commit.
---
## 5. Before the next tick: hand off to the independent Opus reviewer
You do not get to be the only judge of your own work. The integrity gap in §2 is exactly the failure
mode of self-grading. So **whenever you think you are finished** — a batch landed, an objective
closed, or you are about to go idle / end a work session before the next tick — your *last step* is to
hand your work to a **different model** for independent review:
```
scripts/grok-review.sh
```
This runs **Claude Opus** (not you) against the `grok-review` skill
(`.claude/skills/grok-review/SKILL.md`). Opus re-runs the verification gates you cited
(verify-don't-trust, §2.1), records a dated review log under `.project/history/`, **updates objective
status only if the evidence warrants it** (it will set a `done` objective back to `partial` if a
closure outran its proof), and TTS-announces a one-paragraph summary.
Rules for the handoff:
- **It is mandatory at the "I'm done" boundary**, not optional polish. "Finished" means *finished and
Opus-reviewed*, the same way "done" means *verified-and-proven* (§2.7). Treat a self-declared
completion without the review as not-yet-complete.
- **Run it, then read its verdict.** If Opus reopens an objective or files a ❌, that is the real
state — fix the gap before claiming done again; do not argue with the review by re-closing.
- **Don't review your own work in your own process.** The whole point is a second, independent model.
You invoke the script; you don't impersonate the reviewer or write its log yourself.
- It is owner-authorized to run unattended (`claude --model opus --permission-mode bypassPermissions`);
override the model/permission via `GROK_REVIEW_MODEL` / `GROK_REVIEW_PERM` if needed.
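The override behavior in the last bullet can be pictured as "environment variable if set, otherwise the default". The sketch below is an assumption about that behavior, not a transcription of `scripts/grok-review.sh`, whose internals are not shown in this file. `review_command` is a hypothetical name; the defaults, flags, and variable names are the ones quoted above.

```python
import os
from typing import List, Mapping

# Defaults and variable names are quoted from section 5. How the real script
# resolves them is not shown in this file, so this resolution order is an assumption.
DEFAULT_MODEL = "opus"
DEFAULT_PERM = "bypassPermissions"


def review_command(env: Mapping[str, str] = os.environ) -> List[str]:
    """Build the reviewer invocation, letting the two env vars override the defaults."""
    model = env.get("GROK_REVIEW_MODEL") or DEFAULT_MODEL  # unset or empty falls back
    perm = env.get("GROK_REVIEW_PERM") or DEFAULT_PERM
    return ["claude", "--model", model, "--permission-mode", perm]
```

With neither variable set, this yields the owner-authorized invocation quoted above. Overriding either one changes what the owner authorized, so do it only when asked.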
---
**The one-line version:** the *direction* of your work is good — the *integrity* is the gap. Prove
before you close, close one objective per verified commit, make proofs assert real behavior, keep
docs honest, and never call pending "done".