magicciv/tooling/claude/dot-claude/skills/finish-game-1/SKILL.md
Natalie 78574007e0 docs(agents): require Opus self-review handoff before Grok's next tick
Wire scripts/grok-review.sh into Grok's contract as the mandatory last step at
the 'I'm done' boundary: when Grok thinks a batch/objective/session is finished,
it hands off to an independent model (Claude Opus) that re-runs the cited gates
and updates objective status before the next tick. Self-grading is the §2 failure
mode; a second model closes it.

- AGENTS.md §5: 'Before the next tick — hand off to the independent Opus reviewer'
  (finished == finished AND Opus-reviewed; read the verdict, don't re-close around it).
- finish-game-1 SKILL.md: loop step 9 mirrors the handoff at session end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 14:49:09 -04:00

name: finish-game-1
description: Autonomously drive Game 1 "Age of Dwarves" to completion: pick up the next in-flight objective, do it correctly per the project rails, verify it, commit it, and continue. Replaces the ad-hoc "/loop finish game 1". Use when the user says "finish game 1", "continue game 1", "keep working on the game", "drive game 1 to done", or just "." / "continue" in a Game-1 work session.

finish-game-1

The standing mission: take Game 1 "Age of Dwarves" to done. Don't wait for a trigger or the next tick — this skill exists to keep you going. Work continuously, pick the highest-value next thing, do it correctly, prove it, commit it, move on. Surface to the owner only for genuine decisions; otherwise act.

Definition of done (the bar — read it live, don't assume)

Game 1 is finished when all three hold:

  1. Scope complete — every Early-Access objective in .project/ROADMAP.md + .project/objectives/ is done (not partial/stub), counted per objective-integrity.md. (oos = out-of-scope, doesn't block.)
  2. Headless sim is complete: mc-turn plays full self-play games with ALL systems (climate, ecology/flora/marine/disease, happiness, healing, improvements, recipes, equipment, events, combat, economy). The loop is NOT done while any system present in the live game is missing from the headless sim. Preferred proof tool: the declarative scenarios under public/games/age-of-dwarves/data/sim-scenarios/ (especially game1_headless_systems_150t.json), executed via the mc-sim sim_scenario binary on the DO fleet after ./run dist:publish (the publish step now ships the bin to S3 alongside the .so). Run across many seeds for statistical, assertion-bearing results (JSON with metrics + pass/fail). This is the scalable, horizontal way to get real, non-trivial evidence that the full turn loop exercises everything. Cite the scenario JSON + fleet run output.
  3. Rail-1 architecture unified: the live game is a pure view of getState(). Rust owns state + runs the turn (end_turn); GDScript renders view_json + sends act(). No GDScript-held authoritative state, no GDScript turn orchestration, no inlined formulas. (Tracked by p3-25/p3-29.)

Don't declare done from memory — re-run the orientation and the objective dashboard.
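
The three-part bar above reads as a single conjunction. A minimal sketch in Python, assuming a hypothetical in-memory `Objective` shape: the status names (done, partial, stub, oos) come from the text, but the real source of truth is .project/objectives/ counted per objective-integrity.md, whose file format is not shown here. Gates 2 and 3 are passed in as booleans because their proof lives in fleet runs and code review, not in this function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    ident: str   # illustrative identifier, not a real objective id
    status: str  # "done", "partial", "stub", or "oos" (names from the text)


def scope_complete(objectives):
    """Gate 1: every in-scope objective is done; oos never blocks."""
    return all(o.status == "done" for o in objectives if o.status != "oos")


def blockers(objectives):
    """Identifiers of the objectives that still keep gate 1 open."""
    return [o.ident for o in objectives if o.status not in ("done", "oos")]


def game_done(objectives, headless_sim_complete, rail1_unified):
    """All three gates must hold; any single failure means not done."""
    return scope_complete(objectives) and headless_sim_complete and rail1_unified
```

The inputs should be re-read from the live dashboard on every check rather than cached, which is the point of the "don't declare done from memory" rule.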

The loop (each iteration)

  1. Orient. Run bash .claude/hooks/session-orient.sh --human (or read .project/objectives/ + .project/ROADMAP.md). Find in-flight (partial/stub) objectives and recent commits.
  2. Load the rails. Read .claude/instructions/specialist-preamble.md and code-layering.md. These are non-negotiable; everything below assumes them.
  3. Pick the next work. Highest-value first: finish a partial before starting new; prefer headless-verifiable work; defer render-gated work (UI/live rendering) until a render host (apricot/plum) is available — note it, don't fake the proof.
  4. Classify & place (code-layering): formula→crate, orchestration→mc-turn/the turn, presentation→GDScript (a pure view of getState()), content→JSON, shared type→mc-core. grep the owning crate before computing any game number — call it, don't reimplement.
  5. Implement in the right layer. Dispatch a specialist (or team-lead for multi-domain) when it's a cross-file domain sweep; do single known edits inline.
  6. Verify (mandatory, by type): Rust → cargo test -p <crate> (CARGO_PROFILE_DEV_DEBUG=0 CARGO_PROFILE_TEST_DEBUG=0); sim behavior → headless play loop (view/act/end_turn or the sim_scenario binary from mc-sim on the DO fleet after dist:publish, reading the real JSON output with metrics + assertions); golden moved → re-pin intentionally + re-check determinism; UI/live/rendered → render-proof (phase gate). "Looks done" is not done. For the main "headless sim complete" gate, the canonical scenario run on fleet (multiple seeds) is stronger evidence than a single local bench run.
  7. Commit atomically — one logical change, scoped git add <paths>, conventional message. Don't push (forge is down; the owner's standing call). Update the objective's status + acceptance bullets per objective-integrity.md.
  8. Continue to the next iteration. Keep going until a stop condition below.
  9. Before the next tick — when you think you're finished, hand off to the independent Opus reviewer. When a batch has landed and you are about to go idle / end the session (you believe the current work is done), your last step is to run scripts/grok-review.sh. That launches Claude Opus (a different model) against the grok-review skill: it re-runs the gates you cited, writes a dated .project/history/ review log, updates objective status only if the evidence warrants it (it will reopen a done objective whose closure outran its proof), and TTS-announces a summary. "Finished" = finished and Opus-reviewed — a self-declared completion without the review is not yet complete (binding: AGENTS.md §5). Read the verdict; if it reopens an objective, fix the gap, don't re-close around it.
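
Step 3's ordering (finish in-flight work before starting new, prefer headless-verifiable work, defer render-gated work when no render host is available) can be sketched as a sort key. This is an illustration only: the `Candidate` shape, the "new" status label, and the example names are hypothetical, and "highest-value" still needs human-grade judgment that a sort key does not capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    status: str                # "partial", "stub", or "new" (hypothetical labels)
    headless_verifiable: bool
    render_gated: bool


def pick_next(candidates, render_host_available):
    """Return (chosen, deferred) following the step-3 ordering."""
    # Render-gated work is deferred (and should be noted) when no host is up.
    deferred = [c for c in candidates
                if c.render_gated and not render_host_available]
    ready = [c for c in candidates if c not in deferred]

    def key(c):
        # False sorts before True: in-flight first, then headless-verifiable.
        return (c.status == "new", not c.headless_verifiable)

    ready.sort(key=key)
    return (ready[0] if ready else None), deferred


candidates = [
    Candidate("ui-panel", "partial", False, True),
    Candidate("new-system", "new", True, False),
    Candidate("healing", "partial", True, False),
]
chosen, deferred = pick_next(candidates, render_host_available=False)
```

With no render host, `chosen` is the in-flight, headless-verifiable item and the render-gated one lands in `deferred`, to be reported as blocked-on-host rather than faked.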

When to STOP and ask the owner (don't guess)

Use AskUserQuestion — these are the owner's calls, not yours:

  • Balance / design — e.g. a crate value changes gameplay. Rust drives the number; tuning happens in Rust/JSON with sign-off. Never resolve a balance question by editing GDScript.
  • Scope — anything that smells like Game 2/3 (magic, leylines, Archons), or building a system that's disabled in the live game (parity ≠ gap — don't gold-plate).
  • Architecture forks — a structural choice with real trade-offs (surface the options + a recommendation; don't silently pick).
  • Render-gated work with no host available — report it as blocked-on-host, move to other work.

Otherwise: act. Don't narrate options you won't pursue; don't re-litigate decided things.

Guardrails (the lessons this project paid for)

  • Verify, don't infer — including your own premises. grep/read + cite file:line. A plan built on a remembered shape drifts (this project drifted swap→extract→FFI across three turns; one grep collapsed it). Re-check the shape before planning. Docs/memory drift — code wins.
  • Rust drives everything. A GDScript formula that disagrees with the crate is a bug to delete, never a baseline to reconcile. The UI is a pure view of getState().
  • Eliminate, don't fix, the orchestrator. When you find logic in GDScript, prefer deleting the path (Rust computes, UI renders getState()) over making GDScript "call Rust correctly".
  • No stubs, no fakes, no fabricated "done". Production code on the first pass; if blocked, STOP → report → wait. Report outcomes faithfully (failing tests stay reported as failing).
  • Don't gold-plate. Build to the objective's acceptance bullets, not beyond.

Reporting

After each meaningful chunk: a tight status — what landed, the proof, the commit, what's next. When you stop, say why (decision needed / blocked on host / done) in one line. Don't pad.

Announce specialist lifecycle (the "Orchestration transparency" convention in agents-task-map.md): when you dispatch, emit a start line — ▶ Dispatching [parallel|sequential] (N): <agent>(task), … — and a finish line per specialist — ✓ <agent> — <outcome> · <proof> / ✗ <agent> — <blocker>. Say "parallel" only when you actually send them in one message. This is how the user sees the orchestration happening + verifies parallelism. Reserve TTS (ravdess02) / PushNotification for milestone / decision / blocker — not per-dispatch (that's text).
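
As an illustration of the start and finish lines described above, here is a small Python formatter. The line shapes follow the convention as quoted in this section; the function names and the agent/task strings are hypothetical, and agents-task-map.md remains the authority on the exact format.

```python
DASH = "\u2014"  # the long dash used in the convention's finish lines


def start_line(mode, dispatches):
    """dispatches is a list of (agent, task) pairs; mode names how they are sent."""
    if mode not in ("parallel", "sequential"):
        raise ValueError(f"unknown mode: {mode}")
    items = ", ".join(f"{agent}({task})" for agent, task in dispatches)
    return f"▶ Dispatching {mode} ({len(dispatches)}): {items}"


def finish_line(agent, outcome=None, proof=None, blocker=None):
    """One line per specialist: success carries outcome + proof, failure a blocker."""
    if blocker is not None:
        return f"✗ {agent} {DASH} {blocker}"
    return f"✓ {agent} {DASH} {outcome} · {proof}"


print(start_line("parallel", [("rust-agent", "port formula"), ("sim-agent", "run scenario")]))
print(finish_line("rust-agent", outcome="formula moved to crate", proof="cargo test green"))
print(finish_line("sim-agent", blocker="no fleet access"))
```

Passing "parallel" is only honest when the dispatches really go out in one message; the formatter cannot check that.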

Simulation testing primitive (new): the sim_scenario tool + the declarative JSONs in the game data pack are now the canonical way to satisfy the "headless sim complete" gate and to verify sim behavior in this loop. Prefer fleet runs (after dist:publish) so the proofs are horizontal and statistical.
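
To make "statistical, assertion-bearing" concrete, here is a minimal Python sketch that folds per-seed results into one summary. The JSON schema used here (`seed`, `passed`, `metrics`) is an assumption for illustration; the actual output format of sim_scenario is not shown in this document and should be read from a real fleet run.

```python
import json
from statistics import mean


def summarize(run_json_texts):
    """Aggregate per-seed results; each text is one run's JSON (hypothetical schema)."""
    runs = [json.loads(text) for text in run_json_texts]
    failed = sorted(r["seed"] for r in runs if not r["passed"])
    metric_names = sorted({name for r in runs for name in r["metrics"]})
    means = {
        name: mean(r["metrics"][name] for r in runs if name in r["metrics"])
        for name in metric_names
    }
    return {
        "seeds": len(runs),
        "failed_seeds": failed,
        # An empty batch is not evidence, so it never counts as a pass.
        "all_passed": bool(runs) and not failed,
        "metric_means": means,
    }
```

A summary like this supplements the cited scenario JSON and raw fleet output; it does not replace them.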