feat(skill): grok-review — Claude Opus independently reviews Grok's work

New skill + wrapper so Grok hands its batches to a different model (Opus) for review. Opus re-runs the gates Grok cited (verify-don't-trust, AGENTS.md §2.1), records a dated .project/history log, updates objective status only when evidence warrants, and TTS-announces a summary (ravdess02 + local say fallback). Wrapper runs 'claude --model opus --permission-mode bypassPermissions -p' so the review runs unattended (owner-authorized 2026-06-28); override via GROK_REVIEW_PERM.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-28 14:33:55 -04:00 · 2026-06-28 14:33:55 -04:00 · b35a3d6a65
commit b35a3d6a65
parent 52c71010c3
2 changed files with 128 additions and 0 deletions
--- a/scripts/grok-review.sh
+++ b/scripts/grok-review.sh
@@ -0,0 +1,45 @@
+#!/usr/bin/env bash
+#
+# grok-review.sh — have Claude Opus independently review Grok's work.
+#
+# Grok invokes this to hand its work to a *different* model (Opus) for review:
+# Opus runs the `grok-review` skill, which re-runs the verification gates Grok
+# cited (verify-don't-trust, AGENTS.md §2.1), records a dated review log under
+# .project/history/, updates objective status only when the evidence warrants,
+# and TTS-announces a one-paragraph summary (ravdess02, local `say` fallback).
+#
+# Usage:
+#   scripts/grok-review.sh                 # review the current window, headless
+#   GROK_REVIEW_MODEL=opus scripts/grok-review.sh
+#   GROK_REVIEW_PERM=acceptEdits scripts/grok-review.sh   # tighter permissions
+#
+# Env:
+#   GROK_REVIEW_MODEL  claude --model value (default: opus)
+#   GROK_REVIEW_PERM   claude --permission-mode (default: bypassPermissions —
+#                      the review must run cargo/git/python/say + MCP unattended)
+#
+# Exit code: 2 if a preflight check fails, otherwise Claude's. Non-zero means
+# the review run failed, not that a defect was found (see log + spoken summary).
+set -euo pipefail
+
+MODEL="${GROK_REVIEW_MODEL:-opus}"
+PERM="${GROK_REVIEW_PERM:-bypassPermissions}"
+
+REPO_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || true)"
+if [[ -z "${REPO_ROOT}" ]]; then
+  echo "grok-review: not inside the magic-civilization git repo" >&2
+  exit 2
+fi
+cd "${REPO_ROOT}"
+
+if ! command -v claude >/dev/null 2>&1; then
+  echo "grok-review: 'claude' CLI not found on PATH" >&2
+  exit 2
+fi
+
+PROMPT='Run the grok-review skill now: independently review the latest Grok-authored work in this repo (commits trailered "Co-Authored-By: Grok (xAI)" plus the uncommitted working tree). Re-run every verification gate Grok cited — do not trust the commit messages. Record a dated review log under .project/history/, update objective status only if the evidence warrants it, commit + push the log, then TTS-announce a one-paragraph spoken summary using personality ravdess02 (fall back to local `say` if the speech MCP is unreachable). Follow AGENTS.md §2 and the skill body exactly.'
+
+exec claude \
+  --model "${MODEL}" \
+  --permission-mode "${PERM}" \
+  -p "${PROMPT}"
--- a/tooling/claude/dot-claude/skills/grok-review/SKILL.md
+++ b/tooling/claude/dot-claude/skills/grok-review/SKILL.md
@@ -0,0 +1,83 @@
+---
+name: grok-review
+description: Independent Claude-Opus review of Grok-authored work. Scans Grok commits + in-flight working tree, RE-RUNS the verification gates Grok cited (verify-don't-trust, AGENTS.md §2.1), records a dated review log, updates objective status only when the evidence warrants it, and TTS-announces a one-paragraph summary. Use when the owner says "review grok's work", on the recurring 30-min review cadence, or when Grok invokes `scripts/grok-review.sh` to have Opus check Grok's batch before/after closure.
+---
+
+# grok-review
+
+The standing job: be the **independent reviewer of Grok's work**. Grok writes code in this repo and
+closes objectives; this skill is the second pair of eyes that re-proves Grok's claims before they are
+trusted. The contract Grok broke once (and the rules that breach earned) lives in `AGENTS.md §2`; this
+skill enforces that contract from the outside.
+
+> Voice: collective ("we", "this node"). Verify, never infer. Cite `file:line` / command output.
+> You are reviewing, not authoring — do **not** rewrite or commit Grok's in-flight code as your own.
+
+## What counts as "Grok's work"
+
+Grok's commits carry the trailer `Co-Authored-By: Grok (xAI)`. Identify them by that, not by author
+(everything lands under the owner's git identity).
+
+```
+git log --grep='Grok (xAI)' --pretty='%h | %s' -30
+```
+
+Also review **uncommitted** working-tree changes (`git status --short`, `git diff`) — Grok is often
+mid-flight. In-flight work is reviewed but is **correctly not "done"**; never flip an objective on it.
+
+## The loop (each invocation)
+
+1. **Scope.** Find the review window: Grok commits since the last review log
+   (`ls .project/history/*grok-work-review*.md | tail -1`), plus the current uncommitted working tree.
+   If nothing new since the last cycle, say so and still record a (short) no-op cycle log.
+2. **Read the claims.** For each in-window Grok commit / `RELEASE_READINESS.md` / objective closure,
+   list the *exact* gates it cites (test counts, `cargo check`, data-validate counts, render proofs).
+3. **Re-run the gates yourself — this is the whole point (AGENTS.md §2.1).** Don't trust the message.
+   - **Rust:** `cd src/simulator && CARGO_PROFILE_DEV_DEBUG=0 CARGO_PROFILE_TEST_DEBUG=0 cargo test -p <crate>`
+     for every crate Grok touched; `cargo check --workspace` for build cleanliness. Use the FULL cargo
+     path (`~/.cargo/bin/cargo`) — background shells lack it on PATH.
+   - **Data / Rail-2:** `python3 tools/validate-game-data.py`.
+   - **Sim behavior:** the headless play loop or the `mc-sim` `sim_scenario` binary (read the real
+     JSON output — metrics + per-seed assertions), not the diff.
+   - **In-flight new code:** at minimum `cargo check -p <crate>` so a non-compiling "almost done" is
+     caught early.
+   - **Render-gated proofs** (GUT-with-display, live renders): only re-run if a render host is
+     reachable; otherwise record as *un-reproduced this cycle*, don't fake it.
+   - When a number you measure disagrees with Grok's, **dig before reporting** — most disagreements
+     are measurement races (a wait-loop exiting on the first fast result line) or a wrong `-p` target,
+     not a real defect. Confirm with a clean re-run, then report the truth either way.
+4. **Verdict per claim.** ✅ reproduced / ⚠ un-reproduced-this-cycle / ❌ contradicted. A ❌ (closure
+   outran proof, code didn't build in the closing commit, self-contradictory proof) is the finding
+   that matters — name the commit and the exact gap, per AGENTS.md §2.
+5. **Update objectives — only when evidence warrants.**
+   - If a ❌ is found on a `done` objective, set it back to `partial` with the cited gap
+     (`mcp__objectives__objective_update_status`), and message the owning team-lead
+     (`mcp__objectives__team_lead_message`).
+   - If everything reproduces, **make no status change** — a clean review *confirms* state; say so.
+   - Regenerate the dashboard if any status moved (`mcp__objectives__dashboard_regen`).
+   - Never close/open an objective on in-flight uncommitted work.
+6. **Record the review log.** Write `.project/history/YYYYMMDD_grok-work-review-NN.md` (NN = next
+   number after the last existing one). Include: scope, a claim→re-run→verdict table, any honest
+   reviewer-side measurement notes, objective-status impact, and a "next cycle" pointer. Commit it
+   atomically (scoped `git add` of just that file) with a conventional message, then `git push` (the
+   forge is usually up; if the push fails, the commit stands locally, so note the failure). Do **not**
+   stage Grok's other uncommitted files.
+7. **TTS-announce the summary.** One concise paragraph — what was reviewed, the verdict, any status
+   change. **Rail 4: every synthesize call passes `personality: "ravdess02"`.** Prefer the
+   `mcp__speech-synthesis__synthesize` MCP; if it is unreachable (apricot down / `/jobs` fetch
+   failed), fall back to local macOS `say` — never go silent. Keep it spoken-friendly (say "two
+   hundred ninety-seven to zero", not "297/0").
+
+## Definition of a good cycle
+
+A cycle is done when: the window is scoped, every reproducible gate Grok cited was actually re-run
+(not assumed), the verdict is recorded in a committed `.project/history/` log, objective status
+reflects the evidence (changed only if warranted), and the spoken summary went out. "Looks fine" is
+not a review — a number you didn't re-run is a number you didn't verify.
+
+## Invocation by Grok
+
+Grok drives this via `scripts/grok-review.sh`, which runs `claude --model opus` headless against this
+skill so **Opus** (not Grok) performs the review — an independent model checking Grok's work. Grok
+should call it after finishing a batch and on the recurring cadence; it must **not** review its own
+work in its own process.
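
Step 6 of the skill names each log `YYYYMMDD_grok-work-review-NN.md`, with NN one past the last existing number. A minimal sketch of that rule follows; it is illustrative only and not part of this commit. `next_review_log` is a hypothetical helper, and it assumes that NN counts across all dates (rather than restarting each day) and is zero-padded to two digits, neither of which the skill states explicitly.

```shell
#!/usr/bin/env bash
# Illustrative sketch, not part of the commit. next_review_log is a hypothetical
# helper. Assumption: NN increases across all dates and is zero-padded to 2 digits.
set -euo pipefail

# Usage: next_review_log <history-dir> <YYYYMMDD>
# The body runs in a subshell, so the nullglob setting does not leak to the caller.
next_review_log() (
  dir="$1"
  today="$2"
  max=0
  shopt -s nullglob   # an empty directory yields zero loop iterations
  for f in "${dir}"/*_grok-work-review-*.md; do
    n="${f##*-}"      # drop everything through the last '-', leaving "NN.md"
    n="${n%.md}"      # drop the extension, leaving "NN"
    [[ "${n}" =~ ^[0-9]+$ ]] || continue   # skip names without a numeric NN
    # 10# forces base 10 so "08" and "09" are not rejected as bad octal.
    if (( 10#${n} > max )); then
      max=$(( 10#${n} ))
    fi
  done
  printf '%s/%s_grok-work-review-%02d.md\n' "${dir}" "${today}" "$(( max + 1 ))"
)
```

For example, `next_review_log .project/history "$(date +%Y%m%d)"` would print the path for today's log. With no prior logs in the directory, numbering starts at 01.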