From b35a3d6a651ea1d139ba75acf58e57c39044bd26 Mon Sep 17 00:00:00 2001
From: Natalie
Date: Sun, 28 Jun 2026 14:33:55 -0400
Subject: [PATCH] =?UTF-8?q?feat(skill):=20grok-review=20=E2=80=94=20Claude?=
 =?UTF-8?q?=20Opus=20independently=20reviews=20Grok's=20work?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

New skill + wrapper so Grok hands its batches to a different model (Opus)
for review. Opus re-runs the gates Grok cited (verify-don't-trust,
AGENTS.md §2.1), records a dated .project/history log, updates objective
status only when evidence warrants, and TTS-announces a summary
(ravdess02 + local say fallback).

Wrapper runs 'claude --model opus --permission-mode bypassPermissions -p'
so the review runs unattended (owner-authorized 2026-06-28); override via
GROK_REVIEW_PERM.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 scripts/grok-review.sh                        | 45 ++++++++++
 .../dot-claude/skills/grok-review/SKILL.md    | 83 +++++++++++++++++++
 2 files changed, 128 insertions(+)
 create mode 100755 scripts/grok-review.sh
 create mode 100644 tooling/claude/dot-claude/skills/grok-review/SKILL.md

diff --git a/scripts/grok-review.sh b/scripts/grok-review.sh
new file mode 100755
index 00000000..1dc46c32
--- /dev/null
+++ b/scripts/grok-review.sh
@@ -0,0 +1,45 @@
+#!/usr/bin/env bash
+#
+# grok-review.sh — have Claude Opus independently review Grok's work.
+#
+# Grok invokes this to hand its work to a *different* model (Opus) for review:
+# Opus runs the `grok-review` skill, which re-runs the verification gates Grok
+# cited (verify-don't-trust, AGENTS.md §2.1), records a dated review log under
+# .project/history/, updates objective status only when the evidence warrants,
+# and TTS-announces a one-paragraph summary (ravdess02, local `say` fallback).
+#
+# Usage:
+#   scripts/grok-review.sh                 # review the current window, headless
+#   GROK_REVIEW_MODEL=opus scripts/grok-review.sh
+#   GROK_REVIEW_PERM=acceptEdits scripts/grok-review.sh   # tighter permissions
+#
+# Env:
+#   GROK_REVIEW_MODEL   claude --model value (default: opus)
+#   GROK_REVIEW_PERM    claude --permission-mode (default: bypassPermissions —
+#                       the review must run cargo/git/python/say + MCP unattended)
+#
+# Exit code is Claude's: non-zero means the review run itself failed (not that a
+# defect was found — defects are reported in the log + spoken summary).
+set -euo pipefail
+
+MODEL="${GROK_REVIEW_MODEL:-opus}"
+PERM="${GROK_REVIEW_PERM:-bypassPermissions}"
+
+REPO_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || true)"
+if [[ -z "${REPO_ROOT}" ]]; then
+  echo "grok-review: not inside the magic-civilization git repo" >&2
+  exit 2
+fi
+cd "${REPO_ROOT}"
+
+if ! command -v claude >/dev/null 2>&1; then
+  echo "grok-review: 'claude' CLI not found on PATH" >&2
+  exit 2
+fi
+
+PROMPT='Run the grok-review skill now: independently review the latest Grok-authored work in this repo (commits trailered "Co-Authored-By: Grok (xAI)" plus the uncommitted working tree). Re-run every verification gate Grok cited — do not trust the commit messages. Record a dated review log under .project/history/, update objective status only if the evidence warrants it, commit + push the log, then TTS-announce a one-paragraph spoken summary using personality ravdess02 (fall back to local `say` if the speech MCP is unreachable). Follow AGENTS.md §2 and the skill body exactly.'
+
+exec claude \
+  --model "${MODEL}" \
+  --permission-mode "${PERM}" \
+  -p "${PROMPT}"
diff --git a/tooling/claude/dot-claude/skills/grok-review/SKILL.md b/tooling/claude/dot-claude/skills/grok-review/SKILL.md
new file mode 100644
index 00000000..08e5acc3
--- /dev/null
+++ b/tooling/claude/dot-claude/skills/grok-review/SKILL.md
@@ -0,0 +1,83 @@
+---
+name: grok-review
+description: Independent Claude-Opus review of Grok-authored work. Scans Grok commits + in-flight working tree, RE-RUNS the verification gates Grok cited (verify-don't-trust, AGENTS.md §2.1), records a dated review log, updates objective status only when the evidence warrants it, and TTS-announces a one-paragraph summary. Use when the owner says "review grok's work", on the recurring 30-min review cadence, or when Grok invokes `scripts/grok-review.sh` to have Opus check its own batch before/after closure.
+---
+
+# grok-review
+
+The standing job: be the **independent reviewer of Grok's work**. Grok writes code in this repo and
+closes objectives; this skill is the second pair of eyes that re-proves Grok's claims before they are
+trusted. The contract Grok broke once (and the rules that earned it) live in `AGENTS.md §2` — this
+skill enforces that contract from the outside.
+
+> Voice: collective ("we", "this node"). Verify, never infer. Cite `file:line` / command output.
+> You are reviewing, not authoring — do **not** rewrite or commit Grok's in-flight code as your own.
+
+## What counts as "Grok's work"
+
+Grok's commits carry the trailer `Co-Authored-By: Grok (xAI)`. Identify them by that, not by author
+(everything lands under the owner's git identity).
+
+```
+git log --grep='Grok (xAI)' --pretty='%h | %s' -30
+```
+
+Also review **uncommitted** working-tree changes (`git status --short`, `git diff`) — Grok is often
+mid-flight. In-flight work is reviewed but is **correctly not "done"**; never flip an objective on it.
+
+## The loop (each invocation)
+
+1. **Scope.** Find the review window: Grok commits since the last review log
+   (`ls .project/history/*grok-work-review*.md | tail -1`), plus the current uncommitted working tree.
+   If nothing new since the last cycle, say so and still record a (short) no-op cycle log.
+2. **Read the claims.** For each in-window Grok commit / `RELEASE_READINESS.md` / objective closure,
+   list the *exact* gates it cites (test counts, `cargo check`, data-validate counts, render proofs).
+3. **Re-run the gates yourself — this is the whole point (AGENTS.md §2.1).** Don't trust the message.
+   - **Rust:** `cd src/simulator && CARGO_PROFILE_DEV_DEBUG=0 CARGO_PROFILE_TEST_DEBUG=0 cargo test -p `
+     for every crate Grok touched; `cargo check --workspace` for build cleanliness. Use the FULL cargo
+     path (`~/.cargo/bin/cargo`) — background shells lack it on PATH.
+   - **Data / Rail-2:** `python3 tools/validate-game-data.py`.
+   - **Sim behavior:** the headless play loop or the `mc-sim` `sim_scenario` binary (read the real
+     JSON output — metrics + per-seed assertions), not the diff.
+   - **In-flight new code:** at minimum `cargo check -p ` so a non-compiling "almost done" is
+     caught early.
+   - **Render-gated proofs** (GUT-with-display, live renders): only re-run if a render host is
+     reachable; otherwise record as *un-reproduced this cycle*, don't fake it.
+   - When a number you measure disagrees with Grok's, **dig before reporting** — most disagreements
+     are measurement races (a wait-loop exiting on the first fast result line) or a wrong `-p` target,
+     not a real defect. Confirm with a clean re-run, then report the truth either way.
+4. **Verdict per claim.** ✅ reproduced / ⚠ un-reproduced-this-cycle / ❌ contradicted. A ❌ (closure
+   outran proof, code didn't build in the closing commit, self-contradictory proof) is the finding
+   that matters — name the commit and the exact gap, per AGENTS.md §2.
+5. **Update objectives — only when evidence warrants.**
+   - If a ❌ is found on a `done` objective, set it back to `partial` with the cited gap
+     (`mcp__objectives__objective_update_status`), and message the owning team-lead
+     (`mcp__objectives__team_lead_message`).
+   - If everything reproduces, **make no status change** — a clean review *confirms* state; say so.
+   - Regenerate the dashboard if any status moved (`mcp__objectives__dashboard_regen`).
+   - Never close/open an objective on in-flight uncommitted work.
+6. **Record the review log.** Write `.project/history/YYYYMMDD_grok-work-review-NN.md` (NN = next
+   number after the last existing one). Include: scope, a claim→re-run→verdict table, any honest
+   reviewer-side measurement notes, objective-status impact, and a "next cycle" pointer. Commit it
+   atomically (scoped `git add` of just that file) with a conventional message; `git push` (forge is
+   usually up — if the push fails, commit stands locally, note it). Do **not** stage Grok's other
+   uncommitted files.
+7. **TTS-announce the summary.** One concise paragraph — what was reviewed, the verdict, any status
+   change. **Rail 4: every synthesize call passes `personality: "ravdess02"`.** Prefer the
+   `mcp__speech-synthesis__synthesize` MCP; if it is unreachable (apricot down / `/jobs` fetch
+   failed), fall back to local macOS `say` — never go silent. Keep it spoken-friendly (say "two
+   hundred ninety-seven to zero", not "297/0").
+
+## Definition of a good cycle
+
+A cycle is done when: the window is scoped, every reproducible gate Grok cited was actually re-run
+(not assumed), the verdict is recorded in a committed `.project/history/` log, objective status
+reflects the evidence (changed only if warranted), and the spoken summary went out. "Looks fine" is
+not a review — a number you didn't re-run is a number you didn't verify.
+
+## Invocation by Grok
+
+Grok drives this via `scripts/grok-review.sh`, which runs `claude --model opus` headless against this
+skill so **Opus** (not Grok) performs the review — an independent model checking Grok's work. Grok
+should call it after finishing a batch and on the recurring cadence; it must **not** review its own
+work in its own process.
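
Reviewer note on step 6 of the loop: the skill names the log `.project/history/YYYYMMDD_grok-work-review-NN.md` with "NN = next number after the last existing one", but leaves the numbering to the reviewing model. Below is a minimal sketch of how that number could be derived deterministically. It is not part of this patch: `next_review_log` is a hypothetical helper, and it assumes NN is one running counter across all dates (the skill text does not say whether it resets per day) and is zero-padded to two digits.

```bash
#!/usr/bin/env bash
# Sketch only: derive the next review-log path for step 6 of the skill.
# `next_review_log` is a hypothetical helper, not something this patch adds.
set -euo pipefail

next_review_log() {
  local history_dir="$1"   # e.g. .project/history
  local today="$2"         # YYYYMMDD, e.g. "$(date +%Y%m%d)"
  local last=0 path name n

  # Keep the highest NN found, regardless of the date prefix
  # (assumption: NN is a single running counter).
  for path in "${history_dir}"/*_grok-work-review-*.md; do
    [[ -e "${path}" ]] || continue            # the glob matched nothing
    name="${path##*/}"                        # strip the directory
    n="${name##*grok-work-review-}"           # leaves "NN.md"
    n="${n%.md}"                              # leaves "NN"
    [[ "${n}" =~ ^[0-9]+$ ]] || continue      # skip names without a numeric suffix
    if (( 10#${n} > last )); then             # 10# avoids octal parsing of "08"/"09"
      last=$(( 10#${n} ))
    fi
  done

  printf '%s/%s_grok-work-review-%02d.md\n' "${history_dir}" "${today}" "$(( last + 1 ))"
}
```

A call such as `next_review_log .project/history "$(date +%Y%m%d)"` would then give the path to write and to pass to the scoped `git add`. Taking the maximum NN rather than `ls | tail -1` keeps the result independent of how the date prefixes happen to sort.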