feat(skill): grok-review — Claude Opus independently reviews Grok's work

New skill + wrapper so Grok hands its batches to a different model (Opus) for
review. Opus re-runs the gates Grok cited (verify-don't-trust, AGENTS.md §2.1),
records a dated .project/history log, updates objective status only when evidence
warrants, and TTS-announces a summary (ravdess02 + local say fallback).

Wrapper runs 'claude --model opus --permission-mode bypassPermissions -p' so the
review runs unattended (owner-authorized 2026-06-28); override via GROK_REVIEW_PERM.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Natalie 2026-06-28 14:33:55 -04:00
parent 52c71010c3
commit b35a3d6a65
2 changed files with 128 additions and 0 deletions

scripts/grok-review.sh Executable file

@ -0,0 +1,45 @@
#!/usr/bin/env bash
#
# grok-review.sh — have Claude Opus independently review Grok's work.
#
# Grok invokes this to hand its work to a *different* model (Opus) for review:
# Opus runs the `grok-review` skill, which re-runs the verification gates Grok
# cited (verify-don't-trust, AGENTS.md §2.1), records a dated review log under
# .project/history/, updates objective status only when the evidence warrants,
# and TTS-announces a one-paragraph summary (ravdess02, local `say` fallback).
#
# Usage:
#   scripts/grok-review.sh                              # review the current window, headless
#   GROK_REVIEW_MODEL=opus scripts/grok-review.sh
#   GROK_REVIEW_PERM=acceptEdits scripts/grok-review.sh # tighter permissions
#
# Env:
# GROK_REVIEW_MODEL claude --model value (default: opus)
# GROK_REVIEW_PERM claude --permission-mode (default: bypassPermissions —
# the review must run cargo/git/python/say + MCP unattended)
#
# Exit code is Claude's: non-zero means the review run itself failed (not that a
# defect was found — defects are reported in the log + spoken summary).
set -euo pipefail
MODEL="${GROK_REVIEW_MODEL:-opus}"
PERM="${GROK_REVIEW_PERM:-bypassPermissions}"
REPO_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || true)"
if [[ -z "${REPO_ROOT}" ]]; then
  echo "grok-review: not inside the magic-civilization git repo" >&2
  exit 2
fi
cd "${REPO_ROOT}"
if ! command -v claude >/dev/null 2>&1; then
  echo "grok-review: 'claude' CLI not found on PATH" >&2
  exit 2
fi
PROMPT='Run the grok-review skill now: independently review the latest Grok-authored work in this repo (commits trailered "Co-Authored-By: Grok (xAI)" plus the uncommitted working tree). Re-run every verification gate Grok cited — do not trust the commit messages. Record a dated review log under .project/history/, update objective status only if the evidence warrants it, commit + push the log, then TTS-announce a one-paragraph spoken summary using personality ravdess02 (fall back to local `say` if the speech MCP is unreachable). Follow AGENTS.md §2 and the skill body exactly.'
exec claude \
  --model "${MODEL}" \
  --permission-mode "${PERM}" \
  -p "${PROMPT}"


@ -0,0 +1,83 @@
---
name: grok-review
description: Independent Claude-Opus review of Grok-authored work. Scans Grok commits + in-flight working tree, RE-RUNS the verification gates Grok cited (verify-don't-trust, AGENTS.md §2.1), records a dated review log, updates objective status only when the evidence warrants it, and TTS-announces a one-paragraph summary. Use when the owner says "review grok's work", on the recurring 30-min review cadence, or when Grok invokes `scripts/grok-review.sh` to have Opus check Grok's batch before/after closure.
---
# grok-review
The standing job: be the **independent reviewer of Grok's work**. Grok writes code in this repo and
closes objectives; this skill is the second pair of eyes that re-proves Grok's claims before they are
trusted. The contract Grok broke once (and the rules that breach led to) live in `AGENTS.md §2`; this
skill enforces that contract from the outside.
> Voice: collective ("we", "this node"). Verify, never infer. Cite `file:line` / command output.
> You are reviewing, not authoring — do **not** rewrite or commit Grok's in-flight code as your own.
## What counts as "Grok's work"
Grok's commits carry the trailer `Co-Authored-By: Grok (xAI)`. Identify them by that, not by author
(everything lands under the owner's git identity).
```
git log --grep='Grok (xAI)' --pretty='%h | %s' -30
```
Also review **uncommitted** working-tree changes (`git status --short`, `git diff`) — Grok is often
mid-flight. In-flight work is reviewed but is **correctly not "done"**; never flip an objective on it.
## The loop (each invocation)
1. **Scope.** Find the review window: Grok commits since the last review log
(`ls .project/history/*grok-work-review*.md | tail -1`), plus the current uncommitted working tree.
If nothing new since the last cycle, say so and still record a (short) no-op cycle log.
2. **Read the claims.** For each in-window Grok commit / `RELEASE_READINESS.md` / objective closure,
list the *exact* gates it cites (test counts, `cargo check`, data-validate counts, render proofs).
3. **Re-run the gates yourself — this is the whole point (AGENTS.md §2.1).** Don't trust the message.
- **Rust:** `cd src/simulator && CARGO_PROFILE_DEV_DEBUG=0 CARGO_PROFILE_TEST_DEBUG=0 cargo test -p <crate>`
for every crate Grok touched; `cargo check --workspace` for build cleanliness. Use the FULL cargo
path (`~/.cargo/bin/cargo`) — background shells lack it on PATH.
- **Data / Rail-2:** `python3 tools/validate-game-data.py`.
- **Sim behavior:** the headless play loop or the `mc-sim` `sim_scenario` binary (read the real
JSON output — metrics + per-seed assertions), not the diff.
- **In-flight new code:** at minimum `cargo check -p <crate>` so a non-compiling "almost done" is
caught early.
- **Render-gated proofs** (GUT-with-display, live renders): only re-run if a render host is
reachable; otherwise record as *un-reproduced this cycle*, don't fake it.
- When a number you measure disagrees with Grok's, **dig before reporting** — most disagreements
are measurement races (a wait-loop exiting on the first fast result line) or a wrong `-p` target,
not a real defect. Confirm with a clean re-run, then report the truth either way.
4. **Verdict per claim.** ✅ reproduced / ⚠ un-reproduced-this-cycle / ❌ contradicted. A ❌ (closure
outran proof, code didn't build in the closing commit, self-contradictory proof) is the finding
that matters — name the commit and the exact gap, per AGENTS.md §2.
5. **Update objectives — only when evidence warrants.**
- If a ❌ is found on a `done` objective, set it back to `partial` with the cited gap
(`mcp__objectives__objective_update_status`), and message the owning team-lead
(`mcp__objectives__team_lead_message`).
- If everything reproduces, **make no status change** — a clean review *confirms* state; say so.
- Regenerate the dashboard if any status moved (`mcp__objectives__dashboard_regen`).
- Never close/open an objective on in-flight uncommitted work.
6. **Record the review log.** Write `.project/history/YYYYMMDD_grok-work-review-NN.md` (NN = next
number after the last existing one). Include: scope, a claim→re-run→verdict table, any honest
reviewer-side measurement notes, objective-status impact, and a "next cycle" pointer. Commit it
atomically (scoped `git add` of just that file) with a conventional message, then `git push` (the forge is
usually up; if the push fails, the commit stands locally, so note the failure). Do **not** stage Grok's other
uncommitted files.
7. **TTS-announce the summary.** One concise paragraph — what was reviewed, the verdict, any status
change. **Rail 4: every synthesize call passes `personality: "ravdess02"`.** Prefer the
`mcp__speech-synthesis__synthesize` MCP; if it is unreachable (apricot down / `/jobs` fetch
failed), fall back to local macOS `say` — never go silent. Keep it spoken-friendly (say "two
hundred ninety-seven to zero", not "297/0").
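The mechanical parts of steps 3 and 6 can be sketched as two small shell helpers. This is a minimal sketch, not part of the skill. `sum_test_results` and `next_review_log` are hypothetical names. The first assumes the usual `test result: ... N passed; M failed; ...` summary line that `cargo test` prints once per test binary. The second assumes `NN` is a zero-padded counter that keeps increasing across dates; the step only says "next number after the last existing one", so treat that as an assumption.

```shell
#!/usr/bin/env bash

# Hypothetical helper: total passed/failed over ALL "test result:" lines on
# stdin, so a crate with several test binaries is not judged by its first line.
sum_test_results() {
  awk '
    /^test result:/ {
      for (i = 2; i <= NF; i++) {
        if ($i == "passed;") passed += $(i - 1)
        if ($i == "failed;") failed += $(i - 1)
      }
    }
    END { printf "%d passed, %d failed\n", passed, failed }
  '
}

# Hypothetical helper: print the next review-log path in DIR for date TODAY.
# Assumes NN is a zero-padded counter shared across dates.
next_review_log() {
  local dir="$1" today="${2:-$(date +%Y%m%d)}"
  local last=0 f n
  for f in "$dir"/*grok-work-review-*.md; do
    [[ -e "$f" ]] || continue            # glob matched nothing
    n="${f##*grok-work-review-}"         # keep what follows the prefix: "NN.md"
    n="${n%.md}"                         # drop the extension: "NN"
    [[ "$n" =~ ^[0-9]+$ ]] || continue   # ignore names that do not end in a number
    if (( 10#$n > last )); then
      last=$(( 10#$n ))                  # 10# forces base 10 so "08" is not read as octal
    fi
  done
  printf '%s/%s_grok-work-review-%02d.md\n' "$dir" "$today" "$(( last + 1 ))"
}
```

Piping the complete output of a finished `cargo test` run through `sum_test_results` gives one total to compare against the claimed count, which avoids the first-result-line race described in step 3. `next_review_log .project/history` prints the path to write the new log to.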
## Definition of a good cycle
A cycle is done when: the window is scoped, every reproducible gate Grok cited was actually re-run
(not assumed), the verdict is recorded in a committed `.project/history/` log, objective status
reflects the evidence (changed only if warranted), and the spoken summary went out. "Looks fine" is
not a review — a number you didn't re-run is a number you didn't verify.
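The three verdicts, and the rule that an unmeasured number is an unverified one, can be sketched as a small comparison. This is illustrative only: `verdict` is a hypothetical helper, and it compares the claimed and measured values as plain strings.

```shell
#!/usr/bin/env bash

# Hypothetical helper: verdict CLAIMED [MEASURED]
# An empty or missing MEASURED means the gate was not re-run this cycle,
# so the claim can never come out as "reproduced".
verdict() {
  local claimed="$1" measured="${2:-}"
  if [[ -z "$measured" ]]; then
    echo "un-reproduced-this-cycle"
  elif [[ "$measured" == "$claimed" ]]; then
    echo "reproduced"
  else
    echo "contradicted"
  fi
}
```

A `contradicted` result from a first run is a prompt to re-measure cleanly before it goes in the log, as step 3 requires.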
## Invocation by Grok
Grok drives this via `scripts/grok-review.sh`, which runs `claude --model opus` headless against this
skill so **Opus** (not Grok) performs the review — an independent model checking Grok's work. Grok
should call it after finishing a batch and on the recurring cadence; it must **not** review its own
work in its own process.
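From the caller's side, the wrapper's exit code separates a failed review run from a review that found defects. The following is a minimal sketch of how a caller might interpret it, assuming the behavior documented in the script header. `describe_review_exit` is a hypothetical helper. Exit code 2 is what the wrapper returns for its own precondition checks, though the `claude` CLI could in principle return 2 as well.

```shell
#!/usr/bin/env bash

# Hypothetical helper: turn the wrapper's exit code into a one-line reading.
describe_review_exit() {
  case "$1" in
    0) echo "review run completed; read the log for findings" ;;
    2) echo "wrapper precondition failed (not in a git repo, or claude not on PATH)" ;;
    *) echo "review run itself failed (exit $1)" ;;
  esac
}

# Example call site (commented out so the sketch has no side effects):
#   rc=0; scripts/grok-review.sh || rc=$?
#   describe_review_exit "$rc"
```

A zero exit never means "no defects": findings are reported only in the review log and the spoken summary.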