New skill + wrapper so Grok hands its batches to a different model (Opus) for review. Opus re-runs the gates Grok cited (verify-don't-trust, AGENTS.md §2.1), records a dated .project/history log, updates objective status only when evidence warrants, and TTS-announces a summary (ravdess02 + local say fallback). Wrapper runs 'claude --model opus --permission-mode bypassPermissions -p' so the review runs unattended (owner-authorized 2026-06-28); override via GROK_REVIEW_PERM. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| name | description |
|---|---|
| grok-review | Independent Claude-Opus review of Grok-authored work. Scans Grok commits + in-flight working tree, RE-RUNS the verification gates Grok cited (verify-don't-trust, AGENTS.md §2.1), records a dated review log, updates objective status only when the evidence warrants it, and TTS-announces a one-paragraph summary. Use when the owner says "review grok's work", on the recurring 30-min review cadence, or when Grok invokes `scripts/grok-review.sh` to have Opus check its own batch before/after closure. |
grok-review
The standing job: be the independent reviewer of Grok's work. Grok writes code in this repo and
closes objectives; this skill is the second pair of eyes that re-proves Grok's claims before they are
trusted. The contract Grok broke once (and the rules that earned it) live in AGENTS.md §2 — this
skill enforces that contract from the outside.
Voice: collective ("we", "this node"). Verify, never infer. Cite `file:line` / command output.
You are reviewing, not authoring: do not rewrite or commit Grok's in-flight code as your own.
What counts as "Grok's work"
Grok's commits carry the trailer `Co-Authored-By: Grok (xAI)`. Identify them by that, not by author
(everything lands under the owner's git identity).
git log --grep='Grok (xAI)' --pretty='%h | %s' -30
Also review uncommitted working-tree changes (`git status --short`, `git diff`), since Grok is often
mid-flight. In-flight work is reviewed but is correctly not "done"; never flip an objective on it.
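The `--grep` filter above matches the string anywhere in a commit message, so a review-log commit that merely mentions Grok would also be listed. A minimal sketch of a stricter check follows, assuming the trailer always starts its own line. The helper names `has_grok_trailer`, `list_grok_commits`, and `has_inflight_work` are hypothetical and not part of the repo.

```shell
# Hypothetical helper: reads a commit message on stdin and succeeds only
# when a line starts with the Grok trailer.
has_grok_trailer() {
  grep -q '^Co-Authored-By: Grok (xAI)'
}

# Hypothetical helper: among the last N commits that mention Grok, print
# only those whose message really carries the trailer.
list_grok_commits() {
  limit=${1:-30}
  git log --grep='Grok (xAI)' --pretty='%h' "-$limit" | while read -r sha; do
    if git log -1 --format=%B "$sha" | has_grok_trailer; then
      git log -1 --pretty='%h | %s' "$sha"
    fi
  done
}

# Hypothetical helper: succeeds when the working tree has uncommitted
# changes, i.e. there is in-flight work to review but not to close on.
has_inflight_work() {
  [ -n "$(git status --short)" ]
}
```

Run from the repo root, `list_grok_commits 30` prints the same `%h | %s` lines as the command above, minus any false positives.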
The loop (each invocation)
- Scope. Find the review window: Grok commits since the last review log (`ls .project/history/*grok-work-review*.md | tail -1`), plus the current uncommitted working tree. If nothing new since the last cycle, say so and still record a (short) no-op cycle log.
- Read the claims. For each in-window Grok commit / `RELEASE_READINESS.md` / objective closure, list the exact gates it cites (test counts, `cargo check`, data-validate counts, render proofs).
- Re-run the gates yourself: this is the whole point (AGENTS.md §2.1). Don't trust the message.
  - Rust: `cd src/simulator && CARGO_PROFILE_DEV_DEBUG=0 CARGO_PROFILE_TEST_DEBUG=0 cargo test -p <crate>` for every crate Grok touched; `cargo check --workspace` for build cleanliness. Use the FULL cargo path (`~/.cargo/bin/cargo`), since background shells lack it on PATH.
  - Data / Rail-2: `python3 tools/validate-game-data.py`.
  - Sim behavior: the headless play loop or the `mc-sim` `sim_scenario` binary (read the real JSON output, metrics + per-seed assertions), not the diff.
  - In-flight new code: at minimum `cargo check -p <crate>` so a non-compiling "almost done" is caught early.
  - Render-gated proofs (GUT-with-display, live renders): only re-run if a render host is reachable; otherwise record as un-reproduced this cycle, don't fake it.
  - When a number you measure disagrees with Grok's, dig before reporting: most disagreements are measurement races (a wait-loop exiting on the first fast result line) or a wrong `-p` target, not a real defect. Confirm with a clean re-run, then report the truth either way.
- Verdict per claim. ✅ reproduced / ⚠ un-reproduced-this-cycle / ❌ contradicted. A ❌ (closure outran proof, code didn't build in the closing commit, self-contradictory proof) is the finding that matters — name the commit and the exact gap, per AGENTS.md §2.
- Update objectives, only when evidence warrants.
  - If a ❌ is found on a `done` objective, set it back to `partial` with the cited gap (`mcp__objectives__objective_update_status`), and message the owning team-lead (`mcp__objectives__team_lead_message`).
  - If everything reproduces, make no status change: a clean review confirms state; say so.
  - Regenerate the dashboard if any status moved (`mcp__objectives__dashboard_regen`).
  - Never close/open an objective on in-flight uncommitted work.
- Record the review log. Write `.project/history/YYYYMMDD_grok-work-review-NN.md` (NN = next number after the last existing one). Include: scope, a claim→re-run→verdict table, any honest reviewer-side measurement notes, objective-status impact, and a "next cycle" pointer. Commit it atomically (scoped `git add` of just that file) with a conventional message; `git push` (forge is usually up; if the push fails, the commit stands locally, note it). Do not stage Grok's other uncommitted files.
- TTS-announce the summary. One concise paragraph: what was reviewed, the verdict, any status change. Rail 4: every synthesize call passes `personality: "ravdess02"`. Prefer the `mcp__speech-synthesis__synthesize` MCP; if it is unreachable (apricot down / `/jobs` fetch failed), fall back to local macOS `say`; never go silent. Keep it spoken-friendly (say "two hundred ninety-seven to zero", not "297/0").
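The log-naming rule in the loop above (NN = next number after the last existing one) can be sketched as a small shell helper. This is a sketch under assumptions: that NN is a single running counter across all dates, and that it is zero-padded to two digits. The function name `next_review_log` is hypothetical.

```shell
# Hypothetical helper: print the path of the next review log.
#   $1 = history directory (e.g. .project/history)
#   $2 = date stamp, YYYYMMDD (defaults to today)
next_review_log() {
  dir=$1
  today=${2:-$(date +%Y%m%d)}
  last=0
  for f in "$dir"/*_grok-work-review-*.md; do
    [ -e "$f" ] || continue            # glob matched nothing
    n=${f##*-}                         # "07.md"
    n=${n%.md}                         # "07"
    case $n in ''|*[!0-9]*) continue ;; esac
    # Strip leading zeros so "08" is not parsed as an invalid octal number.
    n=$(printf '%s' "$n" | sed 's/^0*//')
    n=${n:-0}
    if [ "$n" -gt "$last" ]; then last=$n; fi
  done
  # Assumption: two-digit zero padding for NN.
  printf '%s/%s_grok-work-review-%02d.md\n' "$dir" "$today" $((last + 1))
}
```

A scoped commit would then look like `log=$(next_review_log .project/history)`, write the file, and `git add "$log"` alone, so that none of Grok's uncommitted files are staged.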
Definition of a good cycle
A cycle is done when: the window is scoped, every reproducible gate Grok cited was actually re-run
(not assumed), the verdict is recorded in a committed `.project/history/` log, objective status
reflects the evidence (changed only if warranted), and the spoken summary went out. "Looks fine" is
not a review — a number you didn't re-run is a number you didn't verify.
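The per-claim verdict rule, including the "dig before reporting" step for disagreeing numbers, can be sketched as follows. The function name `verdict` and the intermediate label `needs-clean-rerun` are hypothetical. The three final labels mirror the ✅ / ⚠ / ❌ verdicts in the loop.

```shell
# Hypothetical helper: classify one claim.
#   $1 = number Grok claimed
#   $2 = number measured this cycle (empty if the gate could not be re-run)
#   $3 = number from a clean re-run (optional; only needed on disagreement)
verdict() {
  claimed=$1
  measured=${2:-}
  rerun=${3:-}
  if [ -z "$measured" ]; then
    echo "un-reproduced-this-cycle"    # never assumed, never faked
  elif [ "$measured" = "$claimed" ]; then
    echo "reproduced"
  elif [ -z "$rerun" ]; then
    echo "needs-clean-rerun"           # possible measurement race; dig first
  elif [ "$rerun" = "$claimed" ]; then
    echo "reproduced"                  # first measurement was the faulty one
  else
    echo "contradicted"                # disagreement survived a clean re-run
  fi
}
```

For example, `verdict 297 290` asks for a clean re-run instead of reporting a defect, and only `verdict 297 290 290` yields `contradicted`.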
Invocation by Grok
Grok drives this via `scripts/grok-review.sh`, which runs `claude --model opus` headless against this
skill so Opus (not Grok) performs the review — an independent model checking Grok's work. Grok
should call it after finishing a batch and on the recurring cadence; it must not review its own
work in its own process.
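A minimal sketch of what the core of such a wrapper might look like, not the actual contents of `scripts/grok-review.sh`. The `claude` flags and the `GROK_REVIEW_PERM` override are taken from the commit message for this skill. The function name `grok_review`, the default prompt text, and the `GROK_REVIEW_DRY_RUN` switch are hypothetical.

```shell
# Hypothetical wrapper core: hand the review to Opus in a separate process.
#   $1 = prompt for the headless run (assumed wording, not from the repo)
grok_review() {
  # Permission mode defaults to bypassPermissions so the review can run
  # unattended; GROK_REVIEW_PERM overrides it.
  perm=${GROK_REVIEW_PERM:-bypassPermissions}
  prompt=${1:-"Run the grok-review skill on the current window."}
  # With the hypothetical GROK_REVIEW_DRY_RUN set, print the command
  # instead of executing it.
  ${GROK_REVIEW_DRY_RUN:+echo} claude --model opus --permission-mode "$perm" -p "$prompt"
}
```

Setting `GROK_REVIEW_DRY_RUN=1` makes it possible to inspect the exact command line before letting it run unattended.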