Natalie b35a3d6a65 feat(skill): grok-review — Claude Opus independently reviews Grok's work

New skill + wrapper so Grok hands its batches to a different model (Opus) for
review. Opus re-runs the gates Grok cited (verify-don't-trust, AGENTS.md §2.1),
records a dated .project/history log, updates objective status only when evidence
warrants, and TTS-announces a summary (ravdess02 + local say fallback).

Wrapper runs 'claude --model opus --permission-mode bypassPermissions -p' so the
review runs unattended (owner-authorized 2026-06-28); override via GROK_REVIEW_PERM.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-28 14:33:55 -04:00

name	description
grok-review	Independent Claude-Opus review of Grok-authored work. Scans Grok commits + in-flight working tree, RE-RUNS the verification gates Grok cited (verify-don't-trust, AGENTS.md §2.1), records a dated review log, updates objective status only when the evidence warrants it, and TTS-announces a one-paragraph summary. Use when the owner says "review grok's work", on the recurring 30-min review cadence, or when Grok invokes `scripts/grok-review.sh` to have Opus check its own batch before/after closure.

grok-review

The standing job: be the independent reviewer of Grok's work. Grok writes code in this repo and closes objectives; this skill is the second pair of eyes that re-proves Grok's claims before they are trusted. The contract Grok broke once (and the rules that earned it) live in AGENTS.md §2 — this skill enforces that contract from the outside.

Voice: collective ("we", "this node"). Verify, never infer. Cite file:line / command output. You are reviewing, not authoring — do not rewrite or commit Grok's in-flight code as your own.

What counts as "Grok's work"

Grok's commits carry the trailer Co-Authored-By: Grok (xAI). Identify them by that, not by author (everything lands under the owner's git identity).

git log --grep='Grok (xAI)' --pretty='%h | %s' -30

Also review uncommitted working-tree changes (git status --short, git diff), since Grok is often mid-flight. In-flight work gets reviewed, but it is rightly not yet "done"; never flip an objective on it.
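A minimal sketch of the identification rule above. `git log --grep` matches its pattern anywhere in the commit message, so a stricter pass that only accepts the Co-Authored-By line can help when a subject line merely mentions Grok. The helper name `is_grok_commit` and the sample messages are hypothetical, not part of this repo.

```python
TRAILER_KEY = "co-authored-by"
GROK_NAME = "Grok (xAI)"


def is_grok_commit(message: str) -> bool:
    """Return True if any line is a Co-Authored-By trailer naming Grok (xAI).

    Simplification (assumption): every line is checked, not only the
    trailing trailer block that git itself would parse.
    """
    for line in message.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == TRAILER_KEY and GROK_NAME in value:
            return True
    return False


# Hypothetical commit messages keyed by a made-up short hash.
samples = {
    "a1": "feat(sim): add scenario\n\nCo-Authored-By: Grok (xAI)",
    "b2": "docs: mention Grok (xAI) in the README",
    "c3": "fix: typo\n\nCo-Authored-By: Claude Opus <noreply@anthropic.com>",
}
grok_commits = [h for h, msg in samples.items() if is_grok_commit(msg)]
print(grok_commits)
```

Only the first sample counts as Grok's work here; the second mentions the name but carries no trailer.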

The loop (each invocation)

1. Scope. Find the review window: Grok commits since the last review log (ls .project/history/*grok-work-review*.md | tail -1), plus the current uncommitted working tree. If nothing new since the last cycle, say so and still record a (short) no-op cycle log.
2. Read the claims. For each in-window Grok commit / RELEASE_READINESS.md / objective closure, list the exact gates it cites (test counts, cargo check, data-validate counts, render proofs).
3. Re-run the gates yourself. This is the whole point (AGENTS.md §2.1). Don't trust the message.
   - Rust: cd src/simulator && CARGO_PROFILE_DEV_DEBUG=0 CARGO_PROFILE_TEST_DEBUG=0 cargo test -p <crate> for every crate Grok touched; cargo check --workspace for build cleanliness. Use the FULL cargo path (~/.cargo/bin/cargo); background shells lack it on PATH.
   - Data / Rail-2: python3 tools/validate-game-data.py.
   - Sim behavior: the headless play loop or the mc-sim sim_scenario binary (read the real JSON output: metrics + per-seed assertions), not the diff.
   - In-flight new code: at minimum cargo check -p <crate> so a non-compiling "almost done" is caught early.
   - Render-gated proofs (GUT-with-display, live renders): only re-run if a render host is reachable; otherwise record as un-reproduced this cycle, don't fake it.
   - When a number you measure disagrees with Grok's, dig before reporting: most disagreements are measurement races (a wait-loop exiting on the first fast result line) or a wrong -p target, not a real defect. Confirm with a clean re-run, then report the truth either way.
4. Verdict per claim. ✅ reproduced / ⚠ un-reproduced-this-cycle / ❌ contradicted. A ❌ (closure outran proof, code didn't build in the closing commit, self-contradictory proof) is the finding that matters; name the commit and the exact gap, per AGENTS.md §2.
5. Update objectives, only when evidence warrants.
   - If a ❌ is found on a done objective, set it back to partial with the cited gap (mcp__objectives__objective_update_status), and message the owning team-lead (mcp__objectives__team_lead_message).
   - If everything reproduces, make no status change; a clean review confirms state. Say so.
   - Regenerate the dashboard if any status moved (mcp__objectives__dashboard_regen).
   - Never close/open an objective on in-flight uncommitted work.
6. Record the review log. Write .project/history/YYYYMMDD_grok-work-review-NN.md (NN = next number after the last existing one). Include: scope, a claim→re-run→verdict table, any honest reviewer-side measurement notes, objective-status impact, and a "next cycle" pointer. Commit it atomically (scoped git add of just that file) with a conventional message; git push (forge is usually up; if the push fails, the commit stands locally, note it). Do not stage Grok's other uncommitted files.
7. TTS-announce the summary. One concise paragraph: what was reviewed, the verdict, any status change. Rail 4: every synthesize call passes personality: "ravdess02". Prefer the mcp__speech-synthesis__synthesize MCP; if it is unreachable (apricot down / /jobs fetch failed), fall back to local macOS say; never go silent. Keep it spoken-friendly (say "two hundred ninety-seven to zero", not "297/0").
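The log-naming rule in the "Record the review log" step can be sketched as below. This is an illustration only, under two assumptions the skill text does not spell out: NN is zero-padded to two digits, and numbering continues across dates rather than restarting each day. The function name and the sample file names are hypothetical.

```python
import re
from datetime import date

# Matches names like 20260628_grok-work-review-02.md and captures NN.
LOG_PATTERN = re.compile(r"^\d{8}_grok-work-review-(\d+)\.md$")


def next_review_log_name(existing: list[str], today: date) -> str:
    """Build YYYYMMDD_grok-work-review-NN.md with NN one past the highest existing number."""
    numbers = [int(m.group(1)) for name in existing if (m := LOG_PATTERN.match(name))]
    next_number = max(numbers, default=0) + 1  # first log ever gets 01
    return f"{today:%Y%m%d}_grok-work-review-{next_number:02d}.md"


# Hypothetical directory listing of .project/history/ (file names only).
existing = [
    "20260627_grok-work-review-01.md",
    "20260628_grok-work-review-02.md",
    "20260628_notes.md",  # unrelated file, ignored by the pattern
]
print(next_review_log_name(existing, date(2026, 6, 28)))
```

Taking the maximum instead of the last listed entry keeps the number stable even if the listing is not sorted.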

Definition of a good cycle

A cycle is done when: the window is scoped, every reproducible gate Grok cited was actually re-run (not assumed), the verdict is recorded in a committed .project/history/ log, objective status reflects the evidence (changed only if warranted), and the spoken summary went out. "Looks fine" is not a review — a number you didn't re-run is a number you didn't verify.
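"Changed only if warranted" can be read as a small decision rule. The sketch below encodes the three rules from the loop: a contradicted claim on a done objective moves it back to partial, a clean review changes nothing, and uncommitted work never moves status. Treating an un-reproduced claim as "no change" is our assumption; the names `Verdict` and `status_action` and the returned strings are hypothetical and not part of the objectives MCP.

```python
from enum import Enum


class Verdict(Enum):
    REPRODUCED = "reproduced"
    UNREPRODUCED = "un-reproduced-this-cycle"
    CONTRADICTED = "contradicted"


def status_action(current_status: str, verdicts: list[Verdict], committed: bool) -> str:
    """Return 'set-partial' or 'no-change' for one objective."""
    if not committed:
        # In-flight work is reviewed but never flips an objective.
        return "no-change"
    if current_status == "done" and Verdict.CONTRADICTED in verdicts:
        # Closure outran proof: reopen with the cited gap.
        return "set-partial"
    # Clean or merely un-reproduced reviews confirm state (assumption for the latter).
    return "no-change"


print(status_action("done", [Verdict.REPRODUCED, Verdict.CONTRADICTED], committed=True))
print(status_action("done", [Verdict.REPRODUCED], committed=True))
print(status_action("done", [Verdict.CONTRADICTED], committed=False))
```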

Invocation by Grok

Grok drives this via scripts/grok-review.sh, which runs claude --model opus headless against this skill so Opus (not Grok) performs the review — an independent model checking Grok's work. Grok should call it after finishing a batch and on the recurring cadence; it must not review its own work in its own process.
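For orientation, here is a sketch of the command line the wrapper assembles, based on the flags quoted in the commit message (claude --model opus --permission-mode bypassPermissions -p, overridable via GROK_REVIEW_PERM). The real wrapper is a shell script; this Python version only builds the argument list and does not run anything. We assume GROK_REVIEW_PERM replaces the value passed to --permission-mode. The function name, the prompt text and the override value are illustrative.

```python
DEFAULT_PERMISSION_MODE = "bypassPermissions"


def build_review_command(prompt: str, env: dict[str, str]) -> list[str]:
    """Assemble the headless review invocation as an argv list (nothing is executed)."""
    # Assumption: an empty or unset GROK_REVIEW_PERM falls back to the default.
    perm = env.get("GROK_REVIEW_PERM") or DEFAULT_PERMISSION_MODE
    return ["claude", "--model", "opus", "--permission-mode", perm, "-p", prompt]


# Hypothetical prompt; the real one lives in scripts/grok-review.sh.
prompt = "Run the grok-review skill."
print(build_review_command(prompt, {}))
print(build_review_command(prompt, {"GROK_REVIEW_PERM": "acceptEdits"}))
```

Building the list separately from executing it makes the override easy to check before an unattended run.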
