feat(skill): grok-review — Claude Opus independently reviews Grok's work
New skill + wrapper so Grok hands its batches to a different model (Opus) for review. Opus re-runs the gates Grok cited (verify-don't-trust, AGENTS.md §2.1), records a dated .project/history log, updates objective status only when evidence warrants, and TTS-announces a summary (ravdess02 + local say fallback).

Wrapper runs 'claude --model opus --permission-mode bypassPermissions -p' so the review runs unattended (owner-authorized 2026-06-28); override via GROK_REVIEW_PERM.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 52c71010c3
commit b35a3d6a65
2 changed files with 128 additions and 0 deletions
45
scripts/grok-review.sh
Executable file
@@ -0,0 +1,45 @@
#!/usr/bin/env bash
#
# grok-review.sh — have Claude Opus independently review Grok's work.
#
# Grok invokes this to hand its work to a *different* model (Opus) for review:
# Opus runs the `grok-review` skill, which re-runs the verification gates Grok
# cited (verify-don't-trust, AGENTS.md §2.1), records a dated review log under
# .project/history/, updates objective status only when the evidence warrants,
# and TTS-announces a one-paragraph summary (ravdess02, local `say` fallback).
#
# Usage:
#   scripts/grok-review.sh                              # review the current window, headless
#   GROK_REVIEW_MODEL=opus scripts/grok-review.sh
#   GROK_REVIEW_PERM=acceptEdits scripts/grok-review.sh # tighter permissions
#
# Env:
#   GROK_REVIEW_MODEL   claude --model value (default: opus)
#   GROK_REVIEW_PERM    claude --permission-mode (default: bypassPermissions —
#                       the review must run cargo/git/python/say + MCP unattended)
#
# Exit code is Claude's: non-zero means the review run itself failed (not that a
# defect was found — defects are reported in the log + spoken summary).
set -euo pipefail

MODEL="${GROK_REVIEW_MODEL:-opus}"
PERM="${GROK_REVIEW_PERM:-bypassPermissions}"

REPO_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || true)"
if [[ -z "${REPO_ROOT}" ]]; then
  echo "grok-review: not inside the magic-civilization git repo" >&2
  exit 2
fi
cd "${REPO_ROOT}"

if ! command -v claude >/dev/null 2>&1; then
  echo "grok-review: 'claude' CLI not found on PATH" >&2
  exit 2
fi

PROMPT='Run the grok-review skill now: independently review the latest Grok-authored work in this repo (commits trailered "Co-Authored-By: Grok (xAI)" plus the uncommitted working tree). Re-run every verification gate Grok cited — do not trust the commit messages. Record a dated review log under .project/history/, update objective status only if the evidence warrants it, commit + push the log, then TTS-announce a one-paragraph spoken summary using personality ravdess02 (fall back to local `say` if the speech MCP is unreachable). Follow AGENTS.md §2 and the skill body exactly.'

exec claude \
  --model "${MODEL}" \
  --permission-mode "${PERM}" \
  -p "${PROMPT}"
83
tooling/claude/dot-claude/skills/grok-review/SKILL.md
Normal file
@@ -0,0 +1,83 @@
---
name: grok-review
description: Independent Claude-Opus review of Grok-authored work. Scans Grok commits + in-flight working tree, RE-RUNS the verification gates Grok cited (verify-don't-trust, AGENTS.md §2.1), records a dated review log, updates objective status only when the evidence warrants it, and TTS-announces a one-paragraph summary. Use when the owner says "review grok's work", on the recurring 30-min review cadence, or when Grok invokes `scripts/grok-review.sh` to have Opus check Grok's batch before/after closure.
---

# grok-review

The standing job: be the **independent reviewer of Grok's work**. Grok writes code in this repo and
closes objectives; this skill is the second pair of eyes that re-proves Grok's claims before they are
trusted. The contract Grok broke once, and the rules that breach earned, live in `AGENTS.md §2`; this
skill enforces that contract from the outside.

> Voice: collective ("we", "this node"). Verify, never infer. Cite `file:line` / command output.
> You are reviewing, not authoring: do **not** rewrite or commit Grok's in-flight code as your own.

## What counts as "Grok's work"

Grok's commits carry the trailer `Co-Authored-By: Grok (xAI)`. Identify them by that, not by author
(everything lands under the owner's git identity).

```
git log --grep='Grok (xAI)' --pretty='%h | %s' -30
```

Also review **uncommitted** working-tree changes (`git status --short`, `git diff`), since Grok is often
mid-flight. In-flight work is reviewed but is **correctly not "done"**; never flip an objective on it.
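
`git log --grep` matches the phrase anywhere in a commit message, so a commit that only mentions "Grok (xAI)" in its subject or body would also be listed. A minimal sketch of a stricter check follows, assuming the trailer always starts its own line; `has_grok_trailer` and `grok_commits` are hypothetical helper names, not part of this commit.

```shell
# Sketch: succeed only when the commit message on stdin carries the Grok trailer.
# Assumption: the trailer begins a line, exactly as written in this skill.
has_grok_trailer() {
  grep -qE '^Co-Authored-By:[[:space:]]*Grok \(xAI\)'
}

# Hypothetical helper: print short hashes of Grok-trailered commits among the
# last N commits (default 30). Must be run inside the repository.
grok_commits() {
  local limit hash
  limit="${1:-30}"
  git log -n "${limit}" --pretty='%h' | while read -r hash; do
    # %B is the raw commit message (subject + body + trailers).
    if git log -1 --pretty='%B' "${hash}" | has_grok_trailer; then
      printf '%s\n' "${hash}"
    fi
  done
}
```

Running `grok_commits 30` would then cover the same window as the `git log` command above, minus commits that merely talk about Grok.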

## The loop (each invocation)

1. **Scope.** Find the review window: Grok commits since the last review log
   (`ls .project/history/*grok-work-review*.md | tail -1`), plus the current uncommitted working tree.
   If nothing is new since the last cycle, say so and still record a (short) no-op cycle log.
2. **Read the claims.** For each in-window Grok commit / `RELEASE_READINESS.md` / objective closure,
   list the *exact* gates it cites (test counts, `cargo check`, data-validate counts, render proofs).
3. **Re-run the gates yourself — this is the whole point (AGENTS.md §2.1).** Don't trust the message.
   - **Rust:** `cd src/simulator && CARGO_PROFILE_DEV_DEBUG=0 CARGO_PROFILE_TEST_DEBUG=0 cargo test -p <crate>`
     for every crate Grok touched; `cargo check --workspace` for build cleanliness. Use the FULL cargo
     path (`~/.cargo/bin/cargo`) — background shells lack it on PATH.
   - **Data / Rail-2:** `python3 tools/validate-game-data.py`.
   - **Sim behavior:** the headless play loop or the `mc-sim` `sim_scenario` binary (read the real
     JSON output — metrics + per-seed assertions), not the diff.
   - **In-flight new code:** at minimum `cargo check -p <crate>` so a non-compiling "almost done" is
     caught early.
   - **Render-gated proofs** (GUT-with-display, live renders): only re-run if a render host is
     reachable; otherwise record as *un-reproduced this cycle*, don't fake it.
   - When a number you measure disagrees with Grok's, **dig before reporting** — most disagreements
     are measurement races (a wait-loop exiting on the first fast result line) or a wrong `-p` target,
     not a real defect. Confirm with a clean re-run, then report the truth either way.
4. **Verdict per claim.** ✅ reproduced / ⚠ un-reproduced-this-cycle / ❌ contradicted. A ❌ (closure
   outran proof, code didn't build in the closing commit, self-contradictory proof) is the finding
   that matters — name the commit and the exact gap, per AGENTS.md §2.
5. **Update objectives — only when evidence warrants.**
   - If a ❌ is found on a `done` objective, set it back to `partial` with the cited gap
     (`mcp__objectives__objective_update_status`), and message the owning team-lead
     (`mcp__objectives__team_lead_message`).
   - If everything reproduces, **make no status change** — a clean review *confirms* state; say so.
   - Regenerate the dashboard if any status moved (`mcp__objectives__dashboard_regen`).
   - Never close/open an objective on in-flight uncommitted work.
6. **Record the review log.** Write `.project/history/YYYYMMDD_grok-work-review-NN.md` (NN = next
   number after the last existing one). Include: scope, a claim→re-run→verdict table, any honest
   reviewer-side measurement notes, objective-status impact, and a "next cycle" pointer. Commit it
   atomically (scoped `git add` of just that file) with a conventional message, then `git push` (the
   forge is usually up; if the push fails, the commit still stands locally, so note the failure). Do
   **not** stage Grok's other uncommitted files.
7. **TTS-announce the summary.** One concise paragraph — what was reviewed, the verdict, any status
   change. **Rail 4: every synthesize call passes `personality: "ravdess02"`.** Prefer the
   `mcp__speech-synthesis__synthesize` MCP; if it is unreachable (apricot down / `/jobs` fetch
   failed), fall back to local macOS `say` — never go silent. Keep it spoken-friendly (say "two
   hundred ninety-seven to zero", not "297/0").
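
The file naming in step 6 can be sketched as a small helper. This is an illustration under stated assumptions, not the skill's prescribed method: it assumes NN is a zero-padded decimal counter that keeps increasing across dates, and `next_review_log` is a hypothetical name.

```shell
# Sketch: print the path of the next review log for step 6.
# Assumptions: NN is a two-digit, zero-padded counter shared across all dates;
# next_review_log is a hypothetical helper, not something this commit ships.
next_review_log() {
  local dir today max f n
  dir="$1"
  today="${2:-$(date +%Y%m%d)}"
  max=0
  for f in "${dir}"/*_grok-work-review-*.md; do
    [ -e "${f}" ] || continue             # the glob matched nothing
    n="${f##*-review-}"                   # keep what follows the last "-review-"
    n="${n%.md}"
    case "${n}" in
      ''|*[!0-9]*) continue ;;            # skip names that do not end in a number
    esac
    n="$(printf '%s' "${n}" | sed 's/^0*//')"   # drop leading zeros (08 and 09 are invalid octal)
    n="${n:-0}"
    if [ "${n}" -gt "${max}" ]; then max="${n}"; fi
  done
  printf '%s/%s_grok-work-review-%02d.md\n' "${dir}" "${today}" "$((max + 1))"
}
```

Called as `next_review_log .project/history`, it prints a path to write and then stage with a scoped `git add`. Taking the maximum NN rather than counting files keeps the numbering stable if an older log is ever removed.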

## Definition of a good cycle

A cycle is done when: the window is scoped, every reproducible gate Grok cited was actually re-run
(not assumed), the verdict is recorded in a committed `.project/history/` log, objective status
reflects the evidence (changed only if warranted), and the spoken summary went out. "Looks fine" is
not a review — a number you didn't re-run is a number you didn't verify.
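
The "changed only if warranted" rule can be written down as a tiny decision function. The sketch below is illustrative: the verdict words and action names are hypothetical labels, and it assumes an un-reproduced claim leaves status untouched, which the skill does not state explicitly.

```shell
# Sketch: map per-claim verdicts (one per line on stdin) to a status action.
# Assumptions: the labels "reproduced", "unreproduced", "contradicted",
# "no-change" and "set-partial" are hypothetical; an un-reproduced claim is
# assumed not to move any objective.
status_action() {
  local verdict action
  action="no-change"
  while IFS= read -r verdict; do
    case "${verdict}" in
      reproduced|unreproduced|'') ;;           # neither verdict moves a status
      contradicted) action="set-partial" ;;    # one contradiction is enough
      *) echo "unknown verdict: ${verdict}" >&2; return 2 ;;
    esac
  done
  printf '%s\n' "${action}"
}
```

For example, `printf 'reproduced\ncontradicted\n' | status_action` prints `set-partial`, while a list with no contradicted claim prints `no-change`.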

## Invocation by Grok

Grok drives this via `scripts/grok-review.sh`, which runs `claude --model opus` headless against this
skill so **Opus** (not Grok) performs the review — an independent model checking Grok's work. Grok
should call it after finishing a batch and on the recurring cadence; it must **not** review its own
work in its own process.
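
A caller needs to read the wrapper's exit code correctly: per the script header, non-zero means the review run itself failed, not that a defect was found. A minimal sketch of such a caller follows; `run_review` and its messages are hypothetical, and only the wrapper path comes from this commit.

```shell
# Sketch: invoke the wrapper and interpret its exit code.
# Assumption: run_review and its messages are hypothetical. Verdicts are never
# read from the exit code; they live in the review log and the spoken summary.
run_review() {
  local wrapper rc
  wrapper="${1:-scripts/grok-review.sh}"
  if "${wrapper}"; then
    echo "review run completed"
  else
    rc=$?
    echo "review run itself failed (exit ${rc}); treat this cycle as not reviewed" >&2
    return "${rc}"
  fi
}
```

After a successful run, the newest file under `.project/history/` is the place to look for ✅ / ⚠ / ❌ results.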