Multi-LLM PR review
A cross-family second opinion on every PR. Runs in CI, posts findings as inline diff comments anchored to the flagged lines, and never approves or requests changes — humans own merge.
Why a second opinion
Same-family review (Claude reviewing Claude) shares blind spots. A different model family catches different things — rule hallucinations, missed tenancy wrapping, dropped error paths, security regressions one family is structurally biased toward missing. The cost is one extra LLM call per PR push, bounded by a per-PR token budget; the benefit is a second pair of eyes that consistently disagrees with the first.
INF-63 shipped the workflow scaffold. INF-126 shipped the reviewer script that the workflow invokes. INF-155 added SSE streaming, inline diff comments, and neutral check-run signalling. The reviewer has been enabled in CI since 2026-06-02 on openai/gpt-5.5 and runs on every PR push. See How it was enabled for historical context.
What it does
On every PR push (opened, synchronize, reopened, ready_for_review):
- `.github/workflows/multi-llm-review.yml` checks gating (feature flag, secret, escape-hatch markers, draft / fork / dependabot exclusions).
- If gating passes, it invokes `scripts/multi-llm-review.ts` with the PR number, base SHA, head SHA, the configured model list, and the per-PR token budget.
- The script:
  - Computes `git diff <base>...<head>`.
  - Reads `.ai/constitution.md` and any `.ai/specs/SPEC-*.md` paths it can extract from the PR body (regex + existence-check).
  - For each configured model: builds a prompt (constitution + specs + diff + reviewer rules), streams the response from the Vercel AI Gateway (SSE), parses the model's findings from a fenced JSON block, and posts findings as inline diff comments via the GitHub Reviews API.
  - Tracks token usage across models; aborts cleanly with a "budget exhausted" notice on the PR review if the budget would be exceeded.
  - When findings exist, creates a neutral check-run in the PR checks panel so findings are visible at a glance without blocking merge.
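The "regex + existence-check" step for spec paths could look roughly like the sketch below. The pattern, the function name `extractSpecPaths`, and the injected `exists` predicate are illustrative assumptions, not the script's actual code; injecting the predicate keeps the function pure so it can be unit-tested without touching the filesystem.

```typescript
// Assumed pattern: spec paths look like .ai/specs/SPEC-<slug>.md.
const SPEC_PATH = /\.ai\/specs\/SPEC-[A-Za-z0-9._-]+\.md/g;

// Return the unique spec paths mentioned in the PR body that actually exist.
// `exists` is a hypothetical injected check (e.g. a wrapper around the filesystem).
function extractSpecPaths(
  prBody: string,
  exists: (path: string) => boolean,
): string[] {
  const candidates = prBody.match(SPEC_PATH) ?? [];
  const unique = Array.from(new Set(candidates));
  return unique.filter(exists);
}
```

A path that is mentioned twice is read once, and a path that does not exist in the checkout is dropped rather than failing the run.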
Every finding is required by the prompt to cite the spec heading or AGENTS.md / constitution clause it flags. Uncited findings are dropped on the client side — reviewer noise is the failure mode we care most about.
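As a rough illustration of the parse-then-filter behaviour, the sketch below pulls the first fenced JSON block out of a model response and drops findings without a citation. The `Finding` shape and both function names are assumptions made for this example; the real schema may differ.

```typescript
// Hypothetical shape of one finding; the real script's schema may differ.
interface Finding {
  path?: string;
  line?: number;
  message: string;
  citation?: string; // spec heading or constitution clause the finding relies on
}

// Built at runtime so this example does not embed a literal fence marker.
const FENCE = "`".repeat(3);
const JSON_BLOCK = new RegExp(FENCE + "json\\s*\\n([\\s\\S]*?)" + FENCE);

// Parse the first fenced JSON block. Missing or malformed JSON yields
// zero findings instead of throwing.
function parseFindings(response: string): Finding[] {
  const match = JSON_BLOCK.exec(response);
  if (!match) return [];
  try {
    const parsed: unknown = JSON.parse(match[1] ?? "");
    if (!Array.isArray(parsed)) return [];
    return parsed.filter(
      (f): f is Finding =>
        typeof f === "object" && f !== null && typeof f.message === "string",
    );
  } catch {
    return [];
  }
}

// Keep only findings that cite a spec heading or clause.
function dropUncited(findings: Finding[]): Finding[] {
  return findings.filter(
    (f) => typeof f.citation === "string" && f.citation.trim().length > 0,
  );
}
```

Filtering on the client side means a model that ignores the citation rule produces fewer comments, not noisier ones.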
How findings are posted
Findings are posted as inline diff comments anchored directly to the changed line where the issue was found. This means reviewers see the annotation right on the relevant code, not in a separate comment thread.
Anchorable findings — those whose path:line location falls within an actual changed hunk — appear as inline comments on the RIGHT (new) side of the diff.
Non-anchorable findings — those with no location, a hallucinated path, or a line number outside any changed hunk — fall back into the top-level review body summary. This ensures that off-target model guesses do not cause the entire review POST to fail (the GitHub API rejects inline comments on lines not in the diff).
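The anchorable / non-anchorable split can be sketched as follows. This is a simplified model, assuming a plain unified diff and a hypothetical `LocatedFinding` shape; it treats every new-side line inside a hunk (added or context) as anchorable. The payload fields (`commit_id`, `event`, `body`, `comments` with `path`, `line`, `side`) follow the GitHub "create a review" request body.

```typescript
// Hypothetical finding shape; the real script's schema may differ.
interface LocatedFinding {
  path?: string;
  line?: number;
  message: string;
}

// Map each file path to the new-side line numbers that appear inside a hunk.
function hunkLines(diff: string): Map<string, Set<number>> {
  const byFile = new Map<string, Set<number>>();
  let current: Set<number> | undefined;
  let inHunk = false;
  let newLine = 0;
  for (const raw of diff.split("\n")) {
    if (raw.startsWith("diff --git")) {
      inHunk = false;
      current = undefined;
    } else if (!inHunk && raw.startsWith("+++ ")) {
      current = new Set<number>();
      byFile.set(raw.slice(4).replace(/^b\//, ""), current);
    } else if (raw.startsWith("@@")) {
      const m = /^@@ -\d+(?:,\d+)? \+(\d+)/.exec(raw);
      newLine = m ? Number(m[1]) : 0;
      inHunk = m !== null;
    } else if (inHunk && current && (raw.startsWith("+") || raw.startsWith(" "))) {
      current.add(newLine);
      newLine += 1;
    }
    // Lines starting with "-" exist only on the old side and are skipped.
  }
  return byFile;
}

// Split findings into inline-anchorable ones and summary fallbacks.
function partitionFindings(findings: LocatedFinding[], diff: string) {
  const lines = hunkLines(diff);
  const inline: LocatedFinding[] = [];
  const summary: LocatedFinding[] = [];
  for (const f of findings) {
    const anchorable =
      f.path !== undefined &&
      f.line !== undefined &&
      lines.get(f.path)?.has(f.line) === true;
    (anchorable ? inline : summary).push(f);
  }
  return { inline, summary };
}

// Build a comment-only review body; it never approves or requests changes.
function toReviewPayload(commitId: string, findings: LocatedFinding[], diff: string) {
  const { inline, summary } = partitionFindings(findings, diff);
  return {
    commit_id: commitId,
    event: "COMMENT" as const,
    body: summary.map((f) => `- ${f.message}`).join("\n"),
    comments: inline.map((f) => ({
      path: f.path as string,
      line: f.line as number,
      side: "RIGHT" as const,
      body: f.message,
    })),
  };
}
```

Partitioning before the POST is what keeps one hallucinated path from invalidating the whole review request.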
A neutral check-run named Multi-LLM review findings is created in the PR checks panel whenever findings exist. Neutral never blocks merge — it provides an informational signal ("N findings posted — advisory") without turning the check red. Clean reviews (zero findings) and skipped/error runs do not create a check-run; the existing green status is preserved.
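A minimal sketch of the check-run decision, assuming a hypothetical helper `neutralCheckRun`. The field names (`name`, `head_sha`, `status`, `conclusion`, `output`) are those of the GitHub "create a check run" request body; the title and summary wording is illustrative only.

```typescript
interface CheckRunPayload {
  name: string;
  head_sha: string;
  status: "completed";
  conclusion: "neutral";
  output: { title: string; summary: string };
}

// Return a neutral check-run payload when findings exist, otherwise null
// so that clean reviews leave the existing status untouched.
function neutralCheckRun(headSha: string, findingCount: number): CheckRunPayload | null {
  if (findingCount <= 0) return null;
  return {
    name: "Multi-LLM review findings",
    head_sha: headSha,
    status: "completed",
    conclusion: "neutral",
    output: {
      title: `${findingCount} findings posted (advisory)`,
      summary: "Advisory only. Humans own merge.",
    },
  };
}
```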
How the idle-timeout model works (streaming)
Gateway calls use SSE streaming with an idle/inactivity timer rather than a hard wall-clock timeout. The timer resets on every received chunk — a slow-but-progressing reasoning trace is never killed. Only a truly stalled upstream (no bytes arriving for GATEWAY_IDLE_TIMEOUT_MS = 60_000 ms) aborts the request.
This is strictly better than the previous wall-clock cap for reasoning models like openai/gpt-5 that may take well over 60 seconds to produce a full response on large diffs, but do so steadily without stalling. A stall (network partition, gateway unresponsive mid-generation) is still caught quickly via the idle timer.
The workflow's timeout-minutes: 8 provides an outer wall-clock guard for the job as a whole.
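The difference between an idle timer and a wall-clock cap can be modelled with a small watchdog that takes the current time as an argument. This is a sketch of the semantics only: the class name `IdleWatchdog` is hypothetical, and the wiring to a real timer and request abort is omitted.

```typescript
const GATEWAY_IDLE_TIMEOUT_MS = 60_000;

// Tracks the time of the last received chunk. The clock is passed in
// explicitly so the behaviour can be checked without real timers.
class IdleWatchdog {
  private readonly idleMs: number;
  private lastActivityMs: number;

  constructor(idleMs: number, startMs: number) {
    this.idleMs = idleMs;
    this.lastActivityMs = startMs;
  }

  // Call on every received SSE chunk; this resets the idle window.
  onChunk(nowMs: number): void {
    this.lastActivityMs = nowMs;
  }

  // True only when no chunk has arrived for the full idle window.
  isStalled(nowMs: number): boolean {
    return nowMs - this.lastActivityMs >= this.idleMs;
  }
}
```

For example, a stream that delivers a chunk at 50 s is still healthy at 100 s total elapsed, whereas a 60 s wall-clock cap would already have aborted it.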
How it was enabled (one-time repo-admin step)
The reviewer was enabled on 2026-06-02. The one-time setup was:
- Generated a Vercel AI Gateway token with access to `openai/gpt-5.5` and added it as the `VERCEL_AI_GATEWAY_TOKEN` repo secret.
- Set `MULTI_LLM_REVIEW_ENABLED=true` as a repo variable.
- Set `MULTI_LLM_REVIEW_MODELS=openai/gpt-5.5` as a repo variable.
The reviewer now runs on every PR push automatically. To tune the knobs:
- `MULTI_LLM_REVIEW_MODELS`: comma-separated model ids. Current: `openai/gpt-5.5`.
- `MULTI_LLM_REVIEW_TOKEN_BUDGET`: max total tokens per PR push. Default: `100000`.
How to skip a PR
Add `[skip ai-review]` to the PR title, or put it on its own line in the PR body. In the title, any substring match counts, as with every other CI bypass marker in the repo (per bypass-marker matching). In the body, only a line consisting of the marker triggers the skip; inline backticked or prose mentions do not. Because that strict-line rule applies only to the body, a backticked marker in the title still matches. If the title needs to mention the concept without skipping the review, spell it differently (e.g. skip-ai-review).
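The title and body rules can be expressed as a small predicate. This is a sketch, not the workflow's actual expression: the function name `shouldSkip` is hypothetical, and it assumes case-sensitive matching and that surrounding whitespace on the body line is ignored.

```typescript
const SKIP_MARKER = "[skip ai-review]";

// Title: any substring match. Body: the marker must be the whole line
// (assumption: leading and trailing whitespace is trimmed first).
function shouldSkip(title: string, body: string): boolean {
  if (title.includes(SKIP_MARKER)) return true;
  return body.split(/\r?\n/).some((line) => line.trim() === SKIP_MARKER);
}
```

Under these rules a title containing the backticked marker still skips, while a body sentence that merely mentions the marker does not.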
The workflow is also automatically skipped for:
- Draft PRs.
- PRs from forks (the secret is not available to fork PRs, by design).
- PRs opened by `dependabot[bot]`.
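Taken together, the gating conditions amount to a short chain of early exits. The real gating lives in the workflow YAML; the TypeScript below is only a model of it, and the `GateInput` fields, the reason strings, and the order of checks are all assumptions.

```typescript
// Hypothetical inputs; the real workflow evaluates these in YAML.
interface GateInput {
  enabled: boolean; // MULTI_LLM_REVIEW_ENABLED is "true"
  hasToken: boolean; // VERCEL_AI_GATEWAY_TOKEN is set
  skipMarker: boolean; // [skip ai-review] matched in title or body
  isDraft: boolean;
  isFork: boolean;
  author: string;
}

type GateResult = { run: true } | { run: false; reason: string };

// Every "do not run" outcome is a clean skip, never a failure.
function gate(input: GateInput): GateResult {
  if (!input.enabled) return { run: false, reason: "feature flag off" };
  if (!input.hasToken) return { run: false, reason: "gateway token missing" };
  if (input.skipMarker) return { run: false, reason: "skip marker present" };
  if (input.isDraft) return { run: false, reason: "draft PR" };
  if (input.isFork) return { run: false, reason: "fork PR" };
  if (input.author === "dependabot[bot]") return { run: false, reason: "dependabot PR" };
  return { run: true };
}
```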
How the budget works
The token budget is the cumulative ceiling per PR push across all configured models. The script:
- Estimates the prompt size (≈4 chars/token) before each gateway call.
- Skips remaining models cleanly if the estimate would overflow the budget — a "Skipped: token budget exhausted" review is posted so reviewers see the abort, rather than silent omission.
- Records the gateway's reported `usage.total_tokens` from the terminal SSE usage chunk after each successful call. Falls back to `estimateTokens` when the gateway omits the usage chunk (some providers do not support `stream_options`).
- Logs the running balance to the workflow's step output.
The default 100k/PR comfortably accommodates the constitution (~10k) + a typical spec (~5k) + a 30k diff + the response. Bigger refactor PRs may hit the budget; either raise MULTI_LLM_REVIEW_TOKEN_BUDGET (gh variable set MULTI_LLM_REVIEW_TOKEN_BUDGET --body 250000; this is a repo variable, so the new value applies to every PR until it is set back) or add [skip ai-review] to them.
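The budget flow can be sketched as a small tracker. `estimateTokens` is named in the script, but its signature here is assumed, and the `BudgetTracker` class with its method names is hypothetical.

```typescript
// Rough estimate at about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Cumulative token budget across all models for one PR push.
class BudgetTracker {
  private readonly budget: number;
  private used = 0;

  constructor(budget: number) {
    this.budget = budget;
  }

  // Pre-flight check before a gateway call, based on the prompt estimate.
  canAfford(prompt: string): boolean {
    return this.used + estimateTokens(prompt) <= this.budget;
  }

  // Prefer the gateway's reported total; fall back to an estimate
  // when the usage chunk is missing.
  record(reportedTotal: number | undefined, prompt: string, response: string): void {
    this.used += reportedTotal ?? (estimateTokens(prompt) + estimateTokens(response));
  }

  remaining(): number {
    return Math.max(0, this.budget - this.used);
  }
}
```

With the figures above, a prompt of roughly 10k + 5k + 30k = 45k tokens leaves about 55k of the default budget for the response and any further models.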
Failure modes (mostly soft-fail)
The review is advisory. Once enabled, the bot must not block PRs on flaky upstreams. Every plausible runtime failure exits 0 with a workflow-log warning. The exception: a small set of bootstrap failures exit non-zero because they mean the script literally cannot run.
| Mode | Behaviour |
|---|---|
| `VERCEL_AI_GATEWAY_TOKEN` missing | Workflow logs a warning and exits 0. |
| `MULTI_LLM_REVIEW_ENABLED != true` | Workflow logs a notice and exits 0. |
| `[skip ai-review]` in title or body | Workflow logs a notice and exits 0. |
| Gateway returns 5xx / idle timeout | Script posts a "Skipped: gateway error" review and continues. |
| Model returns malformed JSON | Script treats it as zero findings (better under-report than crash). |
| Budget exhausted before a model could run | Script posts a "Skipped: budget exhausted" review for that model. |
| Gateway stream stalls (no bytes for 60s) | Idle timer fires; treated like any other gateway error (posted as Skipped). |
| Diff is empty | Script logs and exits 0 (nothing to review). |
| GitHub Reviews API rejects inline comment | Script logs a warning and returns posted: false; no neutral check-run. |
| GitHub check-runs API returns 403/error | Warning logged; never throws; missing checks: write scope swallowed. |
Bootstrap failures (exit 1)
A small set of failures do exit non-zero, because they mean the script never got to the point where soft-failing would be honest:
| Mode | Behaviour |
|---|---|
| Malformed CLI args (`--pr 0`, missing `--token-budget`, …) | Throws from `parseCliArgs`; script exits 1. Almost always a workflow bug, not a PR bug. |
| `gh pr view <pr> --json title,body` fails (auth, rate-limit, 404) | Throws; script exits 1: no PR metadata means we cannot even build the prompt. |
| `git diff <base>...<head>` fails (corrupt fetch, missing ref) | Throws; script exits 1: no diff means there is literally nothing to review. |
Once the script gets past these three points, every subsequent failure is soft. If you see the Multi-LLM review check go red, look at the workflow log — it's almost certainly one of the three rows above.
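For illustration, strict argument parsing that throws on bad input might look like the sketch below. Only the flags `--pr`, `--token-budget`, and `--models` appear in this document; the `CliArgs` shape, the flag/value pairing, and the error messages are assumptions, and the real `parseCliArgs` may accept more flags.

```typescript
interface CliArgs {
  pr: number;
  tokenBudget: number;
  models: string[];
}

// Parse "--flag value" pairs and throw on anything malformed, so that a
// workflow bug surfaces as a red check instead of a silent skip.
function parseCliArgs(argv: string[]): CliArgs {
  const values = new Map<string, string>();
  for (let i = 0; i < argv.length; i += 2) {
    const flag = argv[i];
    const value = argv[i + 1];
    if (flag === undefined || !flag.startsWith("--") || value === undefined) {
      throw new Error(`Malformed argument near "${flag ?? ""}"`);
    }
    values.set(flag, value);
  }

  // A missing flag yields NaN here and fails the integer check.
  const pr = Number(values.get("--pr"));
  if (!Number.isInteger(pr) || pr <= 0) {
    throw new Error("--pr must be a positive integer");
  }
  const tokenBudget = Number(values.get("--token-budget"));
  if (!Number.isInteger(tokenBudget) || tokenBudget <= 0) {
    throw new Error("--token-budget must be a positive integer");
  }
  const models = (values.get("--models") ?? "")
    .split(",")
    .map((m) => m.trim())
    .filter((m) => m.length > 0);
  if (models.length === 0) {
    throw new Error("--models must list at least one model id");
  }
  return { pr, tokenBudget, models };
}
```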
Soak protocol (ongoing quality signal)
The reviewer has been live since 2026-06-02. The soak protocol (originally the activation gate) continues as an ongoing quality signal for deciding when to expand the panel:
- Watch the bot's inline review comments on merged PRs.
- React with 👍 on findings that genuinely helped.
- React with 👎 on findings that were noise (uncited, wrong, or pedantic style preferences).
- Decision gate for v2 (adding a second model):
  - If 👍 outweighs 👎 across 5+ PRs → schedule v2 (add `google/gemini-2.0-pro` as a second reviewer, a "panel of judges").
  - If 👎 outweighs 👍 → tighten the system prompt's "what counts as a finding" rules and re-soak.
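The decision gate can be written down as a tiny function. This is a sketch under stated assumptions: reactions are summed across PRs, the 5-PR minimum is applied to both branches, and a tie means "keep soaking". None of these details are specified above, and the type and function names are hypothetical.

```typescript
interface PrReactions {
  up: number; // thumbs-up reactions on the bot's findings for one PR
  down: number; // thumbs-down reactions for the same PR
}

type SoakDecision = "keep-soaking" | "schedule-v2" | "tighten-prompt";

function soakDecision(prs: PrReactions[], minPrs = 5): SoakDecision {
  if (prs.length < minPrs) return "keep-soaking";
  const up = prs.reduce((sum, pr) => sum + pr.up, 0);
  const down = prs.reduce((sum, pr) => sum + pr.down, 0);
  if (up > down) return "schedule-v2";
  if (down > up) return "tighten-prompt";
  return "keep-soaking";
}
```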
The thumbs-up rate is the only signal we trust for whether the feature is paying for itself. The INF-163 offline eval harness (scripts/review-eval/, whose Phase 0 shipped as INF-168) adds a complementary objective signal: precision and recall against a labelled benchmark of real historic Constellation defects.
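For reference, precision and recall over a labelled benchmark reduce to set arithmetic. The sketch below is not the harness's code; it assumes each defect and each reported finding can be reduced to a comparable string id, which is a simplification.

```typescript
// precision = true positives / everything the reviewer reported
// recall    = true positives / everything the benchmark labels as a defect
function precisionRecall(predicted: string[], labelled: string[]) {
  const truth = new Set(labelled);
  const pred = new Set(predicted);
  let truePositives = 0;
  pred.forEach((id) => {
    if (truth.has(id)) truePositives += 1;
  });
  const precision = pred.size === 0 ? 0 : truePositives / pred.size;
  const recall = truth.size === 0 ? 0 : truePositives / truth.size;
  return { precision, recall };
}
```

A reviewer that reports four findings, two of which match three labelled defects, scores a precision of 0.5 and a recall of 2/3.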
v2 sketch — panel of judges (deferred)
The YAML scaffold and the script both already loop over --models. v2 adds a second model to the list and lets the workflow post two sets of inline reviews per PR. Open questions for when v2 is up:
- Do we want a third "summariser" pass that reconciles findings across the two reviewers, or is two raw review sets fine?
- Do we want to surface a combined neutral check verdict that aggregates all models' finding counts?
Defer until v1 has soaked successfully.
Where it lives
| File | What |
|---|---|
| `.github/workflows/multi-llm-review.yml` | The workflow. Owns gating, environment, escape hatches. |
| `scripts/multi-llm-review.ts` | The reviewer script. Pure I/O at the edges, pure functions for unit-tested logic. |
| `scripts/multi-llm-review.test.ts` | Unit tests: arg parsing, spec extraction, budget tracker, prompt assembly, rendering, streaming, inline posting. |
| `.ai/specs/SPEC-inf-126-multi-llm-review-activation.md` | The activation spec. |
| `.ai/specs/SPEC-inf-155-multillm-stream-inline.md` | The streaming + inline comments spec. |
| `scripts/review-eval/` | Offline eval harness (INF-168). Scores any reviewer arm against a labelled benchmark; reports precision + recall. |
| `scripts/review-eval/benchmark/seed.json` | 3 seed labelled cases from real historic Constellation defects. |
| `.ai/specs/SPEC-inf-168-review-eval-harness.md` | Spec for the offline eval harness (Phase 0 of INF-163). |