Multi-LLM PR review

A cross-family second opinion on every PR. Runs in CI, posts findings as inline diff comments anchored to the flagged lines, and never approves or requests changes — humans own merge.

Why a second opinion

Same-family review (Claude reviewing Claude) shares blind spots. A different model family catches different things — rule hallucinations, missed tenancy wrapping, dropped error paths, security regressions one family is structurally biased toward missing. The cost is one extra LLM call per PR push, bounded by a per-PR token budget; the benefit is a second pair of eyes that consistently disagrees with the first.

INF-63 shipped the workflow scaffold. INF-126 shipped the reviewer script that the workflow invokes. INF-155 added SSE streaming, inline diff comments, and neutral check-run signalling. The reviewer has been enabled in CI since 2026-06-02 on openai/gpt-5.5 and runs on every PR push. See How it was enabled for historical context.

What it does

On every PR push (opened, synchronize, reopened, ready_for_review):

.github/workflows/multi-llm-review.yml checks gating (feature flag, secret, escape-hatch markers, draft / fork / dependabot exclusions).
If gating passes, it invokes scripts/multi-llm-review.ts with the PR number, base SHA, head SHA, the configured model list, and the per-PR token budget.
The script:
- Computes git diff <base>...<head>.
- Reads .ai/constitution.md and any .ai/specs/SPEC-*.md paths it can extract from the PR body (regex + existence-check).
- For each configured model: builds a prompt (constitution + specs + diff + reviewer rules), streams the response from the Vercel AI Gateway (SSE), parses the model's findings from a fenced JSON block, and posts findings as inline diff comments via the GitHub Reviews API.
- Tracks token usage across models; aborts cleanly with a "budget exhausted" notice on the PR review if the budget would be exceeded.
- When findings exist, creates a neutral check-run in the PR checks panel so findings are visible at a glance without blocking merge.

Every finding is required by the prompt to cite the spec heading or AGENTS.md / constitution clause it flags. Uncited findings are dropped on the client side — reviewer noise is the failure mode we care most about.

How findings are posted

Findings are posted as inline diff comments anchored directly to the changed line where the issue was found. This means reviewers see the annotation right on the relevant code, not in a separate comment thread.

Anchorable findings — those whose path:line location falls within an actual changed hunk — appear as inline comments on the RIGHT (new) side of the diff.

Non-anchorable findings — those with no location, a hallucinated path, or a line number outside any changed hunk — fall back into the top-level review body summary. This ensures that off-target model guesses do not cause the entire review POST to fail (the GitHub API rejects inline comments on lines not in the diff).

A neutral check-run named Multi-LLM review findings is created in the PR checks panel whenever findings exist. Neutral never blocks merge — it provides an informational signal ("N findings posted — advisory") without turning the check red. Clean reviews (zero findings) and skipped/error runs do not create a check-run; the existing green status is preserved.

How the idle-timeout model works (streaming)

Gateway calls use SSE streaming with an idle/inactivity timer rather than a hard wall-clock timeout. The timer resets on every received chunk — a slow-but-progressing reasoning trace is never killed. Only a truly stalled upstream (no bytes arriving for GATEWAY_IDLE_TIMEOUT_MS = 60_000 ms) aborts the request.

This is strictly better than the previous wall-clock cap for reasoning models like openai/gpt-5 that may take well over 60 seconds to produce a full response on large diffs, but do so steadily without stalling. A stall (network partition, gateway unresponsive mid-generation) is still caught quickly via the idle timer.

The workflow's timeout-minutes: 8 provides an outer wall-clock guard for the job as a whole.

How it was enabled (one-time repo-admin step)

The reviewer was enabled on 2026-06-02. The one-time setup was:

Generated a Vercel AI Gateway token with access to openai/gpt-5.5 and added it as the VERCEL_AI_GATEWAY_TOKEN repo secret.
Set MULTI_LLM_REVIEW_ENABLED=true as a repo variable.
Set MULTI_LLM_REVIEW_MODELS=openai/gpt-5.5 as a repo variable.

The reviewer now runs on every PR push automatically. To tune the knobs:

MULTI_LLM_REVIEW_MODELS — comma-separated model ids. Current: openai/gpt-5.5.
MULTI_LLM_REVIEW_TOKEN_BUDGET — max total tokens per PR push. Default: 100000.

How to skip a PR

Add [skip ai-review] to the PR title — any substring match counts here, as with every other CI bypass marker in the repo (per bypass-marker matching) — or on its own line in the PR body. Inline backticked or prose mentions in the body do not trigger the skip; that strict-line rule only applies to the body, not the title. If your PR title needs to discuss the marker without skipping the review, put the marker in a backticked code span and the workflow will still match it — wrap it in a different way (e.g. skip-ai-review) if you need to mention the concept without triggering.

The workflow is also automatically skipped for:

Draft PRs.
PRs from forks (the secret is not available to fork PRs, by design).
PRs opened by dependabot[bot].

How the budget works

The token budget is the cumulative ceiling per PR push across all configured models. The script:

Estimates the prompt size (≈4 chars/token) before each gateway call.
Skips remaining models cleanly if the estimate would overflow the budget — a "Skipped: token budget exhausted" review is posted so reviewers see the abort, rather than silent omission.
Records the gateway's reported usage.total_tokens from the terminal SSE usage chunk after each successful call. Falls back to estimateTokens when the gateway omits the usage chunk (some providers do not support stream_options).
Logs the running balance to the workflow's step output.

The default 100k/PR comfortably accommodates the constitution (~10k) + a typical spec (~5k) + a 30k diff + the response. Bigger refactor PRs may hit the budget; raise MULTI_LLM_REVIEW_TOKEN_BUDGET per-PR (gh variable set MULTI_LLM_REVIEW_TOKEN_BUDGET --body 250000) or [skip ai-review] them.

Failure modes (mostly soft-fail)

The review is advisory. Once enabled, the bot must not block PRs on flaky upstreams. Every plausible runtime failure exits 0 with a workflow-log warning. The exception: a small set of bootstrap failures exit non-zero because they mean the script literally cannot run.

Mode	Behaviour
`VERCEL_AI_GATEWAY_TOKEN` missing	Workflow logs a warning and exits 0.
`MULTI_LLM_REVIEW_ENABLED != true`	Workflow logs a notice and exits 0.
`[skip ai-review]` in title or body	Workflow logs a notice and exits 0.
Gateway returns 5xx / idle timeout	Script posts a "Skipped: gateway error" review and continues.
Model returns malformed JSON	Script treats it as zero findings (better under-report than crash).
Budget exhausted before a model could run	Script posts a "Skipped: budget exhausted" review for that model.
Gateway stream stalls (no bytes for 60s)	Idle timer fires; treated like any other gateway error (posted as Skipped).
Diff is empty	Script logs and exits 0 (nothing to review).
GitHub Reviews API rejects inline comment	Script logs a warning and returns `posted: false`; no neutral check-run.
GitHub check-runs API returns 403/error	Warning logged; never throws; missing `checks: write` scope swallowed.

Bootstrap failures (exit 1)

A small set of failures do exit non-zero, because they mean the script never got to the point where soft-failing would be honest:

Mode	Behaviour
Malformed CLI args (`--pr 0`, missing `--token-budget`, …)	Throws from `parseCliArgs`; script exits 1. Almost always a workflow bug, not a PR bug.
`gh pr view <pr> --json title,body` fails (auth, rate-limit, 404)	Throws; script exits 1 — no PR metadata means we cannot even build the prompt.
`git diff <base>...<head>` fails (corrupt fetch, missing ref)	Throws; script exits 1 — no diff means there is literally nothing to review.

Once the script gets past these three points, every subsequent failure is soft. If you see the Multi-LLM review check go red, look at the workflow log — it's almost certainly one of the three rows above.

Soak protocol (ongoing quality signal)

The reviewer has been live since 2026-06-02. The soak protocol (originally the activation gate) continues as an ongoing quality signal for deciding when to expand the panel:

Watch the bot's inline review comments on merged PRs.
React with 👍 on findings that genuinely helped.
React with 👎 on findings that were noise (uncited, wrong, or pedantic style preferences).
Decision gate for v2 (adding a second model):
- If 👍 outweighs 👎 across 5+ PRs → schedule v2 (add google/gemini-2.0-pro as a second reviewer — "panel of judges").
- If 👎 outweighs 👍 → tighten the system prompt's "what counts as a finding" rules and re-soak.

The thumbs-up rate is the only signal we trust for whether the feature is paying for itself. The INF-163 offline eval harness (scripts/review-eval/) adds a complementary objective signal: precision and recall against a labelled benchmark of real historic Constellation defects.

v2 sketch — panel of judges (deferred)

The YAML scaffold and the script both already loop over --models. v2 adds a second model to the list and lets the workflow post two sets of inline reviews per PR. Open questions for when v2 is up:

Do we want a third "summariser" pass that reconciles findings across the two reviewers, or is two raw review sets fine?
Do we want to surface a combined neutral check verdict that aggregates all models' finding counts?

Defer until v1 has soaked successfully.

Where it lives

File	What
`.github/workflows/multi-llm-review.yml`	The workflow. Owns gating, environment, escape hatches.
`scripts/multi-llm-review.ts`	The reviewer script. Pure I/O at the edges, pure functions for unit-tested logic.
`scripts/multi-llm-review.test.ts`	Unit tests: arg parsing, spec extraction, budget tracker, prompt assembly, rendering, streaming, inline posting.
`.ai/specs/SPEC-inf-126-multi-llm-review-activation.md`	The activation spec.
`.ai/specs/SPEC-inf-155-multillm-stream-inline.md`	The streaming + inline comments spec.
`scripts/review-eval/`	Offline eval harness (INF-168). Scores any reviewer arm against a labelled benchmark; reports precision + recall.
`scripts/review-eval/benchmark/seed.json`	3 seed labelled cases from real historic Constellation defects.
`.ai/specs/SPEC-inf-168-review-eval-harness.md`	Spec for the offline eval harness (Phase 0 of INF-163).

Why a second opinion​

What it does​

How findings are posted​

How the idle-timeout model works (streaming)​

How it was enabled (one-time repo-admin step)​

How to skip a PR​

How the budget works​

Failure modes (mostly soft-fail)​

Bootstrap failures (exit 1)​

Soak protocol (ongoing quality signal)​

v2 sketch — panel of judges (deferred)​

Where it lives​