Code Review After the Pull Request

The pull request was built for a world where humans wrote most of the code. We don’t live there anymore.

The pull request as a unit of review is a recent invention. GitHub shipped it in 2008. Before that, code review meant Fagan inspections, pair programming, or the homegrown pre-commit systems Google and Facebook ran internally. The PR didn’t just give us a UI — it institutionalized the change as the thing we look at. Eighteen years later, “review” and “review the diff” mean the same thing to almost everyone.

That equivalence is starting to creak.

When agents are writing the code, the assumptions baked into diff-centric review stop holding. It’s worth asking what review looks like when the substrate underneath it has shifted this much.

Diff review works because of priors most teams never have to say out loud:

  • Authors are humans with intent. The patch carries reasoning the reviewer can reconstruct.
  • Change is small relative to state. A typical PR touches a fraction of a percent of the codebase.
  • The surrounding code was once reviewed. You’re inspecting a delta against a trusted base.
  • Volume is bounded by typing speed. A team produces a manageable number of PRs per day.
  • Plausibility correlates with correctness. Code that reads well usually behaves well, because the human who wrote it had to think it through to type it.

Every one of those priors is a load-bearing assumption. None of them survive contact with an agent operating at full throttle.

Take the priors one at a time.

Intent doesn’t live in the diff. The agent’s reasoning lives in the prompt, the conversation, the tool calls it made along the way. By the time the patch lands in a PR, the why has been compressed into a commit message the agent wrote about its own work, after the fact. The PR description says “fix flaky test.” What it doesn’t say is that the agent fixed it by inserting a 500ms sleep, because that’s what eventually made the test green. The reasoning that arrived at the sleep — two failed attempts at synchronization, a misread of the race condition, a final shrug — is nowhere on the page.

flowchart LR
    subgraph Trail["The agent's reasoning trail — invisible to the reviewer"]
        P[Prompt] --> A1[read auth.ts]
        A1 --> A2[read tests]
        A2 --> A3[edit handler]
        A3 --> A4[run test: fail]
        A4 --> A5[edit retry logic]
        A5 --> A6[run test: fail]
        A6 --> A7[insert sleep 500ms]
        A7 --> A8[run test: pass]
    end
    subgraph Visible["What the PR carries forward"]
        CM["commit: 'fix flaky test'"]
    end
    A8 --> CM
    CM --> R[Human reviewer]

Change is no longer small relative to state. An agent that writes a feature end-to-end produces hundreds of lines spanning files the human reviewer has never opened. A diff that touches the auth layer, three handlers, two migration files, and a config schema isn’t unusual; it’s the median. The “small delta against trusted state” model inverts: most of the state is now also agent-written, and trust transitively decays. You’re not reviewing a change against a known base anymore — you’re reviewing a change against another change, against another change.

Volume blows past human bandwidth. A single engineer running three agents in parallel can open twenty PRs by lunch. At five minutes of attention each, that’s 100 minutes of review for one author. Multiply by the team. Reviewer attention was already the bottleneck in 2020. Now the queue grows faster than any reviewer can drain it.

Plausibility decouples from correctness. This is the dangerous one. AI-generated code reads well locally, because that’s what the model was trained to produce. The bugs aren’t typos a careful reviewer spots. They’re relational: a function that used to return Result<User, AuthError> now returns Option<User>; the callsite the agent updated handles the new type correctly; the six callsites it never touched silently lose their error branches. A retry policy that was bounded at three attempts now defaults to unbounded because a constant moved files and lost its initializer. A migration that “looked safe” adds a non-null column to a table that is append-heavy in production, a fact the agent had not read enough of the codebase to learn. Line-by-line review is exactly the wrong shape for these failures. The diff that contains the bug looks fine. The bug is in the relationship the diff perturbs.
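To make the callsite case concrete, here is a minimal Python sketch of the same failure. Every name in it (`authenticate_v1`, `authenticate_v2`, `unchanged_callsite`) is hypothetical, and Python exceptions stand in for the Result type. It illustrates the shape of the bug, not any particular codebase.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class User:
    name: str


class AuthError(Exception):
    """Raised under the old contract when a token is rejected."""


def authenticate_v1(token: str) -> User:
    # Old contract: failure is an exception the caller must handle.
    if token != "good":
        raise AuthError("bad token")
    return User("ada")


def authenticate_v2(token: str) -> Optional[User]:
    # New contract: failure is None. This is the only function in the diff.
    return User("ada") if token == "good" else None


def unchanged_callsite(token: str, authenticate: Callable, audit_log: List) -> int:
    # Written against v1 and absent from the diff. It still waits for an
    # exception, so under v2 its error branch can never run.
    try:
        user = authenticate(token)
    except AuthError:
        return 401
    audit_log.append(user)  # under v2 this records None and carries on
    return 200
```

With a bad token, `unchanged_callsite("bad", authenticate_v1, [])` returns 401, while `unchanged_callsite("bad", authenticate_v2, [])` returns 200. Nothing in the second function changed, so nothing about it appears in the diff.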

flowchart TB
    subgraph Human["Human-authored bug — lives in the diff"]
        H1[5 changed lines] --> H2[reviewer reads them]
        H2 --> H3[bug is in those lines]
        H3 --> H4[caught at review]
    end
    subgraph Agent["Agent-authored bug — lives in the relationships"]
        A1[5 changed lines] --> A2[reviewer reads them]
        A2 --> A3[diff looks fine]
        A3 --> A4[bug at 6 unchanged callsites]
        A4 --> A5[caught in production]
    end

The shape of the failure has changed. Diff review optimizes for catching the author’s mistake at the moment of integration. The agent’s failure mode isn’t a mistake at the moment of integration. It’s a misunderstanding of the system that the patch happens to expose. A reviewer staring at a clean-looking diff is looking in the wrong place.

If the PR isn’t the right review unit, what is? Three of the models people are building toward: pre-PR agent peer review, continuous state review, and the spec as the review unit.

The agent doesn’t open a PR alone. A second agent — different prompt, different tools, ideally different model — reviews the work before any human is paged. The human sees a curated subset, with the peer-review notes already attached.

In practice this looks like a reviewer-agent that reads the patch and asks pointed questions back at the authoring run: “This function silently catches and discards IOError where the previous version raised. Was that intentional?” “You added a new index but no migration for the existing 12M rows — was the table intended to be empty?” “This test asserts on a string that contains the current date. It will pass today and fail tomorrow.” The author-agent either justifies the choice in writing or revises. Only the resolved version reaches a human, and the conversation between the agents arrives attached.

This isn’t novel. It’s pair programming with the labor reallocated. The XP-era insight — that two pairs of eyes during authorship beat one pair of eyes after — applies even more cleanly when both pairs are cheap.

The risk is that two agents trained on similar distributions share the same blind spots. A reviewer-agent and an author-agent from the same model family will both confidently miss the same edge case, then sign off on each other’s confidence. Defense in depth requires different agents, not just two of them — different model families, different prompt strategies, ideally one constrained to a narrower role (a “type-checker reviewer” or a “security reviewer”) rather than a general critic.
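The question-and-revise loop described above can be sketched in a few lines. This is an illustration under loud assumptions: the reviewers and the author here are plain Python stubs standing in for model calls, and `peer_review`, `ReviewResult`, and `max_rounds` are hypothetical names. Each reviewer is deliberately narrow, in the spirit of the constrained roles suggested above.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A reviewer reads a patch and returns the questions it wants answered.
Reviewer = Callable[[str], List[str]]
# An author takes the patch plus open questions and returns a revised patch.
Author = Callable[[str, List[str]], str]


@dataclass
class ReviewResult:
    patch: str
    resolved: bool
    transcript: List[str] = field(default_factory=list)


def peer_review(patch: str, reviewers: List[Reviewer], author: Author,
                max_rounds: int = 3) -> ReviewResult:
    transcript: List[str] = []
    for round_no in range(1, max_rounds + 1):
        questions = [q for review in reviewers for q in review(patch)]
        if not questions:
            return ReviewResult(patch, True, transcript)
        transcript.extend(f"round {round_no}: {q}" for q in questions)
        patch = author(patch, questions)
    # Out of rounds: check the final revision once more before giving up.
    still_open = [q for review in reviewers for q in review(patch)]
    transcript.extend(f"unresolved: {q}" for q in still_open)
    return ReviewResult(patch, not still_open, transcript)


# Stub reviewers with narrow roles (stand-ins for differently prompted models).
def sleep_reviewer(patch: str) -> List[str]:
    return ["Why sleep instead of synchronizing?"] if "sleep(" in patch else []


def swallow_reviewer(patch: str) -> List[str]:
    hit = "except IOError: pass" in patch
    return ["IOError is caught and discarded. Intentional?"] if hit else []


def toy_author(patch: str, questions: List[str]) -> str:
    # A stub author that only knows how to fix one thing.
    return patch.replace("sleep(0.5)", "ready.wait()")
```

Calling `peer_review("sleep(0.5)", [sleep_reviewer, swallow_reviewer], toy_author)` resolves after one revision, and the transcript is what would travel to the human. A patch that also swallows `IOError` runs out of rounds and comes back with `resolved=False`, which is the case that should page someone.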

Stop tying review to the change. Run it against the codebase on a schedule. The knowledge graph surfaces facts about the system independent of any single PR — cycles introduced this week, contracts that drifted, public surfaces that grew, hot files that gained complexity faster than test coverage.

The graph notices things diff review never could: that payments/ has, for the first time, started importing from analytics/. That the User table has gained four nullable columns over the last fortnight, none of which any caller reads. That median test runtime has crept up 12% since Tuesday and 80% of the new time is in three suites. That a function which once had two callers now has fifty-seven. None of these are bugs. All of them are signals — and none of them are visible from inside any single PR.

Under this model, the PR becomes a moment in a continuous review stream rather than the unit of review itself. A diff that adds a cycle isn’t blocked because it’s a bad diff — it’s blocked because the system now contains a cycle it didn’t before, and the diff is the proximate cause. The reviewer’s question shifts from “is this change good?” to “is the system, including this change, in worse shape than yesterday?”
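As a rough sketch of what "the system now contains a cycle it didn't before" means mechanically, the snippet below compares two snapshots of a module import graph. The graph shape (a dict from module name to the set of modules it imports) and the function names are assumptions made for illustration. How a real tool would extract that graph is out of scope here.

```python
from typing import Dict, List, Optional, Set, Tuple

Graph = Dict[str, Set[str]]  # module -> modules it imports


def edges(graph: Graph) -> Set[Tuple[str, str]]:
    return {(src, dst) for src, dsts in graph.items() for dst in dsts}


def new_edges(before: Graph, after: Graph) -> Set[Tuple[str, str]]:
    """Import edges present today that were absent yesterday."""
    return edges(after) - edges(before)


def find_cycle(graph: Graph) -> Optional[List[str]]:
    """Return one import cycle as a list of modules, or None (DFS coloring)."""
    WHITE, GREY, BLACK = 0, 1, 2
    color: Dict[str, int] = {}
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        stack.append(node)
        for nxt in sorted(graph.get(node, ())):
            state = color.get(nxt, WHITE)
            if state == GREY:  # back edge: nxt is already on the current path
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in sorted(graph):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


def review_state(before: Graph, after: Graph) -> List[str]:
    """Findings about the system as a whole, independent of any one PR."""
    findings = [f"new import: {s} -> {d}" for s, d in sorted(new_edges(before, after))]
    if find_cycle(before) is None:
        cycle = find_cycle(after)
        if cycle:
            findings.append("new cycle: " + " -> ".join(cycle))
    return findings
```

Given a snapshot where `analytics` imports `payments`, and a later one where `payments` also imports `analytics`, `review_state` reports both the new edge and the cycle `analytics -> payments -> analytics`. The finding is about the system, and the diff is only its proximate cause.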

This is closer to monitoring than to review. Which may be the point: when changes outpace reviewer bandwidth, you don’t review faster, you observe. Engineering organizations already accepted this trade for production behavior — nobody reads every log line; we set alerts and look at dashboards. The same shape applies to code.

If the agent reliably implements specifications, the spec is the artifact worth reviewing. Approve a clear, testable specification; let the agent produce the implementation; verify the implementation against the spec automatically. The diff becomes incidental.

Concretely: the spec for a payment retry handler says “idempotent on retry, safe under concurrent calls with the same idempotency key, surfaces gateway errors without retry on 4xx, retries with exponential backoff on 5xx up to a configured ceiling.” The reviewer’s argument is whether those properties are the right properties — whether the ceiling should be configurable per-merchant, whether 429 should be its own case, whether idempotency should extend to partial failures. The implementation is downstream of that argument. A property test suite validates the implementation against the spec on every change; the spec is what humans actually read.
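One way to picture "a property test suite validates the implementation against the spec" is the sketch below. It uses only the standard library and a seeded random generator rather than a property-testing framework. The handler `charge_with_retry`, the `FakeGateway`, and the decision to treat 429 like any other 4xx are all assumptions made for illustration. The concurrency clause of the spec is not exercised here.

```python
import random
from typing import Callable, List


def charge_with_retry(gateway: Callable[[str], int], key: str, max_attempts: int,
                      base_delay: float, sleep: Callable[[float], None]) -> int:
    """Hypothetical handler under test. Returns the final HTTP status."""
    status = 0
    for attempt in range(max_attempts):
        status = gateway(key)
        if status < 500:
            return status  # success or 4xx: surface it, never retry
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    return status


class FakeGateway:
    """Replays a scripted list of statuses and records what it was asked."""

    def __init__(self, script: List[int]):
        self.script = script
        self.keys_seen: List[str] = []

    def __call__(self, key: str) -> int:
        status = self.script[len(self.keys_seen)]
        self.keys_seen.append(key)
        return status


def check_spec(trials: int = 200, seed: int = 0) -> int:
    rng = random.Random(seed)
    for _ in range(trials):
        ceiling = rng.randint(1, 6)
        script = [rng.choice([200, 400, 404, 429, 500, 503]) for _ in range(8)]
        gateway, delays = FakeGateway(script), []
        status = charge_with_retry(gateway, "key-1", ceiling, 0.1, delays.append)
        calls = len(gateway.keys_seen)

        # Stops at the first non-5xx response, or at the configured ceiling.
        first_final = next((i for i, s in enumerate(script) if s < 500), None)
        expected = ceiling if first_final is None else min(ceiling, first_final + 1)
        assert calls == expected
        assert status == script[calls - 1]
        # Every attempt reuses the same idempotency key.
        assert set(gateway.keys_seen) == {"key-1"}
        # Backoff doubles between attempts, with no sleep after the last one.
        assert delays == [0.1 * 2 ** i for i in range(calls - 1)]
    return trials
```

The assertions inside `check_spec` are the part a human would argue about. Moving 429 into the retry branch, for instance, is a one-line change to the spec check that the implementation then has to follow.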

This shifts the human review surface from code to intent. It’s the same move TLA+ asked us to make more than two decades ago, with a different forcing function. Spec quality becomes the discipline that matters; implementation quality becomes the agent’s problem.

The catch is that most teams don’t have specs precise enough to be authoritative. Most “specs” today are Notion docs that hand-wave the hard parts. The AI-generation era may finally be the thing that pushes teams to write specs that bind — because the alternative is reviewing diffs that nobody can keep up with.

| Model | Unit of review | Best at catching | Where it breaks |
|---|---|---|---|
| Pre-PR agent peer review | The patch, before humans see it | Local errors and contract changes the author dismissed | Shared blind spots across same-family agents |
| Continuous state review | The whole codebase, on a schedule | Cross-PR drift: cycles, surface growth, coverage gaps | Loose coupling to any single change |
| Spec-as-review-unit | The specification itself | Spec/implementation divergence, intent errors | Most teams lack specs precise enough to bind |

None of these wins outright. The PR survives, but as one signal among several:

  • Agents review agents at authorship time. Most patches never reach a human.
  • The graph reviews the system continuously. Architectural drift gets caught between PRs, not within them.
  • Humans review specs and architecture on a longer cadence. Patches get sampled, not read.
  • Audit replaces review for the long tail. The agent maintains a log; humans inspect the log when something breaks.

flowchart LR
    A[Author Agent<br/>writes patch] --> B[Reviewer Agent<br/>different model]
    B -->|finds issues| A
    B -->|approved| C[Spec Verification<br/>property tests]
    C --> D[Graph Delta<br/>cycles, surface, coverage]
    D --> E{Sampled<br/>for human?}
    E -->|yes| F[Human Audit]
    E -->|no| G[Merge]
    F --> G
    G --> H[Behavior Diff<br/>shadow traffic]
    H -->|regression| I[Rollback / page]
    H -->|clean| J[Promote]

Each stage catches a different failure mode; none of them tries to catch all of them. The diff still exists — it’s just no longer where the work happens.
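The "sampled for human?" decision in that pipeline could be as simple as the sketch below. It assumes some upstream stage has already produced a risk score between 0 and 1 for each patch. Where that score comes from, along with the names `sample_rate` and `sampled_for_human` and the 5% floor, is hypothetical. Hashing the patch identifier keeps the decision deterministic, so an audit can later reproduce why a given patch was or wasn't shown to a person.

```python
import hashlib


def sample_rate(risk: float, floor: float = 0.05) -> float:
    """Map a risk score in [0, 1] to a probability of human audit.

    The floor keeps a trickle of low-risk patches in front of humans,
    so the risk scorer itself stays under observation.
    """
    return max(floor, min(1.0, risk))


def sampled_for_human(patch_id: str, risk: float, floor: float = 0.05) -> bool:
    # Hash the id into [0, 1): same patch, same answer, every time.
    digest = hashlib.sha256(patch_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return bucket < sample_rate(risk, floor)
```

A patch scored at maximum risk is always sampled, because the bucket is strictly below 1.0. A patch scored at zero still has roughly a one-in-twenty chance under the default floor.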

The PR page on GitHub is going to look increasingly like an artifact of an earlier era — a screen humans visit to rubber-stamp work that’s already been reviewed by other agents and validated by continuous checks. The actual review surface will be elsewhere.

The implications are uncomfortable for the current tooling stack. Most of it — GitHub, GitLab, every PR-comment bot — is built around the diff as the unit of work. The next generation looks different:

  • Spec-first authoring that makes the agent’s intent reviewable before any code exists.
  • Graph-based continuous review that runs against the codebase, not the change.
  • Agent-on-agent review pipelines with deliberately diverse models and prompts.
  • Sampling-based human audit that tells reviewers which PRs deserve their attention rather than asking them to triage all of them.
  • Behavior diffs, not source diffs — review what the system does differently, not what the source looks like differently.
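A behavior diff, in its smallest form, replays recorded inputs through the old and new versions of a function and reports where the outputs part ways. The sketch below assumes handlers are pure functions of a request, which real services rarely are. `behavior_diff`, `old_policy`, and `new_policy` are hypothetical names, and the example reuses the lost-initializer retry bug from earlier in the piece.

```python
from typing import Any, Callable, Iterable, List, Tuple

Observation = Tuple[str, Any]


def observe(handler: Callable[[Any], Any], request: Any) -> Observation:
    # Exceptions are behavior too, so record them instead of propagating.
    try:
        return ("ok", handler(request))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def behavior_diff(old: Callable[[Any], Any], new: Callable[[Any], Any],
                  recorded_requests: Iterable[Any]
                  ) -> List[Tuple[Any, Observation, Observation]]:
    """Replay recorded requests through both versions; keep the disagreements."""
    diffs = []
    for request in recorded_requests:
        before, after = observe(old, request), observe(new, request)
        if before != after:
            diffs.append((request, before, after))
    return diffs


# Hypothetical example: the default retry bound lost its initializer.
def old_policy(config: dict):
    return config.get("max_retries", 3)


def new_policy(config: dict):
    return config.get("max_retries")  # None here means "unbounded"
```

Replaying `[{"max_retries": 5}, {}]` through both policies yields a single difference: for the empty config, the old version returns 3 and the new one returns `None`. The source diff for that change might be a moved constant, while the behavior diff is an unbounded retry loop.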

None of this looks like the GitHub PR page. It probably shouldn’t.

Code review is one of the few engineering rituals that survived almost unchanged from the era when humans wrote everything. The pull request became the universal answer because, for a long time, the universal question was “is this change, made by this person, safe to merge?”

That question still gets asked. It’s just no longer the most important one. The more important question — is the system, with this change in it, still in good shape? — was always harder, and we’ve been answering it indirectly through PR review for a long time. Indirect was tolerable when humans were the bottleneck. It isn’t anymore.

The pull request was built for humans. The review primitive needs to catch up to who’s actually writing the code.

A few predictions worth committing to, however imperfectly:

The review surface fragments. Today, “review” is a single button on a single page. In five years, it’s a pipeline with named stages: agent peer review at authorship, property verification against spec, graph delta against the system, sampled human audit, and post-merge behavior diffing in a staging environment that runs the new code against yesterday’s traffic. Different stages catch different failure modes; no stage tries to catch all of them. The PR page, if it still exists, is a thin UI over that pipeline rather than the place where review happens.

Human attention moves up the stack. Engineers spend less time reading patches and more time reading specs, graph diffs, and post-incident logs. The skill that gets rare and expensive isn’t writing code or even reading it — it’s articulating invariants the system must maintain, and noticing when a graph delta means one of them is quietly slipping. “Senior engineer” comes to mean “the person who can tell you which signals are worth watching this quarter,” not “the person who reviews the most PRs.”

New roles appear, awkwardly named. Someone owns the review pipeline itself — tunes the agents, watches their false positive rate, decides which classes of finding escalate to humans. Call them a review engineer, an audit lead, a systems steward; the title will lag the work by a decade. Compliance and security will get there first, because they already think this way.

The default merge story changes. Today, merge means “approved by a human.” Tomorrow, merge means “passed a battery of automated checks, sampled into a human review stream, and observable post-deploy via behavior diff.” Human approval becomes a sample, not a gate. This will feel wrong to engineers trained on the current ritual, in the same way that “deploy without a release manager” felt wrong in 2010.

The tools that win look nothing like GitHub. GitHub will add an AI review tab; so will GitLab. That’s not the interesting move. The interesting move is the tool whose first abstraction is the system graph, whose second is the spec, and whose third — almost as an afterthought — is the diff. That tool may come from an incumbent pivoting hard, but more likely it comes from a team building greenfield, with no PR-shaped legacy to drag forward. Whoever ships it will look, in retrospect, like the team that shipped the pull request itself in 2008: solving a problem the previous generation of tools couldn’t see clearly because the substrate hadn’t shifted yet.

| Today | Tomorrow |
|---|---|
| Review = read the diff | Review = a pipeline of differently-shaped checks |
| One human approves before merge | Sampling decides which merges a human ever sees |
| The PR page is the review surface | The PR page is a thin UI over a pipeline that ran elsewhere |
| Senior engineer = reviews the most PRs | Senior engineer = articulates invariants and reads graph deltas |
| Tools center the diff | Tools center the system graph and the spec |
| Approval is the gate | Behavior diff in shadow traffic is the gate |

The pull request was a good answer to a question we no longer ask the same way. The next primitive isn’t a better PR. It’s a different question.