April 26, 2026
The pull request was built for a world where humans wrote most of the code. We don’t live there anymore.
The pull request as a unit of review is a recent invention. GitHub shipped it in 2008. Before that, code review meant Fagan inspections, pair programming, or the homegrown pre-commit systems Google and Facebook ran internally. The PR didn’t just give us a UI — it institutionalized the change as the thing we look at. Eighteen years later, “review” and “review the diff” mean the same thing to almost everyone.
That equivalence is starting to creak.
When agents are writing the code, the assumptions baked into diff-centric review stop holding. It’s worth asking what review looks like when the substrate underneath it has shifted this much.
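Diff review works because of four priors most teams never have to say out loud: the intent behind a change can be read off the change itself; the change is small relative to a trusted base; the volume of changes fits within reviewer attention; and code that reads plausibly is usually correct.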
Every one of those priors is a load-bearing assumption. None of them survive contact with an agent operating at full throttle.
Take the priors one at a time.
Intent doesn’t live in the diff. The agent’s reasoning lives in the prompt, the conversation, the tool calls it made along the way. By the time the patch lands in a PR, the why has been compressed into a commit message the agent wrote about its own work, after the fact. The PR description says “fix flaky test.” What it doesn’t say is that the agent fixed it by inserting a 500ms sleep, because that’s what eventually made the test green. The reasoning that arrived at the sleep — two failed attempts at synchronization, a misread of the race condition, a final shrug — is nowhere on the page.
flowchart LR
subgraph Trail["The agent's reasoning trail — invisible to the reviewer"]
P[Prompt] --> A1[read auth.ts]
A1 --> A2[read tests]
A2 --> A3[edit handler]
A3 --> A4[run test: fail]
A4 --> A5[edit retry logic]
A5 --> A6[run test: fail]
A6 --> A7[insert sleep 500ms]
A7 --> A8[run test: pass]
end
subgraph Visible["What the PR carries forward"]
CM["commit: 'fix flaky test'"]
end
A8 --> CM
CM --> R[Human reviewer]
Change is no longer small relative to state. An agent that writes a feature end-to-end produces hundreds of lines spanning files the human reviewer has never opened. A diff that touches the auth layer, three handlers, two migration files, and a config schema isn’t unusual; it’s the median. The “small delta against trusted state” model inverts: most of the state is now also agent-written, and trust transitively decays. You’re not reviewing a change against a known base anymore — you’re reviewing a change against another change, against another change.
Volume blows past human bandwidth. A single engineer running three agents in parallel can open twenty PRs by lunch. At five minutes of attention each, that’s 100 minutes of review, a good part of a reviewer’s afternoon, for one author. Multiply by the team. Reviewer attention was already the bottleneck in 2020. Now it’s the wall, and the wall is shorter than the queue.
Plausibility decouples from correctness. This is the dangerous one. AI-generated code reads locally well — that’s what the model was trained to produce. The bugs aren’t typos a careful reviewer spots. They’re relational: a function that used to return Result<User, AuthError> now returns Option<User>; the callsite the agent updated handles it correctly; the other six callsites elsewhere silently dropped their error branches when the type changed. A retry policy that previously bounded at three attempts now defaults to unbounded because a constant moved files and lost its initializer. A migration that “looked safe” added a non-null column to a table the agent hadn’t read enough of to know was append-heavy in production. Line-by-line review is exactly the wrong shape for these failures. The diff that contains the bug looks fine. The bug is in the relationship the diff perturbs.
flowchart TB
subgraph Human["Human-authored bug — lives in the diff"]
H1[5 changed lines] --> H2[reviewer reads them]
H2 --> H3[bug is in those lines]
H3 --> H4[caught at review]
end
subgraph Agent["Agent-authored bug — lives in the relationships"]
A1[5 changed lines] --> A2[reviewer reads them]
A2 --> A3[diff looks fine]
A3 --> A4[bug at 6 unchanged callsites]
A4 --> A5[caught in production]
end
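One way to make the relational failure concrete is to look at the callsites a diff did not touch. The sketch below is an illustration only, not a real review tool: it uses Python's `ast` module to list calls to a changed function in files outside the diff. The file names, the `authenticate` function, and the `untouched_callsites` helper are all hypothetical.

```python
import ast


def untouched_callsites(sources, changed_function, files_in_diff):
    """Return (filename, line) for each call to `changed_function`
    in files the diff did not touch.

    `sources` maps a filename to its source text. Matching is by bare
    name only (`f(...)` or `obj.f(...)`), so this over-approximates.
    """
    hits = []
    for filename, text in sources.items():
        if filename in files_in_diff:
            continue
        for node in ast.walk(ast.parse(text)):
            if not isinstance(node, ast.Call):
                continue
            func = node.func
            # ast.Name carries `id`, ast.Attribute carries `attr`.
            name = getattr(func, "id", None) or getattr(func, "attr", None)
            if name == changed_function:
                hits.append((filename, node.lineno))
    return sorted(hits)


# Hypothetical repository snapshot: the diff changed auth.py and login.py only.
SOURCES = {
    "auth.py": "def authenticate(token):\n    return None\n",
    "login.py": "from auth import authenticate\nuser = authenticate('t')\n",
    "billing.py": "import auth\n\ndef charge(t):\n    return auth.authenticate(t)\n",
    "admin.py": "from auth import authenticate\n\ndef panel(t):\n    u = authenticate(t)\n    return u\n",
}

REVIEW_GAPS = untouched_callsites(SOURCES, "authenticate", {"auth.py", "login.py"})
print(REVIEW_GAPS)
```

Every entry in `REVIEW_GAPS` is a line the reviewer never saw, because the diff had no reason to include it. Name matching this loose would be noisy on a real codebase; the point is only that the question "who else calls this?" can be asked mechanically.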
If the PR isn’t the right review unit, what is? A few of the models people are building toward:
Pre-PR agent peer review. The agent doesn’t open a PR alone. A second agent (different prompt, different tools, ideally a different model) reviews the work before any human is paged. The human sees a curated subset, with the peer-review notes already attached.
In practice this looks like a reviewer-agent that reads the patch and asks pointed questions back at the authoring run: “This function silently catches and discards IOError where the previous version raised. Was that intentional?” “You added a new index but no migration for the existing 12M rows — was the table intended to be empty?” “This test asserts on a string that contains the current date. It will pass today and fail tomorrow.” The author-agent either justifies the choice in writing or revises. Only the resolved version reaches a human, and the conversation between the agents arrives attached.
This isn’t novel. It’s pair programming with the labor reallocated. The XP-era insight — that two pairs of eyes during authorship beat one pair of eyes after — applies even more cleanly when both pairs are cheap.
The risk is that two agents trained on similar distributions share the same blind spots. A reviewer-agent and an author-agent from the same model family will both confidently miss the same edge case, then sign off on each other’s confidence. Defense in depth requires different agents, not just two of them — different model families, different prompt strategies, ideally one constrained to a narrower role (a “type-checker reviewer” or a “security reviewer”) rather than a general critic.
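The author/reviewer exchange described above can be sketched as a bounded loop. This is an assumption about the shape of such a pipeline, not a description of any existing tool: `author` and `reviewer` stand in for model calls and are plain callables here, and `Exchange`, `ReviewResult`, `peer_review`, and `max_rounds` are names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Exchange:
    question: str
    answer: str


@dataclass
class ReviewResult:
    patch: str
    transcript: list = field(default_factory=list)
    escalate: bool = False  # True when the reviewer still objects


def peer_review(patch, author, reviewer, max_rounds=3):
    """Run reviewer questions against the author until none remain.

    reviewer(patch) -> list of question strings
    author(patch, question) -> (possibly revised patch, written answer)
    """
    result = ReviewResult(patch=patch)
    for _ in range(max_rounds):
        questions = reviewer(result.patch)
        if not questions:
            return result
        for question in questions:
            result.patch, answer = author(result.patch, question)
            result.transcript.append(Exchange(question, answer))
    # Out of rounds: escalate to a human if objections remain.
    result.escalate = bool(reviewer(result.patch))
    return result


# Toy stand-ins for the two agents (hypothetical behavior).
def toy_reviewer(patch):
    if "except IOError: pass" in patch:
        return ["IOError is silently discarded. Was that intentional?"]
    return []


def toy_author(patch, question):
    revised = patch.replace("except IOError: pass", "except IOError: raise")
    return revised, "Not intentional; restored the raise."


RESULT = peer_review("try: read()\nexcept IOError: pass", toy_author, toy_reviewer)
print(RESULT.escalate, len(RESULT.transcript))
```

The transcript is what travels to the human along with the final patch. The round limit matters: two agents that share a blind spot can also agree forever, so an unresolved exchange is escalated rather than looped.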
Continuous state review. Stop tying review to the change. Run it against the codebase on a schedule. A knowledge graph of the codebase surfaces facts about the system independent of any single PR: cycles introduced this week, contracts that drifted, public surfaces that grew, hot files that gained complexity faster than test coverage.
The graph notices things diff review never could: that payments/ has, for the first time, started importing from analytics/. That the User table has gained four nullable columns over the last fortnight, none of which any caller reads. That median test runtime has crept up 12% since Tuesday and 80% of the new time is in three suites. That a function which once had two callers now has fifty-seven. None of these are bugs. All of them are signals — and none of them are visible from inside any single PR.
Under this model, the PR becomes a moment in a continuous review stream rather than the unit of review itself. A diff that adds a cycle isn’t blocked because it’s a bad diff — it’s blocked because the system now contains a cycle it didn’t before, and the diff is the proximate cause. The reviewer’s question shifts from “is this change good?” to “is the system, including this change, in worse shape than yesterday?”
This is closer to monitoring than to review. Which may be the point: when changes outpace reviewer bandwidth, you don’t review faster, you observe. Engineering organizations already accepted this trade for production behavior — nobody reads every log line; we set alerts and look at dashboards. The same shape applies to code.
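A minimal sketch of the graph delta, assuming the import graph is already available as a plain dictionary (how it gets extracted is out of scope here). It reports import edges that are new since the last snapshot and flags the ones that close a cycle. The package names and the `graph_delta` helper are hypothetical.

```python
def edges(graph):
    """Flatten {module: [imports]} into a set of (importer, imported) pairs."""
    return {(src, dst) for src, deps in graph.items() for dst in deps}


def reachable(graph, start, target):
    """Iterative depth-first search: can `start` reach `target`?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def graph_delta(yesterday, today):
    """Report import edges that are new today, and which of them close a cycle.

    A new edge (src, dst) lies on a cycle exactly when dst can reach src
    in today's graph.
    """
    new_edges = sorted(edges(today) - edges(yesterday))
    cycle_closing = [(s, d) for s, d in new_edges if reachable(today, d, s)]
    return {"new_edges": new_edges, "cycle_closing": cycle_closing}


# Hypothetical package-level import graphs for two consecutive days.
YESTERDAY = {"payments": ["ledger"], "analytics": ["payments"], "ledger": []}
TODAY = {"payments": ["ledger", "analytics"], "analytics": ["payments"], "ledger": []}

DELTA = graph_delta(YESTERDAY, TODAY)
print(DELTA)
```

The check is keyed on the state of the system, not on any one PR: whichever change added `payments -> analytics` is the proximate cause, but the finding is that the graph now contains a cycle it did not contain yesterday.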
Spec-as-review-unit. If the agent reliably implements specifications, the spec is the artifact worth reviewing. Approve a clear, testable specification; let the agent produce the implementation; verify the implementation against the spec automatically. The diff becomes incidental.
Concretely: the spec for a payment retry handler says “idempotent on retry, safe under concurrent calls with the same idempotency key, surfaces gateway errors without retry on 4xx, retries with exponential backoff on 5xx up to a configured ceiling.” The reviewer’s argument is whether those properties are the right properties — whether the ceiling should be configurable per-merchant, whether 429 should be its own case, whether idempotency should extend to partial failures. The implementation is downstream of that argument. A property test suite validates the implementation against the spec on every change; the spec is what humans actually read.
This shifts the human review surface from code to intent. It’s the same move TLA+ asked us to make more than two decades ago, with a different forcing function. Spec quality becomes the discipline that matters; implementation quality becomes the agent’s problem.
The catch is that most teams don’t have specs precise enough to be authoritative. Most “specs” today are Notion docs that hand-wave the hard parts. The AI-generation era may finally be the thing that pushes teams to write specs that bind — because the alternative is reviewing diffs that nobody can keep up with.
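To show what a spec that binds might look like, here is a sketch that turns three clauses of the payment retry spec above into randomized checks. Everything in it is an assumption made for illustration: `RetryingCharger` is a toy implementation, the gateway is a scripted fake, and the checks use only the standard library instead of a property-testing framework. The concurrency clause is deliberately left out; it needs a different kind of test.

```python
import random


class RetryingCharger:
    """Toy retry handler, written only to have something to check."""

    def __init__(self, gateway, max_attempts=4, base_delay_ms=100):
        self.gateway = gateway            # callable(key, amount) -> HTTP-like status
        self.max_attempts = max_attempts  # the configured ceiling
        self.base_delay_ms = base_delay_ms
        self.results = {}                 # idempotency key -> final status
        self.delays_ms = []               # backoff delays we would have slept

    def charge(self, key, amount):
        if key in self.results:           # replay: return the stored outcome
            return self.results[key]
        status = None
        for attempt in range(self.max_attempts):
            status = self.gateway(key, amount)
            if status < 500:              # success or 4xx: never retry
                break
            if attempt < self.max_attempts - 1:
                self.delays_ms.append(self.base_delay_ms * 2 ** attempt)
        self.results[key] = status
        return status


def check_spec(trials=200, seed=0):
    """Randomized check of three properties taken from the spec."""
    rng = random.Random(seed)
    for _ in range(trials):
        script = [rng.choice([200, 400, 429, 500, 503]) for _ in range(8)]
        calls = []

        def gateway(key, amount, script=script, calls=calls):
            calls.append(key)
            return script[len(calls) - 1]

        ceiling = rng.randint(1, 5)
        charger = RetryingCharger(gateway, max_attempts=ceiling)
        first = charger.charge("k1", 100)
        n_calls = len(calls)

        # Property 1: idempotent on retry; a replay never reaches the gateway.
        assert charger.charge("k1", 100) == first
        assert len(calls) == n_calls
        # Property 2: a 4xx is surfaced after exactly one gateway call.
        if 400 <= script[0] < 500:
            assert first == script[0] and n_calls == 1
        # Property 3: attempts never exceed the ceiling; delays double.
        assert n_calls <= ceiling
        for a, b in zip(charger.delays_ms, charger.delays_ms[1:]):
            assert b == 2 * a
    return True


SPEC_HOLDS = check_spec()
print(SPEC_HOLDS)
```

Note that the checks treat 429 like any other 4xx, because that is what the spec as written says. Whether it should be its own case is exactly the kind of argument the human review is for, and changing the answer means changing a property, not reading a diff.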
| Model | Unit of review | Best at catching | Where it breaks |
|---|---|---|---|
| Pre-PR agent peer review | The patch, before humans see it | Local errors and contract changes the author dismissed | Shared blind spots across same-family agents |
| Continuous state review | The whole codebase, on a schedule | Cross-PR drift: cycles, surface growth, coverage gaps | Loose coupling to any single change |
| Spec-as-review-unit | The specification itself | Spec/implementation divergence, intent errors | Most teams lack specs precise enough to bind |
None of these wins outright. The PR survives, but as one signal among several:
flowchart LR
A["Author Agent<br/>writes patch"] --> B["Reviewer Agent<br/>different model"]
B -->|finds issues| A
B -->|approved| C["Spec Verification<br/>property tests"]
C --> D["Graph Delta<br/>cycles, surface, coverage"]
D --> E{"Sampled<br/>for human?"}
E -->|yes| F[Human Audit]
E -->|no| G[Merge]
F --> G
G --> H["Behavior Diff<br/>shadow traffic"]
H -->|regression| I[Rollback / page]
H -->|clean| J[Promote]
Each stage catches a different failure mode; none of them tries to catch all of them. The diff still exists — it’s just no longer where the work happens.
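The last stage, the behavior diff, can be sketched as a replay of recorded requests through the old and new versions of a handler. This is a minimal illustration under stated assumptions: requests are plain dictionaries, handlers are pure functions, and the two `retry_budget` handlers are hypothetical stand-ins for a change that dropped a bound.

```python
def _outcome(handler, request):
    """Normalize a result or an exception into a comparable tuple."""
    try:
        return ("ok", handler(request))
    except Exception as exc:  # an exception is a behavior too
        return ("error", type(exc).__name__)


def behavior_diff(recorded_requests, old_handler, new_handler):
    """Replay recorded requests through both versions and collect mismatches.

    Handlers must be side-effect free here; in a real shadow setup the
    new version's writes would have to be isolated or discarded.
    """
    mismatches = []
    for request in recorded_requests:
        old = _outcome(old_handler, request)
        new = _outcome(new_handler, request)
        if old != new:
            mismatches.append({"request": request, "old": old, "new": new})
    return mismatches


# Hypothetical handlers: the new one lost the bound on retries.
def old_retry_budget(request):
    return min(request["attempts"], 3)


def new_retry_budget(request):
    return request["attempts"]


TRAFFIC = [{"attempts": n} for n in (1, 2, 3, 5)]
REGRESSIONS = behavior_diff(TRAFFIC, old_retry_budget, new_retry_budget)
print(REGRESSIONS)
```

Only the request that exceeds the old bound shows a difference, which is the point: the regression is invisible on most traffic and in the diff, and shows up only when the two versions are compared on the inputs that exercise the perturbed relationship.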
The PR page on GitHub is going to look increasingly like an artifact of an earlier era — a screen humans visit to rubber-stamp work that’s already been reviewed by other agents and validated by continuous checks. The actual review surface will be elsewhere.
The implications are uncomfortable for the current tooling stack. Most of it (GitHub, GitLab, every PR-comment bot) is built around the diff as the unit of work. The next generation looks different: it starts from the system graph and the spec, and treats the diff as one input among several.
None of this looks like the GitHub PR page. It probably shouldn’t.
Code review is one of the few engineering rituals that survived almost unchanged from the era when humans wrote everything. The pull request became the universal answer because, for a long time, the universal question was “is this change, made by this person, safe to merge?”
That question still gets asked. It’s just no longer the most important one. The more important question — is the system, with this change in it, still in good shape? — was always harder, and we’ve been answering it indirectly through PR review for a long time. Indirect was tolerable when humans were the bottleneck. It isn’t anymore.
The pull request was built for humans. The review primitive needs to catch up to who’s actually writing the code.
A few predictions worth committing to, however imperfectly:
The review surface fragments. Today, “review” is a single button on a single page. In five years, it’s a pipeline with named stages: agent peer review at authorship, property verification against spec, graph delta against the system, sampled human audit, and post-merge behavior diffing in a staging environment that runs the new code against yesterday’s traffic. Different stages catch different failure modes; no stage tries to catch all of them. The PR page, if it still exists, is a thin UI over that pipeline rather than the place where review happens.
Human attention moves up the stack. Engineers spend less time reading patches and more time reading specs, graph diffs, and post-incident logs. The skill that gets rare and expensive isn’t writing code or even reading it — it’s articulating invariants the system must maintain, and noticing when a graph delta means one of them is quietly slipping. “Senior engineer” comes to mean “the person who can tell you which signals are worth watching this quarter,” not “the person who reviews the most PRs.”
New roles appear, awkwardly named. Someone owns the review pipeline itself — tunes the agents, watches their false positive rate, decides which classes of finding escalate to humans. Call them a review engineer, an audit lead, a systems steward; the title will lag the work by a decade. Compliance and security will get there first, because they already think this way.
The default merge story changes. Today, merge means “approved by a human.” Tomorrow, merge means “passed a battery of automated checks, sampled into a human review stream, and observable post-deploy via behavior diff.” Human approval becomes a sample, not a gate. This will feel wrong to engineers trained on the current ritual, in the same way that “deploy without a release manager” felt wrong in 2010.
The tools that win look nothing like GitHub. GitHub will add an AI review tab; so will GitLab. That’s not the interesting move. The interesting move is the tool whose first abstraction is the system graph, whose second is the spec, and whose third — almost as an afterthought — is the diff. That tool may come from an incumbent pivoting hard, but more likely it comes from a team building greenfield, with no PR-shaped legacy to drag forward. Whoever ships it will look, in retrospect, like the team that shipped the pull request itself in 2008: solving a problem the previous generation of tools couldn’t see clearly because the substrate hadn’t shifted yet.
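Treating human approval as a sample rather than a gate implies some rule for choosing which merges a human sees. One possible rule, offered as a sketch and not as a recommendation: hash the change identifier into a bucket so the decision is deterministic and auditable, and let any risk flag from an earlier stage force an audit. The `sampled_for_human` function, the `risk_flags` parameter, and the 10% rate are all hypothetical.

```python
import hashlib


def sampled_for_human(change_id, rate, risk_flags=()):
    """Decide whether a change goes to human audit.

    Any risk flag forces an audit. Otherwise the decision is a
    deterministic function of the change id, so re-running the
    pipeline on the same change gives the same answer.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if risk_flags:
        return True
    digest = hashlib.sha256(change_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big")  # integer in [0, 2**32)
    return bucket < rate * 2 ** 32


CHANGE_IDS = [f"change-{i}" for i in range(10_000)]
AUDITED = sum(sampled_for_human(c, 0.1) for c in CHANGE_IDS)
print(AUDITED)
```

With a 10% rate, roughly one change in ten lands in front of a person, and a flagged change (for example one that closed an import cycle) always does. Hashing instead of calling a random number generator means nobody can re-roll the dice by re-running the pipeline.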
| Today | Tomorrow |
|---|---|
| Review = read the diff | Review = a pipeline of differently-shaped checks |
| One human approves before merge | Sampling decides which merges a human ever sees |
| The PR page is the review surface | The PR page is a thin UI over a pipeline that ran elsewhere |
| Senior engineer = reviews the most PRs | Senior engineer = articulates invariants and reads graph deltas |
| Tools center the diff | Tools center the system graph and the spec |
| Approval is the gate | Behavior diff in shadow traffic is the gate |
The pull request was a good answer to a question we no longer ask the same way. The next primitive isn’t a better PR. It’s a different question.