23. May 2026
The developer inner loop — edit, build, test, debug — is the highest-leverage piece of any engineering team’s workflow. It runs hundreds of times a day per developer. Shaving seconds off it compounds into hours; lengthening it by minutes kills momentum.
What we used to optimise was the human’s cycle time. Faster compilers, hot reload, snappier test runners. The human stayed in the seat throughout, dispatching each step and reading the output.
That assumption no longer holds. When a coding agent sits in the loop, the unit of iteration shifts. The agent runs its own inner loop inside the developer’s outer loop — planning, querying the codebase, editing, checking the result, correcting itself — and the developer only sees the final diff. The properties that matter for the inner loop have changed accordingly. This post is about what that new inner loop looks like and what you have to build to make it work in a real polyglot codebase.
In the classic loop, latency from action to feedback is what mattered. The developer initiated each step and parsed the output with eyes and a working memory.
In the agent-mediated loop, three things shift:
The agent’s cycle time is what we optimise. The developer sees minutes between prompts; the agent runs many self-correction iterations between those prompts. If each of those inner cycles is slow or noisy, the agent wastes the developer’s wall-clock time and tokens.
Feedback has to be machine-actionable, not just human-readable. A human can squint at a stack trace and infer the fix. An agent benefits from structured output with explicit hints. The single highest-leverage change you can make in this whole space is to turn your tool error messages into prompts — small structured payloads that tell the agent not just what’s wrong, but why the rule exists and how to fix it.
Generic tooling is wasteful. Asking an agent to find every caller of a function by grepping is precise the way a sledgehammer is precise. It loads a lot of irrelevant context into the agent’s working set, eats tokens, and gets things wrong on overloaded names. The same task takes one tool call against a code-intelligence index and returns exactly what’s needed.
These three shifts drive the architecture below. The agent needs a substrate it can ask precise questions of, sensors that give it actionable feedback, and codemods it can dispatch when it knows what change to make. Skills package the workflows that compose these pieces. The rest of this post is about each layer.
The substrate is the layer that lets the agent ask semantic questions about the code without parsing files itself. For a polyglot codebase this is non-negotiable — the agent needs one query language that works across Rust, TypeScript, Python, Go, and whatever else lives in the repo.
The current state of the art is built on three components:
Tree-sitter parsers provide a concrete syntax tree per language. Grammars exist for most languages in common use, and they all expose the same node API, although node type names differ from grammar to grammar. An AST-backed MCP server (like symgraph) sits on top of these and exposes the queries an agent actually needs: find all callers of this function, show me every implementation of this interface, what would break if I changed this region of code. These queries are what the agent should reach for instead of grep.
SCIP (SCIP Code Intelligence Protocol, a recursive acronym) is the modern successor to LSIF. Per-language indexers (scip-java, scip-typescript, scip-python, scip-go, and rust-analyzer's SCIP export for Rust) emit a common protobuf format with type-attributed symbol references. With SCIP in the substrate, “find references” can work across language boundaries: a TypeScript call to a function defined in a Python service can be resolved, provided both indexers agree on the symbol's name.
A baseline store records architecture-level metrics over time. Coupling, fan-in/fan-out, layering compliance, complexity per module. Most existing fitness function tools are stateless — they answer “is the codebase compliant right now” — but the more useful question for agent workflows is “did this PR make any of our metrics worse, and which ones got worse on purpose?” That diff is the most actionable signal you can give to a per-commit gate.
These three together form a uniform graph that any sensor or codemod can query. The architectural payoff is that adding a new fitness function or refactoring recipe means writing one query against the substrate, not one tool per language.
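The per-PR diff against a baseline can be sketched in a few lines. This is a minimal sketch, not a real tool: it assumes metrics are stored as a mapping of module name to metric values, that higher is always worse, and that a PR can list regressions it makes on purpose. The names `Drift`, `diff_metrics`, and `accepted` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    module: str
    metric: str
    before: float
    after: float


def diff_metrics(baseline, current, accepted=frozenset()):
    """Return the metrics that got worse relative to the baseline.

    Assumption: higher is worse for every metric. `accepted` holds
    (module, metric) pairs declared as intentional regressions.
    """
    drifts = []
    for module, metrics in sorted(current.items()):
        old = baseline.get(module, {})
        for metric, value in sorted(metrics.items()):
            before = old.get(metric)
            if before is None or value <= before:
                continue  # metric is new, unchanged, or improved
            if (module, metric) in accepted:
                continue  # got worse on purpose
            drifts.append(Drift(module, metric, before, value))
    return drifts


# Hypothetical numbers for two modules.
baseline = {"orders": {"fan_out": 4, "complexity": 11}, "billing": {"fan_out": 2}}
current = {"orders": {"fan_out": 6, "complexity": 11}, "billing": {"fan_out": 3}}

drifts = diff_metrics(baseline, current, accepted={("billing", "fan_out")})
for d in drifts:
    print(f"{d.module}.{d.metric}: {d.before} -> {d.after}")
```

Only the unaccepted regression is reported, which is the signal a per-commit gate can act on.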
Birgitta Böckeler’s Sensors for Coding Agents provides the cleanest taxonomy I’ve seen for this layer. Sensors split along two axes: computational vs inferential (deterministic tools vs LLM-led analyses), and where they fire in the lifecycle (session, pre-commit, CI, scheduled, runtime).
Computational sensors are the tools you already know: type checkers, linters, fitness function suites like ArchUnit and dependency-cruiser, SAST scanners like Semgrep, mutation testers, Gitleaks pre-commit hooks, OPA for infra policy. They run fast and produce deterministic verdicts.
For agents, the trick is what you do with their output. A standard ESLint message (“no-unused-vars”) is fine for a human; it’s a missed opportunity for an agent. Rewrite it:
{
  "rule": "no-unused-vars",
  "violation": "the variable 'session' is declared but never read",
  "location": "src/auth/middleware.ts:42:9",
  "rationale": "unused declarations in middleware often indicate dead error handling paths or forgotten wiring",
  "fix_hint": "either remove the declaration or wire it into the response context",
  "severity": "warn",
  "suppressible": true,
  "suppression_pattern": "// eslint-disable-next-line no-unused-vars -- <reason>"
}
That payload is a self-contained prompt. The agent doesn’t need to look up what the rule means or guess the fix — the message carries the architectural rationale and a concrete pattern for either complying or suppressing with a documented reason. The other lever the article calls out: make rules threshold-bumpable, not just suppressible. For continuous metrics like cyclomatic complexity, the agent should be able to raise the threshold by one with a comment — the constraint survives, the rule fires again if it gets worse, and the agent isn’t forced into pointless refactors when the alternative is a documented exception.
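One way to produce payloads like this is a thin wrapper that merges each raw finding with a team-maintained catalogue of rationales and fix hints. The sketch below assumes a simplified raw finding shape (`rule`, `message`, `file`, `line`, `column`), which is not ESLint's own JSON output; `RULE_CATALOGUE` and `enrich` are hypothetical names.

```python
import json

# Hypothetical catalogue: guidance the team records once per rule.
RULE_CATALOGUE = {
    "no-unused-vars": {
        "rationale": "unused declarations in middleware often indicate "
                     "dead error handling paths or forgotten wiring",
        "fix_hint": "either remove the declaration or wire it into the response context",
        "suppressible": True,
    }
}


def enrich(finding):
    """Merge a raw linter finding with catalogue guidance into one payload."""
    rule = finding["rule"]
    guidance = RULE_CATALOGUE.get(rule, {})
    payload = {
        "rule": rule,
        "violation": finding["message"],
        "location": f'{finding["file"]}:{finding["line"]}:{finding["column"]}',
        "rationale": guidance.get("rationale", "no rationale recorded for this rule"),
        "fix_hint": guidance.get("fix_hint", "no fix hint recorded for this rule"),
        "severity": finding.get("severity", "warn"),
        "suppressible": guidance.get("suppressible", False),
    }
    if payload["suppressible"]:
        # Only advertise a suppression pattern when suppression is allowed.
        payload["suppression_pattern"] = f"// eslint-disable-next-line {rule} -- <reason>"
    return payload


# Assumed raw finding shape, for illustration only.
raw = {
    "rule": "no-unused-vars",
    "message": "the variable 'session' is declared but never read",
    "file": "src/auth/middleware.ts",
    "line": 42,
    "column": 9,
}
payload = enrich(raw)
print(json.dumps(payload, indent=2))
```

Keeping the rationale and fix hint in a catalogue rather than in each tool means the same guidance can be attached to output from any linter that reports a rule name and a location.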
Inferential sensors are the interesting category. These are LLM-led reviews that run on a slower cadence and produce qualitative findings: modularity assessments, security reviews, data-handling audits, anti-pattern detection. Vlad Khononov’s Modularity Skills are an excellent reference for what this looks like in practice.
The non-obvious result from Böckeler’s experiments is that raw coupling metrics fed to an LLM produce worse findings than an LLM-led review that interprets the code itself. The deterministic data is most useful as grounding for the inferential pass, not as a direct signal. So the architecture is: deterministic tools compute the metrics, and the LLM-led review reads the code itself with those metrics supplied as grounding.
This pattern — deterministic substrate + inferential interpretation — is the highest-leverage thing you can do beyond basic fitness functions.
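As a rough illustration of that split, the deterministic numbers can be handed to the reviewer as labelled facts alongside the source, with the instruction to judge the code rather than the metrics. This is a sketch under assumptions: `build_review_prompt` is a hypothetical helper, the prompt wording is mine, and the call to an actual model is left out.

```python
def build_review_prompt(module, metrics, source):
    """Compose an inferential review prompt grounded in deterministic metrics."""
    facts = "\n".join(f"- {name}: {value}" for name, value in sorted(metrics.items()))
    return (
        f"Review the module `{module}` for modularity problems.\n"
        "Read the code itself and form your own judgement; the measured "
        "facts below are grounding, not the verdict.\n\n"
        f"Measured facts:\n{facts}\n\n"
        f"Source:\n{source}\n"
    )


# Hypothetical metrics and source for one module.
prompt = build_review_prompt(
    "orders",
    {"fan_out": 6, "fan_in": 2},
    "def place_order(cart):\n    ...",
)
print(prompt)
```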
Different sensors belong at different points in the lifecycle. The article’s framing here is useful: session-level sensors give the agent feedback during the edit; pre-commit sensors catch quick problems before they hit CI; CI sensors run the expensive things; scheduled sensors detect drift in the slow time horizon; runtime sensors gate production.
The two new positions worth thinking about are scheduled inferential reviews and baseline drift. Most teams don’t run a weekly modularity audit or compare current architecture metrics against last month’s baseline. Adding those captures the slow-burn problems that no per-PR gate catches — the gradual accumulation of cross-cutting hacks, the silent decay of test quality, the slow drift away from the architecture you said you had.
When the agent knows what to change, it shouldn’t be making the change by editing text. Codemod engines exist for this; ast-grep, used by the skill example later in this post, is one.
The pattern that emerges is: agent plans the refactor by querying the substrate, dispatches deterministic codemods to do the actual transform, then re-queries the sensors to verify the change didn’t drift the architecture. The LLM is doing orchestration; the codemod is doing the surgery. This loop is more reliable than letting the agent edit text directly.
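The plan, dispatch, verify loop can be sketched as a small orchestrator. This is an illustration only: `plan`, `apply_codemod`, and `run_sensors` are hypothetical callables standing in for the substrate query, the codemod engine, and the sensor suite, and the retry limit is an assumption.

```python
def refactor_loop(plan, apply_codemod, run_sensors, max_attempts=3):
    """Plan, apply codemods, re-run sensors; feed findings into the next plan.

    plan(feedback) returns a list of codemod steps, apply_codemod(step)
    performs one deterministic transform, and run_sensors() returns a list
    of structured findings (empty when everything passes).
    """
    feedback = []
    for attempt in range(1, max_attempts + 1):
        for step in plan(feedback):
            apply_codemod(step)
        feedback = run_sensors()
        if not feedback:
            return attempt  # number of iterations the agent needed
    raise RuntimeError(f"sensors still failing after {max_attempts} attempts")


# Stubs that simulate one failed verification followed by a pass.
applied = []
sensor_results = [[{"rule": "layering", "fix_hint": "invert the dependency"}], []]


def plan(feedback):
    return ["invert-dependency"] if feedback else ["extract-trait"]


def apply_codemod(step):
    applied.append(step)


def run_sensors():
    return sensor_results.pop(0)


attempts = refactor_loop(plan, apply_codemod, run_sensors)
print(attempts, applied)
```

The model only appears inside `plan`; everything that touches files is deterministic, which is what makes the loop repeatable.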
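Put the layers together and the ecosystem looks like this: the agent in the middle, the sensor categories and the codemods around it, and the substrate underneath all of them.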
The bidirectional arrows to the sensor categories carry the load-bearing insight: sensors aren’t just checks, they’re the feedback channel that closes the agent’s inner loop. The one-way arrow to codemods reflects that codemods are dispatched for execution and don’t need to return rich guidance — the verification happens by re-running sensors after the transformation.
Everything sits on a substrate that exposes consistent semantic queries. That uniformity is what makes adding a new fitness function tractable in a polyglot repo: write the query once against the substrate, not once per language.
A skill is a folder with a SKILL.md file — YAML frontmatter and markdown instructions that an agent loads when relevant. It’s the unit of agent capability that gets shared across teams.
A minimal skill for the layered-architecture fitness function looks like this:
---
name: enforce-layered-architecture
description: |
  Use when the user asks to verify, fix, or review layering of a Rust
  service against the hexagonal architecture pattern. Queries symgraph
  to detect domain-to-infrastructure imports and proposes inversions.
---
# Enforce layered architecture
## Inputs
- A repository with a `crates/` layout
- A `.architecture/layers.toml` declaring layer names and direction
## Procedure
1. Call symgraph-search to enumerate modules per layer
2. For each module, call symgraph-callees to identify outbound dependencies
3. Compare each edge against the declared direction
4. For violations, propose a trait-inversion fix using ast-grep recipes
## Output
For each violation: location, rationale, proposed inversion, severity.
Use the rich payload format (violation + rationale + fix_hint) so the
agent can self-correct.
That’s the skill. It’s a short file of instructions that references the substrate (symgraph) and points at a codemod engine (ast-grep). It’s authored once and called by name whenever it’s relevant.
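Step 3 of the procedure, comparing each edge against the declared direction, is the deterministic core of the skill and can be sketched directly. The sketch assumes layers are declared as an ordered list from outermost to innermost and that dependencies may only point inward; the format of `.architecture/layers.toml` is not specified here, so `LAYERS`, `find_violations`, and the module names are hypothetical.

```python
# Assumed declaration: outermost layer first, dependencies may only point inward.
LAYERS = ["infrastructure", "application", "domain"]


def find_violations(module_layer, edges, layers=LAYERS):
    """Return a rich payload for every dependency edge that points outward."""
    rank = {name: index for index, name in enumerate(layers)}
    violations = []
    for src, dst in edges:
        src_layer, dst_layer = module_layer[src], module_layer[dst]
        if rank[src_layer] > rank[dst_layer]:
            violations.append({
                "violation": f"{src} ({src_layer}) depends on {dst} ({dst_layer})",
                "location": src,
                "rationale": f"{src_layer} must not depend on {dst_layer}",
                "fix_hint": f"define a trait in {src_layer} and implement it in {dst_layer}",
                "severity": "error",
            })
    return violations


# Hypothetical modules and call edges, as a substrate query might return them.
module_layer = {
    "orders::place": "domain",
    "orders::service": "application",
    "db::postgres": "infrastructure",
}
edges = [
    ("orders::service", "orders::place"),  # inward: allowed
    ("db::postgres", "orders::place"),     # inward: allowed
    ("orders::place", "db::postgres"),     # domain -> infrastructure: violation
]

violations = find_violations(module_layer, edges)
for v in violations:
    print(v["violation"])
```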
Here’s the problem nobody talks about enough. A skill like the one above is specific. It’s a Rust skill that assumes hexagonal architecture, a crates/ layout, and ast-grep recipes that probably depend on Rust’s syntax. It will not help a Spring Boot service. It will not help a Django app. It will not help a Go monorepo.
The same applies to almost every useful skill:
Testing skills are tied to a test runner (cargo test, pytest, vitest, mvn test, go test).
Build skills are tied to a build tool (cargo, maven, gradle, npm, pnpm, bazel).
The combinatorics get out of hand quickly: N languages × M frameworks × P build tools × Q architectural styles. A single team’s useful skill catalogue might be ten or twenty skills. A platform team supporting an organisation’s full polyglot estate is suddenly looking at hundreds.
Worse, skills have dependencies that they can’t currently declare. The hexagonal architecture skill above needs symgraph installed, ast-grep available, and a Rust toolchain on the PATH. Today, users satisfy those dependencies by reading the skill’s prerequisites section and installing everything by hand. That doesn’t scale.
This is the moment in the maturation of a developer ecosystem when package managers, registries, and dependency declarations stop being optional and become structural.
Claude Code’s plugin system is one answer to this. The hierarchy is:
Skill: a folder with a SKILL.md.
Plugin: a directory with a manifest, .claude-plugin/plugin.json, that bundles one or more skills, plus optional commands, hooks, sub-agents, and MCP server declarations.
Marketplace: a repository with a .claude-plugin/marketplace.json listing plugins available for install. Users add the marketplace once and install plugins from it by name.
This gets you a real distribution channel. A platform team publishes a marketplace; teams across the organisation install the plugins they need; updates propagate via a single marketplace update rather than N hand-edits across repos.
The structure for a personal or team-level marketplace is small enough to fit in your head:
your-marketplace/
├── .claude-plugin/
│   └── marketplace.json
├── plugins/
│   └── fitness-functions/
│       ├── .claude-plugin/
│       │   └── plugin.json
│       └── skills/
│           ├── layered-architecture/
│           │   └── SKILL.md
│           └── modularity-review/
│               └── SKILL.md
└── README.md
The marketplace.json catalogues plugins; each plugin’s plugin.json describes what’s in it; each skill’s SKILL.md describes when to use it. End users run /plugin marketplace add <owner>/<repo> once, then /plugin install <plugin>@<marketplace> per plugin.
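A quick way to sanity-check a marketplace repository before publishing is to walk that layout and list the skills each plugin would expose. The sketch below relies only on the directory structure shown above; `list_skills` is a hypothetical helper, and it does not read or validate the contents of the manifest files.

```python
import tempfile
from pathlib import Path


def list_skills(marketplace_root):
    """Map each plugin directory to the skill folders that contain a SKILL.md."""
    root = Path(marketplace_root)
    found = {}
    for plugin_dir in sorted((root / "plugins").iterdir()):
        manifest = plugin_dir / ".claude-plugin" / "plugin.json"
        if not manifest.is_file():
            continue  # not a plugin: no manifest
        found[plugin_dir.name] = sorted(
            path.parent.name for path in plugin_dir.glob("skills/*/SKILL.md")
        )
    return found


# Build the example layout in a temporary directory and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    plugin = Path(tmp) / "plugins" / "fitness-functions"
    (plugin / ".claude-plugin").mkdir(parents=True)
    (plugin / ".claude-plugin" / "plugin.json").write_text("{}")
    for skill in ("layered-architecture", "modularity-review"):
        (plugin / "skills" / skill).mkdir(parents=True)
        (plugin / "skills" / skill / "SKILL.md").write_text("---\nname: example\n---\n")
    skills = list_skills(tmp)

print(skills)
```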
This solves discovery, versioning, and updates. What it doesn’t solve directly is the dependency on external MCP servers — which is where the next piece comes in.
A skill that depends on symgraph or any other MCP server has historically had no way to declare that dependency machine-readably. The skill says “install symgraph first” in prose; the user reads the prose, fetches symgraph from its own release page, installs it, configures their MCP client, then runs the skill.
That manual chain is fine for one skill. It’s a wet sock when you’re trying to ship a marketplace of fifty.
The piece that completes the picture is a delivery tool for the MCP servers themselves, something analogous to npx but for compiled binaries. Call it bx. It does for native binaries what npx does for Node packages: specify the binary, fetch the right asset from a GitHub release for the current platform, cache it, and exec it with stdio passthrough.
That last property is what makes it useful in MCP configs:
{
  "mcpServers": {
    "symgraph": {
      "command": "bx",
      "args": ["run", "grahambrooks/symgraph@^v2026.4", "--", "serve"]
    }
  }
}
The user installs bx once. Every MCP server is then a one-line config entry that handles fetching, caching, version pinning, and execution.
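The "fetch the right asset for the current platform" step is mostly name matching. The sketch below shows one way such a tool might choose among release assets; it is not how bx is implemented. The alias tables, the function `pick_asset`, and the asset file names are all assumptions made for illustration.

```python
import platform

# Assumed naming conventions seen in release asset file names.
OS_ALIASES = {
    "linux": ("linux",),
    "darwin": ("darwin", "macos", "apple"),
    "windows": ("windows",),
}
ARCH_ALIASES = {
    "x86_64": ("x86_64", "amd64"),
    "amd64": ("x86_64", "amd64"),
    "arm64": ("arm64", "aarch64"),
    "aarch64": ("arm64", "aarch64"),
}


def pick_asset(asset_names, system=None, machine=None):
    """Return the first asset whose name mentions this OS and CPU, else None."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    os_tokens = OS_ALIASES.get(system, (system,))
    arch_tokens = ARCH_ALIASES.get(machine, (machine,))
    for name in asset_names:
        lowered = name.lower()
        if any(t in lowered for t in os_tokens) and any(t in lowered for t in arch_tokens):
            return name
    return None


# Hypothetical asset names for a release.
assets = [
    "symgraph-x86_64-unknown-linux-gnu.tar.gz",
    "symgraph-aarch64-apple-darwin.tar.gz",
]
print(pick_asset(assets, system="Darwin", machine="arm64"))
print(pick_asset(assets, system="Linux", machine="x86_64"))
```

Returning `None` rather than guessing lets the caller fail with a clear message when a release has no build for the current platform.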
The truly interesting move is the skill-level integration. Skills declare their MCP server dependencies in frontmatter:
---
name: explore-code
description: Use when navigating an unfamiliar codebase.
mcp_servers:
- grahambrooks/symgraph@^v2026.4
- someorg/dep-cruiser@^v1
---
When the skill loads, the runner calls bx ensure --skill ./skill-dir/, which resolves and caches every declared dependency before the skill runs. Skills now compose with their dependencies attached. A marketplace becomes self-contained: install a plugin, every skill in it just works.
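The resolution step starts with reading the declared dependencies out of the frontmatter. The sketch below handles just enough of the format to extract the `mcp_servers` list from the example above; a real runner would use a proper YAML parser, and `declared_mcp_servers` and `split_spec` are hypothetical names.

```python
def declared_mcp_servers(skill_md):
    """Extract the mcp_servers entries from a SKILL.md frontmatter block."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter
    servers, in_list = [], False
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break  # end of frontmatter
        if line.startswith("mcp_servers:"):
            in_list = True
        elif in_list and stripped.startswith("- "):
            servers.append(stripped[2:].strip())
        elif in_list and stripped:
            in_list = False  # another key ends the list
    return servers


def split_spec(spec):
    """Split 'owner/repo@range' into (repository, version range)."""
    repo, _, version = spec.partition("@")
    return repo, version


skill_md = """---
name: explore-code
description: Use when navigating an unfamiliar codebase.
mcp_servers:
- grahambrooks/symgraph@^v2026.4
- someorg/dep-cruiser@^v1
---
# Explore code
"""

specs = declared_mcp_servers(skill_md)
for spec in specs:
    print(split_spec(spec))
```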
This is the package-manager moment for the skill ecosystem. Without it, every useful skill carries a “first, install these…” preamble. With it, skills are real composable units.
Concretely, here’s what an inner-loop iteration looks like with all of this wired up. A developer says “refactor the order placement flow to remove the direct database call from the domain layer”. The agent:
Calls symgraph-search for the order placement module. Calls symgraph-callees on each function to find what it depends on. The substrate returns a structured graph slice, not a pile of text.
If any sensor had fired during step 6, its rich payload (violation, rationale, fix hint) would let the agent loop back to step 4 and try again, without involving the developer. The developer only sees the final result.
The classic inner loop took fifteen minutes for the same refactor with two test cycles and a typo. The agent-mediated loop takes ninety seconds with the developer in the loop only at the prompt and the diff review.
If you want to build toward this, the order matters. Here’s what I’d prioritise:
First, the substrate. Stand up an AST MCP server for your codebase. symgraph is one option; you can also build your own around tree-sitter for the languages you actually use. Add a SCIP index. This gives you the query layer that everything else depends on. Without it, every skill ends up doing its own text parsing badly.
Second, the sensor payload format. Before writing more rules, fix the messages on the rules you have. Every tool output should carry violation, rationale, fix hint, severity, and suppression pattern in a structured shape. This is the single highest-leverage change for agent productivity.
Third, the baseline store. Even something as simple as committing a metrics.json per main-branch build, and diffing against it on PRs, gives you a per-PR drift signal that’s far more actionable than absolute thresholds.
Fourth, package what you have as a marketplace. Even if it’s just three or four skills, the discipline of writing them as a plugin with a manifest forces a quality threshold and makes them reusable beyond their first author.
Fifth, solve the binary delivery problem. Whether that’s bx, Homebrew taps, MCP-bundle files, or just well-written install scripts, get the friction of “install three binaries before the skill works” down to zero. This is the part that determines whether skills actually compose for users who didn’t author them.
The deeper architectural point is that we’re building the same kind of structural scaffolding for agents that we built for humans over the last three decades — package managers, registries, dependency graphs, semantic queries, observability. The components look different (skills instead of libraries, MCP servers instead of services, fitness functions instead of unit tests) but the patterns repeat. The teams that get there first are the ones who treat agent productivity as an engineering discipline, not a magic spell.