The Agentic Inner Loop: Sensors, Substrate, and Skills for Polyglot Development

The developer inner loop — edit, build, test, debug — is the highest-leverage piece of any engineering team’s workflow. It runs hundreds of times a day per developer. Shaving seconds off it compounds into hours; lengthening it by minutes kills momentum.

What we used to optimise was the human’s cycle time. Faster compilers, hot reload, snappier test runners. The human stayed in the seat throughout, dispatching each step and reading the output.

That assumption no longer holds. When a coding agent sits in the loop, the unit of iteration shifts. The agent runs its own inner loop inside the developer’s outer loop — planning, querying the codebase, editing, checking the result, correcting itself — and the developer only sees the final diff. The properties that matter for the inner loop have changed accordingly. This post is about what that new inner loop looks like and what you have to build to make it work in a real polyglot codebase.

[Figure: Classic vs agent-mediated inner loop. Left: the classic developer inner loop (Edit, Build, Test, Debug), where the developer drives every step and a cycle takes minutes. Right: the agent-mediated loop (Prompt, Plan, Query, Edit, Verify, Review), where the developer states intent, the agent drives the inner steps in seconds per iteration, and a self-correction sub-loop against the substrate and sensors runs many times before the developer inspects the diff.]

[Figure: Agentic polyglot development architecture (AST substrate, sensor architecture, self-correcting LLM agents). Layers, top to bottom:]

  • LLM coding agent: plans, queries the substrate, calls fix tools, self-corrects on sensor feedback (violation + rationale + fix hint).
  • Codemod engines (deterministic fixes): OpenRewrite (LSTs), ast-grep, jscodeshift, ts-morph, GritQL.
  • Computational sensors (deterministic, fast, cheap to run repeatedly): layer / dependency rules (ArchUnit, dependency-cruiser, OPA); coupling / structure metrics (DSM, fan-in/out, cycles); mutation testing (Stryker, PIT, mutmut); lint and agent-tuned rules (ESLint, Semgrep, Factory-AI); pre-commit sensors (GitLeaks, type-check, format); type / compile checks (tsc, mypy, go vet, scip); architectural fitness functions (naming, API surface, cross-cutting auth/logging/PII, cyclomatic complexity, file/function size, argument count).
  • Inferential sensors (LLM-led, semantic, slower cadence): modularity skills (Khononov-style design review); security / AppSec review (checklist-prompted audits); data-handling review (PII flow, logging policy); dependency freshness (deprecations, upgrade paths); duplication / drift audit ("garbage collection" reviews); PR impact interpretation (blast radius to reviewer focus).
  • Hybrid pattern: deterministic CLI output grounds the LLM's analysis, which avoids hallucination and is cheaper than full repo browsing.
  • AST MCP server and graph substrate: a unified polyglot query model exposed as MCP tools (structural queries, layer / dependency rules, diff-scoped analysis, blast radius / callers, threshold and baseline drift, suppression log, cross-cutting audits, architectural fitness API). Every response carries the architectural rationale, not just the violation.
  • Indexing substrate (the polyglot translation layer): tree-sitter, SCIP (scip-java/ts/python/go), LSP servers, OpenRewrite LSTs, dependency manifests, Git history / blame.
  • Polyglot codebase: Java, TypeScript, Go, Python, Rust, C#, Kotlin, Scala, Ruby, … plus IaC, YAML, SQL, proto.
  • Sensor cadence: during session (in-IDE, sub-second: type-check, lint, MCP queries); pre-commit hooks (last fast gate before push: GitLeaks, quick fitness); CI / PR gate (clean-infra confirmation: full suite, mutation testing); scheduled drift checks ("garbage collection" cadence: modularity skills, dependency freshness, AI audits); production runtime (OPA, traces, SLO-based fitness signals).

Key principles from the figure:

  • Self-correction guidance in every error response
  • Threshold bump, not binary suppress
  • Hybrid: deterministic data grounds LLM analysis
  • Diff-scoped beats whole-repo for per-PR signal
  • Watch feedback overload; avoid refactor spirals
  • Suppression log is the review starting point for humans
  • Computational sets the floor; inferential the ceiling
  • AST normalizes the polyglot query model
  • Plan with MCP; apply with deterministic codemods

Sources synthesized: Evolutionary Architecture (Ford et al.) · Maintainability Sensors for Coding Agents (Böckeler / Thoughtworks) · Modularity Skills (Khononov) · OpenRewrite LSTs (Moderne) · tree-sitter · SCIP code intelligence

In the classic loop, latency from action to feedback is what mattered. The developer initiated each step and parsed the output with eyes and a working memory.

In the agent-mediated loop, three things shift:

The agent’s cycle time is what we optimise. The developer sees minutes between prompts; the agent runs many self-correction iterations between those prompts. If each of those inner cycles is slow or noisy, the agent wastes the developer’s wall-clock time and tokens.

Feedback has to be machine-actionable, not just human-readable. A human can squint at a stack trace and infer the fix. An agent benefits from structured output with explicit hints. The single highest-leverage change you can make in this whole space is to turn your tool error messages into prompts — small structured payloads that tell the agent not just what’s wrong, but why the rule exists and how to fix it.

Generic tooling is wasteful. Asking an agent to find every caller of a function by grepping is precise the way a sledgehammer is precise. It loads a lot of irrelevant context into the agent’s working set, eats tokens, and gets things wrong on overloaded names. The same task takes one tool call against a code-intelligence index and returns exactly what’s needed.
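To make the grep-versus-index difference concrete, here is a minimal sketch using Python's standard `ast` module as a stand-in for a real code-intelligence index. The sample source and the function names `grep_callers` and `ast_callers` are made up for illustration; a production substrate would be type-aware in a way this syntactic check is not.

```python
import ast
import re

# Hypothetical sample: one free function `save`, one unrelated method `save`.
SOURCE = '''
def save(order):
    return order

class Cache:
    def save(self, key):      # unrelated method with the same name
        return key

def place_order(order, cache):
    # remember to save(order) before returning
    cache.save("last")
    return save(order)
'''


def grep_callers(source: str, name: str) -> list[int]:
    """Line numbers where the text `name(` appears, roughly what grep reports."""
    pattern = re.compile(rf"\b{re.escape(name)}\(")
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


def ast_callers(source: str, name: str) -> list[int]:
    """Line numbers of call expressions whose callee is the bare name `name`."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == name
    )


if __name__ == "__main__":
    print("text search:", grep_callers(SOURCE, "save"))
    print("structural :", ast_callers(SOURCE, "save"))
```

The text search reports both definitions, the comment, and the unrelated `cache.save` call alongside the one real caller. The structural query returns only the call site, which is the only line the agent needs in its context.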

These three shifts drive the architecture below. The agent needs a substrate it can ask precise questions of, sensors that give it actionable feedback, and codemods it can dispatch when it knows what change to make. Skills package the workflows that compose these pieces. The rest of this post is about each layer.

The substrate is the layer that lets the agent ask semantic questions about the code without parsing files itself. For a polyglot codebase this is non-negotiable — the agent needs one query language that works across Rust, TypeScript, Python, Go, and whatever else lives in the repo.

The current state of the art is built on three components:

Tree-sitter parsers provide a normalised AST per language. There are grammars for everything reasonable, and they all expose a consistent node API. An AST-backed MCP server (like symgraph) sits on top of these and exposes the queries an agent actually needs: find all callers of this function, show me every implementation of this interface, what would break if I changed this region of code. These queries are what the agent should reach for instead of grep.

SCIP (a recursive acronym: SCIP Code Intelligence Protocol) is the modern successor to LSIF. Per-language indexers such as scip-java, scip-typescript, scip-python, and scip-go, plus rust-analyzer's SCIP output for Rust, emit a common protobuf format with type-attributed symbol references. With SCIP in the substrate, “find references” can work across language boundaries wherever the indexes agree on symbol identity (for example through generated clients or a shared schema): a TypeScript call into a function defined in a Python service can then be resolved correctly.

A baseline store records architecture-level metrics over time. Coupling, fan-in/fan-out, layering compliance, complexity per module. Most existing fitness function tools are stateless — they answer “is the codebase compliant right now” — but the more useful question for agent workflows is “did this PR make any of our metrics worse, and which ones got worse on purpose?” That diff is the most actionable signal you can give to a per-commit gate.

These three together form a uniform graph that any sensor or codemod can query. The architectural payoff is that adding a new fitness function or refactoring recipe means writing one query against the substrate, not one tool per language.
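The baseline-store question ("did this PR make any metric worse, and which regressions were on purpose?") can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any existing tool: the metric names are invented, every metric is assumed to be "lower is better", and `acknowledged` stands in for whatever sign-off mechanism a team uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    metric: str
    baseline: float
    current: float
    acknowledged: bool  # True when the regression was signed off on purpose


def diff_against_baseline(baseline, current, acknowledged=frozenset()):
    """Return every metric that got worse (higher) than its baseline value.

    Assumes lower is better for all metrics; metrics missing from the
    baseline are skipped because there is nothing to compare against.
    """
    drifts = []
    for metric, now in sorted(current.items()):
        before = baseline.get(metric)
        if before is None or now <= before:
            continue
        drifts.append(Drift(metric, before, now, metric in acknowledged))
    return drifts


def gate_passes(drifts) -> bool:
    """The per-PR gate fails only on regressions nobody acknowledged."""
    return all(d.acknowledged for d in drifts)


if __name__ == "__main__":
    baseline = {"orders.fan_out": 4, "orders.cycles": 0, "billing.complexity": 12}
    current = {"orders.fan_out": 6, "orders.cycles": 0, "billing.complexity": 13}
    drifts = diff_against_baseline(baseline, current, {"billing.complexity"})
    for d in drifts:
        print(d)
    print("gate passes:", gate_passes(drifts))
```

Separating "worse" from "worse on purpose" is the design choice that matters here: the acknowledged regressions still show up in the diff for a human reviewer, but only the unacknowledged ones block.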

Birgitta Böckeler’s Sensors for Coding Agents provides the cleanest taxonomy I’ve seen for this layer. Sensors split along two axes: computational vs inferential (deterministic tools vs LLM-led analyses), and where they fire in the lifecycle (session, pre-commit, CI, scheduled, runtime).

Computational sensors are the tools you already know: type checkers, linters, fitness function suites like ArchUnit and dependency-cruiser, SAST scanners like Semgrep, mutation testers, GitLeaks pre-commit hooks, OPA for infra policy. They run fast and produce deterministic verdicts.

For agents, the trick is what you do with their output. A standard ESLint message (“no-unused-vars”) is fine for a human; it’s a missed opportunity for an agent. Rewrite it:

{
  "rule": "no-unused-vars",
  "violation": "the variable 'session' is declared but never read",
  "location": "src/auth/middleware.ts:42:9",
  "rationale": "unused declarations in middleware often indicate dead error handling paths or forgotten wiring",
  "fix_hint": "either remove the declaration or wire it into the response context",
  "severity": "warn",
  "suppressible": true,
  "suppression_pattern": "// eslint-disable-next-line no-unused-vars -- <reason>"
}

That payload is a self-contained prompt. The agent doesn’t need to look up what the rule means or guess the fix: the message carries the architectural rationale and a concrete pattern for either complying or suppressing with a documented reason. The other lever Böckeler’s article calls out is to make rules threshold-bumpable, not just suppressible. For continuous metrics like cyclomatic complexity, the agent should be able to raise the threshold by one with a comment. The constraint survives, the rule fires again if the metric gets worse, and the agent isn’t forced into pointless refactors when the alternative is a documented exception.
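One way to produce a payload like the one above is a thin wrapper over the linter's machine-readable output. The sketch below assumes the shape of ESLint's JSON formatter (a `ruleId`, `message`, `severity`, `line`, and `column` per message); the `RULE_GUIDANCE` table is a hypothetical, team-maintained catalogue, and treating every rule as suppressible is a simplification.

```python
import json

# Hypothetical catalogue a team would maintain next to its lint config.
RULE_GUIDANCE = {
    "no-unused-vars": {
        "rationale": "unused declarations in middleware often indicate dead "
                     "error handling paths or forgotten wiring",
        "fix_hint": "either remove the declaration or wire it into the "
                    "response context",
    },
}

# ESLint reports severity as 1 (warning) or 2 (error).
SEVERITY = {1: "warn", 2: "error"}


def enrich(file_path: str, message: dict, guidance: dict = RULE_GUIDANCE) -> dict:
    """Turn one lint message into a payload an agent can act on directly."""
    rule = message.get("ruleId") or "parse-error"  # ruleId is null on parse errors
    extra = guidance.get(rule, {})
    return {
        "rule": rule,
        "violation": message["message"],
        "location": f'{file_path}:{message["line"]}:{message["column"]}',
        "rationale": extra.get("rationale", "no rationale recorded for this rule"),
        "fix_hint": extra.get("fix_hint", "no fix hint recorded for this rule"),
        "severity": SEVERITY.get(message["severity"], "warn"),
        "suppressible": True,  # simplification: a real catalogue would decide per rule
        "suppression_pattern": f"// eslint-disable-next-line {rule} -- <reason>",
    }


if __name__ == "__main__":
    raw = {
        "ruleId": "no-unused-vars",
        "severity": 1,
        "message": "'session' is defined but never used.",
        "line": 42,
        "column": 9,
    }
    print(json.dumps(enrich("src/auth/middleware.ts", raw), indent=2))
```

Rules without an entry in the catalogue still produce a valid payload, so the wrapper can be rolled out before every rule has a written rationale.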

Inferential sensors are the interesting category. These are LLM-led reviews that run on a slower cadence and produce qualitative findings: modularity assessments, security reviews, data-handling audits, anti-pattern detection. Vlad Khononov’s Modularity Skills are an excellent reference for what this looks like in practice.

The non-obvious result from Böckeler’s experiments is that raw coupling metrics fed to an LLM produce worse findings than an LLM-led review that interprets the code itself. The deterministic data is most useful as grounding for the inferential pass, not as a direct signal. So the architecture is:

  1. A computational sensor extracts structured data (coupling matrices, fan-in/fan-out, dependency lists)
  2. An inferential skill is invoked with both the code and that data as context
  3. The skill produces findings grounded in both

This pattern — deterministic substrate + inferential interpretation — is the highest-leverage thing you can do beyond basic fitness functions.
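The three steps above can be sketched as a small pipeline: compute coupling numbers deterministically, then hand them to the inferential pass together with the code. The function names, the edge-list input, and the prompt wording are all illustrative assumptions; the call to the model itself is left out.

```python
from collections import Counter


def coupling_metrics(edges):
    """Step 1: deterministic fan-in / fan-out per module.

    `edges` is a list of distinct (source_module, target_module) pairs,
    e.g. as exported from a dependency graph.
    """
    fan_out = Counter(src for src, _ in edges)
    fan_in = Counter(dst for _, dst in edges)
    modules = sorted(set(fan_out) | set(fan_in))
    return {m: {"fan_in": fan_in[m], "fan_out": fan_out[m]} for m in modules}


def build_review_prompt(code_excerpt: str, metrics: dict) -> str:
    """Step 2: give the inferential skill both the code and the measurements."""
    lines = [
        "You are reviewing module boundaries.",
        "",
        "Measured coupling (deterministic, from the dependency graph):",
    ]
    for module, values in metrics.items():
        lines.append(
            f"- {module}: fan-in {values['fan_in']}, fan-out {values['fan_out']}"
        )
    lines += [
        "",
        "Code under review:",
        code_excerpt,
        "",
        "Cite the measurements above for every finding; "
        "do not estimate coupling yourself.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    edges = [("orders", "db"), ("orders", "billing"), ("billing", "db")]
    prompt = build_review_prompt("fn place_order() { /* ... */ }", coupling_metrics(edges))
    print(prompt)  # step 3 would send this to the model and collect findings
```

The numbers are context for the review, not its verdict, which matches the observation that raw metrics alone produce worse findings than a review that also reads the code.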

[Figure: The self-correction feedback loop. The agent writes code (edits the file, saves); the sensor evaluates (the rule fires and builds a payload) and writes feedback; the agent self-corrects, and the cycle repeats. The sensor response payload shown in the callout:]

{
  "violation": "domain layer imports infrastructure",
  "location": "src/order/place_order.rs:14",
  "rationale": "hexagonal architecture forbids outward dependencies from domain to adapters",
  "fix_hint": "inject the repository through a trait defined in the domain layer",
  "severity": "block",
  "suppressible": true,
  "docs": "./fitness/hexagonal.md#layering"
}

The error message is the prompt: the rationale and fix hint let the agent correct without another round trip.

Different sensors belong at different points in the lifecycle. Böckeler’s framing here is useful: session-level sensors give the agent feedback during the edit; pre-commit sensors catch quick problems before they hit CI; CI sensors run the expensive things; scheduled sensors detect drift in the slow time horizon; runtime sensors gate production.

[Figure: Sensor cadence across the development lifecycle. Five swimlanes, each listing the sensors that typically run there; the legend distinguishes computational sensors, inferential (LLM-led) sensors, and runtime gates.]

  • Session (live in editor): type checker, ESLint, dependency-cruiser, AST MCP query, Semgrep, tests for changed code, ArchUnit, fitness functions
  • Pre-commit: GitLeaks, fast fitness checks, formatter / lint, conventional commits
  • CI (on push or PR): full test suite, mutation testing, full fitness suite, SAST scan, coverage, PR impact analysis, baseline diff, build artifact
  • Scheduled (cron, drift detection): modularity review, security review, data handling audit, dependency refresh, anti-pattern scan, drift baseline diff, test quality audit
  • Runtime (production gates): OPA admission, perf budget alerts, synthetic checks, error budget gates

The two new positions worth thinking about are scheduled inferential reviews and baseline drift. Most teams don’t run a weekly modularity audit or compare current architecture metrics against last month’s baseline. Adding those captures the slow-burn problems that no per-PR gate catches — the gradual accumulation of cross-cutting hacks, the silent decay of test quality, the slow drift away from the architecture you said you had.

When the agent knows what to change, it shouldn’t make the change by editing text. Codemod engines exist for this:

  • OpenRewrite uses Lossless Semantic Trees (LSTs) that preserve type information and formatting. Several thousand recipes are available off the shelf for Java, Python, Terraform, Kubernetes manifests, and more. Recipes can serve as both fitness function and auto-fix in one shot.
  • ast-grep is the polyglot tree-sitter-based equivalent. Pattern-based search and rewrite across twenty-plus languages, with YAML rules that an LLM can author.
  • jscodeshift and ts-morph are the JavaScript/TypeScript-specific choices for type-aware refactoring.
  • GritQL is a declarative cross-language pattern language similar in spirit to Semgrep but oriented toward rewrites.

The pattern that emerges is: agent plans the refactor by querying the substrate, dispatches deterministic codemods to do the actual transform, then re-queries the sensors to verify the change didn’t drift the architecture. The LLM is doing orchestration; the codemod is doing the surgery. This loop is more reliable than letting the agent edit text directly.

Put the layers together and the ecosystem looks like this:

[Figure: The agentic polyglot development ecosystem. A coding agent (plans, queries, edits) connects bidirectionally to the computational and inferential sensor categories and one-way to the codemod engines. All three sit above a substrate of AST MCP server, code intelligence indexes, and a baseline metric store.]

  • Computational sensors (live in session and CI): type checker; ESLint plus plugins; Semgrep, SAST; dependency-cruiser; ArchUnit, NetArchTest; mutation testing; GitLeaks (pre-commit); OPA (infra policy); test coverage
  • Inferential sensors (scheduled drift checks): modularity skills, security review, data handling review, dependency freshness, PR impact analysis, cross-cutting audits, anti-pattern detection, architecture drift
  • Codemod engines (invoked by agent): OpenRewrite (LSTs); ast-grep (tree-sitter); jscodeshift, ts-morph; GritQL patterns; Hypermod, Codemod CLI; IDE refactor APIs; Moderne platform
  • AST MCP server (your tool): tree-sitter parsers, cross-language queries, self-correction messages
  • Code intelligence (semantic substrate): SCIP indexes (scip-java, scip-ts), optional Neo4j graph
  • Baseline store (drift detection): metric history, per-PR diff tracking, suppression telemetry

The bidirectional arrows to the sensor categories carry the load-bearing insight: sensors aren’t just checks, they’re the feedback channel that closes the agent’s inner loop. The one-way arrow to codemods reflects that codemods are dispatched for execution and don’t need to return rich guidance — the verification happens by re-running sensors after the transformation.

Everything sits on a substrate that exposes consistent semantic queries. That uniformity is what makes adding a new fitness function tractable in a polyglot repo: write the query once against the substrate, not once per language.

A skill is a folder with a SKILL.md file — YAML frontmatter and markdown instructions that an agent loads when relevant. It’s the unit of agent capability that gets shared across teams.

A minimal skill for the layered-architecture fitness function looks like this:

---
name: enforce-layered-architecture
description: |
  Use when the user asks to verify, fix, or review layering of a Rust
  service against the hexagonal architecture pattern. Queries symgraph
  to detect domain-to-infrastructure imports and proposes inversions.
---

# Enforce layered architecture

## Inputs
- A repository with a `crates/` layout
- A `.architecture/layers.toml` declaring layer names and direction

## Procedure
1. Call symgraph-search to enumerate modules per layer
2. For each module, call symgraph-callees to identify outbound dependencies
3. Compare each edge against the declared direction
4. For violations, propose a trait-inversion fix using ast-grep recipes

## Output
For each violation: location, rationale, proposed inversion, severity.
Use the rich payload format (violation + rationale + fix_hint) so the
agent can self-correct.

That’s the skill. It’s a couple of dozen lines of instructions, references the substrate (symgraph), and points at a codemod engine (ast-grep). It’s authored once and called by name whenever it’s relevant.
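Step 3 of the procedure, comparing each edge against the declared direction, is the deterministic core of the skill. A minimal sketch follows, assuming layers are listed from innermost to outermost and that a module's layer is its first path segment; both conventions, and the contents of `LAYERS`, are hypothetical rather than the format of any real `layers.toml`.

```python
# Hypothetical layer declaration, innermost first. Dependencies may point
# inward (towards index 0) or stay within a layer, never outward.
LAYERS = ["domain", "application", "infrastructure"]


def layer_of(module: str, layers=LAYERS):
    """Assume the layer is the first path segment, e.g. 'domain::order'."""
    head = module.split("::", 1)[0]
    return head if head in layers else None


def find_violations(edges, layers=LAYERS):
    """Return a rich payload for every edge that points outward."""
    rank = {name: index for index, name in enumerate(layers)}
    violations = []
    for src, dst in edges:
        src_layer, dst_layer = layer_of(src, layers), layer_of(dst, layers)
        if src_layer is None or dst_layer is None:
            continue  # module outside the declared layers: nothing to check
        if rank[dst_layer] > rank[src_layer]:
            violations.append({
                "violation": f"{src_layer} layer imports {dst_layer}",
                "location": f"{src} -> {dst}",
                "rationale": f"dependencies must point inward; {src_layer} "
                             f"may not depend on {dst_layer}",
                "fix_hint": f"define a trait in {src_layer} and implement "
                            f"it in {dst_layer}",
                "severity": "block",
            })
    return violations


if __name__ == "__main__":
    edges = [
        ("domain::order", "infrastructure::postgres"),   # outward: violation
        ("application::checkout", "domain::order"),      # inward: allowed
        ("infrastructure::postgres", "domain::order"),   # inward: allowed
    ]
    for finding in find_violations(edges):
        print(finding)
```

In the skill, the edge list would come from the substrate (the symgraph-callees call in step 2) rather than being written by hand.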

Here’s the problem nobody talks about enough. A skill like the one above is specific. It’s a Rust skill that assumes hexagonal architecture, a crates/ layout, and ast-grep recipes that probably depend on Rust’s syntax. It will not help a Spring Boot service. It will not help a Django app. It will not help a Go monorepo.

The same applies to almost every useful skill:

  • Testing skills depend on the test runner (cargo test, pytest, vitest, mvn test, go test).
  • Build skills depend on the build tool (cargo, maven, gradle, npm, pnpm, bazel).
  • Refactoring skills depend on the codemod engine, which depends on the language.
  • Convention skills depend on the framework (React vs Vue vs Angular vs Svelte; Spring vs Quarkus vs Micronaut).

The combinatorics get out of hand quickly. N languages × M frameworks × P build tools × Q architectural styles. A single team’s useful skill catalogue might be ten or twenty skills. A platform team supporting an organisation’s full polyglot estate is suddenly looking at hundreds.

[Figure: Skill dependency fan-out. A single skill the developer asks for (refactor-to-hexagonal) branches into five categories of dependency, each of which must be specialised per language, framework, or build tool:]

  • MCP servers (substrate access; grows per substrate): symgraph, scip-loader, dep-cruiser-mcp, arch-rules-mcp, ...
  • Codemods (apply side; grows per language): OpenRewrite, ast-grep, jscodeshift, ts-morph, GritQL
  • Language tools (linters, checkers; grows per language): tsc, eslint; cargo check, clippy; mypy, ruff; go vet, staticcheck; spotbugs, checkstyle
  • Framework rules (conventions, idioms; grows per framework): React, Next.js; Spring, Quarkus; Django, FastAPI; Rails, Phoenix; Axum, Actix
  • Build tools (tasks, hooks; grows per build tool): cargo; maven, gradle; npm, pnpm, bun; go, bazel; nx, turbo

With N languages × M frameworks × P build tools × Q architectural styles and no registry or plugin system, this becomes unmanageable:

  • Skills duplicate effort per language / framework / build combo
  • MCP server dependencies aren't declared; users install manually
  • Updating a shared rule means editing every variant by hand
  • No standard discovery: finding the right skill becomes folklore
  • Version drift between a skill and its dependencies breaks silently

Worse, skills have dependencies that they can’t currently declare. The hexagonal architecture skill above needs symgraph installed, ast-grep available, and a Rust toolchain on the path. Today, the way users satisfy those dependencies is “read the skill’s prerequisites section and install everything by hand.” That doesn’t scale.

This is the moment in the maturation of a developer ecosystem when package managers, registries, and dependency declarations stop being optional and become structural.

Claude Code’s plugin system is one answer to this. The hierarchy is:

  • A skill is a folder with SKILL.md.
  • A plugin is a packaging wrapper — .claude-plugin/plugin.json — that bundles one or more skills, plus optional commands, hooks, sub-agents, and MCP server declarations.
  • A marketplace is a Git repository with .claude-plugin/marketplace.json listing plugins available for install. Users add the marketplace once and install plugins from it by name.

This gets you a real distribution channel. A platform team publishes a marketplace; teams across the organisation install the plugins they need; updates propagate via a single marketplace update rather than N hand-edits across repos.

The structure for a personal or team-level marketplace is small enough to fit in your head:

your-marketplace/
├── .claude-plugin/
│   └── marketplace.json
├── plugins/
│   └── fitness-functions/
│       ├── .claude-plugin/
│       │   └── plugin.json
│       └── skills/
│           ├── layered-architecture/
│           │   └── SKILL.md
│           └── modularity-review/
│               └── SKILL.md
└── README.md

The marketplace.json catalogues plugins; each plugin’s plugin.json describes what’s in it; each skill’s SKILL.md describes when to use it. End users run /plugin marketplace add <owner>/<repo> once, then /plugin install <plugin>@<marketplace> per plugin.
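As a rough illustration of what goes into those files, the sketch below scaffolds the directory tree shown above. The manifest field names (`name`, `owner`, `plugins`, `source`, `description`, `version`) reflect my reading of the plugin documentation and should be checked against the current schema before relying on them; `scaffold_marketplace` and its arguments are hypothetical.

```python
import json
from pathlib import Path


def write_json(path: Path, data: dict) -> None:
    """Write a manifest, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")


def scaffold_marketplace(root: Path, marketplace: str, owner: str,
                         plugin: str, skills: list[str]) -> None:
    """Create a one-plugin marketplace with a stub SKILL.md per skill."""
    plugin_dir = root / "plugins" / plugin

    # Catalogue of plugins (field names are an assumption, see lead-in).
    write_json(root / ".claude-plugin" / "marketplace.json", {
        "name": marketplace,
        "owner": {"name": owner},
        "plugins": [{
            "name": plugin,
            "source": f"./plugins/{plugin}",
            "description": "Architecture fitness-function skills",
        }],
    })

    # Per-plugin manifest describing what the plugin contains.
    write_json(plugin_dir / ".claude-plugin" / "plugin.json", {
        "name": plugin,
        "description": "Architecture fitness-function skills",
        "version": "0.1.0",
    })

    for skill in skills:
        skill_md = plugin_dir / "skills" / skill / "SKILL.md"
        skill_md.parent.mkdir(parents=True, exist_ok=True)
        skill_md.write_text(
            f"---\nname: {skill}\n"
            "description: TODO describe when the agent should use this skill\n"
            "---\n",
            encoding="utf-8",
        )


if __name__ == "__main__":
    scaffold_marketplace(
        Path("your-marketplace"), "your-marketplace", "Platform Team",
        "fitness-functions", ["layered-architecture", "modularity-review"],
    )
```

Running it produces the same layout as the tree above, minus the README, with placeholder descriptions to fill in.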

This solves discovery, versioning, and updates. What it doesn’t solve directly is the dependency on external MCP servers — which is where the next piece comes in.

A skill that depends on symgraph or any other MCP server has historically had no way to declare that dependency machine-readably. The skill says “install symgraph first” in prose; the user reads the prose, fetches symgraph from its own release page, installs it, configures their MCP client, then runs the skill.

That manual chain is fine for one skill. It’s a wet sock when you’re trying to ship a marketplace of fifty.

The piece that completes the picture is a delivery tool for the MCP servers themselves: something analogous to npx but for compiled binaries. Call it bx. It does for native binaries what npx does for Node packages: specify the binary, fetch the right asset from a GitHub release for the current platform, cache it, and exec it with stdio passthrough.

That last property is what makes it useful in MCP configs:

{
  "mcpServers": {
    "symgraph": {
      "command": "bx",
      "args": ["run", "grahambrooks/symgraph@^v2026.4", "--", "serve"]
    }
  }
}

The user installs bx once. Every MCP server is then a one-line config entry that handles fetching, caching, version pinning, and execution.
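The "fetch the right asset for the current platform" step is the part of such a tool that is easiest to get subtly wrong, so here is a sketch of one plausible approach. It is not how bx itself is implemented; the alias tables and the asset naming scheme (`<tool>-<os>-<arch>.<ext>`) are assumptions for illustration, and real releases vary.

```python
import platform

# Normalise the many spellings of the same OS / CPU architecture.
OS_ALIASES = {"linux": "linux", "darwin": "macos", "windows": "windows"}
ARCH_ALIASES = {
    "x86_64": "x86_64", "amd64": "x86_64",
    "arm64": "aarch64", "aarch64": "aarch64",
}


def pick_asset(asset_names, system=None, machine=None):
    """Return the release asset matching this platform, or None.

    `system` and `machine` default to the running platform; passing them
    explicitly makes the selection testable.
    """
    os_name = OS_ALIASES.get((system or platform.system()).lower())
    arch = ARCH_ALIASES.get((machine or platform.machine()).lower())
    if os_name is None or arch is None:
        return None  # unknown platform: refuse rather than guess
    for name in asset_names:
        lowered = name.lower()
        if os_name in lowered and arch in lowered:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical asset names for a release.
    assets = [
        "symgraph-linux-x86_64.tar.gz",
        "symgraph-macos-aarch64.tar.gz",
        "symgraph-windows-x86_64.zip",
    ]
    print(pick_asset(assets))
```

Returning `None` for an unrecognised platform, instead of falling back to a default, keeps the failure visible at install time rather than at exec time.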

The truly interesting move is the skill-level integration. Skills declare their MCP server dependencies in frontmatter:

---
name: explore-code
description: Use when navigating an unfamiliar codebase.
mcp_servers:
  - grahambrooks/symgraph@^v2026.4
  - someorg/dep-cruiser@^v1
---

When the skill loads, the runner calls bx ensure --skill ./skill-dir/, which resolves and caches every declared dependency before the skill runs. Skills now compose with their dependencies attached. A marketplace becomes self-contained: install a plugin, every skill in it just works.
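A runner needs very little machinery to read those declarations. The sketch below extracts the `mcp_servers` list from a SKILL.md using only the standard library; it handles just the flat frontmatter shape shown above, where a real runner would use a proper YAML parser. The function names are illustrative, and `mcp_servers` is the key proposed in this post rather than an established standard.

```python
def declared_mcp_servers(skill_md: str) -> list[str]:
    """Return the entries under `mcp_servers:` in a SKILL.md frontmatter block."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter at all

    servers = []
    in_list = False
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        if line.startswith("mcp_servers:"):
            in_list = True
            continue
        if in_list:
            stripped = line.strip()
            if stripped.startswith("- "):
                servers.append(stripped[2:].strip())
            elif stripped and not line[0].isspace():
                in_list = False  # the next top-level key ends the list
    return servers


def split_spec(spec: str) -> tuple[str, str]:
    """Split 'owner/repo@constraint' into its repository and version constraint."""
    repo, _, constraint = spec.partition("@")
    return repo, constraint


if __name__ == "__main__":
    skill = """---
name: explore-code
description: Use when navigating an unfamiliar codebase.
mcp_servers:
  - grahambrooks/symgraph@^v2026.4
  - someorg/dep-cruiser@^v1
---

# Explore code
"""
    for spec in declared_mcp_servers(skill):
        print(split_spec(spec))
```

Each `(repo, constraint)` pair is what a resolver would then fetch and cache before the skill's first tool call.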

This is the package-manager moment for the skill ecosystem. Without it, every useful skill carries a “first, install these…” preamble. With it, skills are real composable units.

Concretely, here’s what an inner-loop iteration looks like with all of this wired up. A developer says “refactor the order placement flow to remove the direct database call from the domain layer”. The agent:

  1. Plans. Reads the architecture declaration. Identifies the violation pattern: domain code importing from infrastructure.
  2. Queries the substrate. Calls symgraph-search for the order placement module. Calls symgraph-callees on each function to find what it depends on. The substrate returns a structured graph slice, not a pile of text.
  3. Identifies sites. Picks out three call sites where the violation occurs.
  4. Plans the transformation. For each site, decides the inversion: extract a repository trait into the domain layer, change the call site to depend on the trait, move the concrete implementation to the infrastructure layer.
  5. Dispatches codemods. Calls ast-grep recipes — one to extract the trait, one to rewrite the call site, one to relocate the implementation. The codemod operates on the AST, not text.
  6. Verifies. Re-runs the layered-architecture fitness function. The sensor returns no violations. Re-runs tests; they pass.
  7. Returns the diff to the developer. Three commits, all green.

If any sensor had fired during step 6, its rich payload — violation, rationale, fix hint — would let the agent loop back to step 4 and try again, without involving the developer. The developer only sees the final result.
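The loop from step 6 back to step 4 can be sketched as a small driver with a retry budget. The three callables (`plan`, `apply`, `verify`) are placeholders for the agent's planning step, the codemod dispatch, and the sensor run; none of them corresponds to a real API, and the cap of three attempts is an arbitrary assumption to keep the agent out of refactor spirals.

```python
from typing import Callable


def self_correct(
    plan: Callable[[list], object],
    apply: Callable[[object], None],
    verify: Callable[[], list],
    max_attempts: int = 3,
) -> dict:
    """Plan, apply, verify; feed sensor findings back into the next plan.

    `verify` returns a list of rich payloads (violation, rationale,
    fix_hint); an empty list means every sensor is green.
    """
    feedback: list = []
    for attempt in range(1, max_attempts + 1):
        change = plan(feedback)   # step 4: plan, informed by prior findings
        apply(change)             # step 5: dispatch the codemods
        feedback = verify()       # step 6: re-run sensors and tests
        if not feedback:
            return {"status": "green", "attempts": attempt}
    # Budget exhausted: hand the remaining findings to the developer.
    return {"status": "needs_human", "attempts": max_attempts, "findings": feedback}


if __name__ == "__main__":
    # Fake sensor: fails once, then passes.
    results = iter([
        [{"violation": "domain layer imports infrastructure",
          "fix_hint": "inject the repository through a trait"}],
        [],
    ])
    outcome = self_correct(
        plan=lambda findings: {"addresses": [f["fix_hint"] for f in findings]},
        apply=lambda change: None,
        verify=lambda: next(results),
    )
    print(outcome)
```

The bounded budget is the important design choice: when the findings do not converge, the developer gets the outstanding payloads instead of an ever-growing diff.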

Done by hand in the classic inner loop, the same refactor took fifteen minutes, including two test cycles and a typo. The agent-mediated loop took ninety seconds, with the developer in the loop only at the prompt and the diff review.

If you want to build toward this, the order matters. Here’s what I’d prioritise:

First, the substrate. Stand up an AST MCP server for your codebase. symgraph is one option; you can also build your own around tree-sitter for the languages you actually use. Add a SCIP index. This gives you the query layer that everything else depends on. Without it, every skill ends up doing its own text parsing badly.

Second, the sensor payload format. Before writing more rules, fix the messages on the rules you have. Every tool output should carry violation, rationale, fix hint, severity, and suppression pattern in a structured shape. This is the single highest-leverage change for agent productivity.

Third, the baseline store. Even something as simple as committing a metrics.json per main-branch build, and diffing against it on PRs, gives you a per-PR drift signal that’s far more actionable than absolute thresholds.

Fourth, package what you have as a marketplace. Even if it’s just three or four skills, the discipline of writing them as a plugin with a manifest forces a quality threshold and makes them reusable beyond their first author.

Fifth, solve the binary delivery problem. Whether that’s bx, Homebrew taps, MCP-bundle files, or just well-written install scripts, get the friction of “install three binaries before the skill works” down to zero. This is the part that determines whether skills actually compose for users who didn’t author them.

The deeper architectural point is that we’re building the same kind of structural scaffolding for agents that we built for humans over the last three decades — package managers, registries, dependency graphs, semantic queries, observability. The components look different (skills instead of libraries, MCP servers instead of services, fitness functions instead of unit tests) but the patterns repeat. The teams that get there first are the ones who treat agent productivity as an engineering discipline, not a magic spell.