Jon Hulsinger

Agents that plan, act, check, and retry are becoming a real engineering pattern. The shift is obvious: the magic is no longer just in a single prompt. It is in the system that keeps the model moving: gather context, make a change, run a tool, inspect the result, revise, and repeat.

That sounds elegant until the bill arrives.

Anthropic has reported that ordinary agent workflows can use roughly 4x the tokens of a chat interaction, while multi-agent research systems can use roughly 15x. That is not a universal law of coding agents, and Anthropic’s strongest results came from research tasks that parallelize well. But it is the right warning sign: autonomy is not free. Every retry, tool call, context refresh, and self-check has a cost.
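To make those multipliers concrete, here is a back-of-the-envelope sketch. Only the 4x and 15x ratios come from the figures above. The tokens per chat, the price per million tokens, and the daily task count are made-up placeholders, so treat the dollar amounts as illustrative rather than as real pricing.

```python
# Rough cost model. Only the 4x and 15x multipliers come from the text;
# every other number below is a hypothetical placeholder.
CHAT_TOKENS = 2_000            # assumed tokens in one chat interaction
PRICE_PER_MILLION_USD = 5.00   # assumed blended price per 1M tokens
TASKS_PER_DAY = 1_000          # assumed workload

MULTIPLIERS = {"chat": 1, "single_agent": 4, "multi_agent": 15}


def daily_cost_usd(multiplier: int) -> float:
    """Daily spend if every task costs `multiplier` times a chat interaction."""
    tokens = CHAT_TOKENS * multiplier * TASKS_PER_DAY
    return tokens / 1_000_000 * PRICE_PER_MILLION_USD


costs = {name: daily_cost_usd(m) for name, m in MULTIPLIERS.items()}
for name, cost in costs.items():
    print(f"{name:>12}: ${cost:,.2f} per day")
```

Whatever the real inputs are, the ratios carry through: under this simple model the multi-agent line always costs 15 times the chat line.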

The mistake is treating loops as magic. The opportunity is treating them as infrastructure.

The Trap: Bigger Loops Look Smarter

The trap is thinking the answer is always more: more agents, more context, more reflection, more delegation, more loops. Sometimes that works. A broad research task can benefit from multiple agents searching different branches at once. But plenty of engineering work does not look like that. Many coding tasks are narrow, sequential, and full of hidden constraints. Throwing an autonomous swarm at them can create the appearance of intelligence while quietly burning money, context, and trust.

The better pattern may be smaller.

AutoResearch Shows the Useful Shape

Karpathy’s AutoResearch is the useful reference point. It is not an agent vaguely “doing research.” It is a constrained loop. The agent edits a training file, runs a short experiment, checks a clear metric, keeps improvements, reverts failures, and repeats.

In the published setup, the loop is bounded: one file, short runs, one score. That is why it works as a signal. The system is not impressive because it wanders. It is impressive because it knows what counts as progress.
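The shape of that loop can be sketched in a few lines. This is not Karpathy's code: it is a toy under stated assumptions, where a single number named lr stands in for the training file, a hypothetical score function stands in for the short experiment, and a random nudge stands in for the agent's edit.

```python
import random


def score(params: dict) -> float:
    """Stand-in for 'run a short experiment, read one metric' (higher is better)."""
    return -(params["lr"] - 0.3) ** 2


def propose(params: dict, rng: random.Random) -> dict:
    """Stand-in for 'the agent edits the file': one small random change."""
    candidate = dict(params)
    candidate["lr"] = params["lr"] + rng.uniform(-0.1, 0.1)
    return candidate


def keep_or_revert_loop(start: dict, max_iters: int, seed: int = 0):
    rng = random.Random(seed)
    best, best_score = start, score(start)
    history = [best_score]
    for _ in range(max_iters):          # bounded: a fixed number of short runs
        candidate = propose(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score  # keep the improvement
        # Otherwise the candidate is discarded, which is the "revert" step.
        history.append(best_score)
    return best, best_score, history


best, best_score, history = keep_or_revert_loop({"lr": 0.9}, max_iters=50)
print(f"best score after {len(history) - 1} runs: {best_score:.4f}")
```

Because failures are always reverted, the recorded best score can never get worse from one iteration to the next. That property is what makes the single metric usable as a signal.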

That points to a different future for coding agents. Define the workflow first, then use the smallest capable model for each step.

Small Models Win When the Job Is Clear

The bigger opportunity is turning repeatable natural-language workflows into smaller models. A lot of software work starts as messy language but follows a defined path once you chart it.

Triage this issue. Classify the failure. Choose the next tool. Generate the test. Apply the migration. Verify the diff. Write the release note.

Lint repair, dependency updates, known refactors, and routine bug isolation often have stable inputs, constrained actions, and objective scoring. Once a workflow can be drawn as a decision chart, logged as input/output traces, and measured against a clear score, it probably deserves a specialized model.
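As a rough illustration of "logged as input/output traces, and measured against a clear score", the sketch below records each run as a small record and filters the passing ones into training pairs. The TraceRecord fields, the workflow names, and the sample data are all hypothetical, not a format the article prescribes.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TraceRecord:
    workflow: str      # hypothetical workflow name, e.g. "lint_repair"
    input_text: str    # what the step was given
    output_text: str   # what the step produced
    score: float       # objective result: 1.0 for pass, 0.0 for fail


def to_jsonl(records) -> str:
    """Serialize traces as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def training_examples(records, workflow: str, min_score: float = 1.0):
    """Keep only traces from one workflow that met the score threshold."""
    return [
        {"prompt": r.input_text, "completion": r.output_text}
        for r in records
        if r.workflow == workflow and r.score >= min_score
    ]


traces = [
    TraceRecord("lint_repair", "F401 unused import os in app.py", "remove line 1: import os", 1.0),
    TraceRecord("lint_repair", "E501 line too long in app.py", "delete the function", 0.0),
    TraceRecord("release_note", "diff: add retry flag", "Added a retry flag.", 1.0),
]
examples = training_examples(traces, "lint_repair")
print(to_jsonl(traces))
```

Only the scored, passing lint traces survive the filter, which is the point: the score decides what a specialized model gets to learn from.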

That is where smaller retrained models start to matter. A Qwen-class model, fully retrained or aggressively adapted around a narrow workflow, can have better economics than asking the largest model to rediscover the same procedure every run.

The goal is not to make a small model generally brilliant. The goal is to make it very good at one bounded job: stable inputs, limited moves, clear feedback, measurable success.

This is the part that gets lost in the hype: loop engineering is not just “let the model keep going.” It is deciding what the model should not have to think about.

The workflow should carry the structure. The model should handle the ambiguous language inside that structure.
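One way to picture that split: the workflow owns a fixed table of next steps, and the model is only asked to map messy text onto a closed set of labels. In this sketch stub_classifier is a hypothetical keyword stand-in for a small model call, and the labels and step names are invented for illustration.

```python
from typing import Callable

# The workflow's structure lives here, not in the model.
LABELS = ("lint", "dependency", "unknown")
NEXT_STEP = {
    "lint": "run_formatter",
    "dependency": "open_update_pr",
    "unknown": "escalate_to_human",
}


def stub_classifier(text: str) -> str:
    """Hypothetical stand-in for a small model that reads ambiguous language."""
    lowered = text.lower()
    if "unused import" in lowered or "lint" in lowered:
        return "lint"
    if "upgrade" in lowered or "dependency" in lowered:
        return "dependency"
    return "something else entirely"  # models can answer off-menu


def route(text: str, classify: Callable[[str], str] = stub_classifier) -> str:
    label = classify(text)
    if label not in LABELS:   # never let an off-menu answer pick the action
        label = "unknown"
    return NEXT_STEP[label]


print(route("CI fails: unused import in utils.py"))
print(route("Please upgrade requests to the latest version"))
print(route("The app feels slow sometimes"))
```

The model never chooses an action directly. It picks a label, and anything outside the closed set falls through to a human.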

Good loops reduce the problem before they spend tokens on it. They curate context. They retrieve just in time. They write state somewhere durable instead of relying on the conversation to remember everything. They make the action space smaller. They define a stop condition. They measure cost, latency, regressions, and success rate. They treat the model as one part of a production system, not the whole system.
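A minimal sketch of a few of those properties together: explicit stop conditions, a token budget, and state written to disk instead of kept in the conversation. The run_bounded_loop and fake_step names, the state fields, and the token counts are assumptions for illustration; a real step would call a model and tools.

```python
import json
import tempfile
from pathlib import Path


def run_bounded_loop(step, state_path: Path, max_steps: int, token_budget: int) -> dict:
    """Run `step` until success, the step limit, or the token budget is hit."""
    if state_path.exists():
        state = json.loads(state_path.read_text())   # resume from durable state
    else:
        state = {"steps": 0, "tokens_used": 0, "done": False}
    state["stop_reason"] = None

    while state["stop_reason"] is None:
        if state["done"]:
            state["stop_reason"] = "success"
        elif state["steps"] >= max_steps:
            state["stop_reason"] = "max_steps"
        elif state["tokens_used"] >= token_budget:
            state["stop_reason"] = "token_budget"
        else:
            done, tokens = step(state["steps"])
            state["steps"] += 1
            state["tokens_used"] += tokens
            state["done"] = done
        state_path.write_text(json.dumps(state))      # persist after every pass
    return state


def fake_step(step_index: int):
    """Hypothetical step: succeeds on its third call, costs 500 tokens each time."""
    return step_index >= 2, 500


with tempfile.TemporaryDirectory() as tmp:
    finished = run_bounded_loop(fake_step, Path(tmp) / "a.json", max_steps=10, token_budget=10_000)
    starved = run_bounded_loop(fake_step, Path(tmp) / "b.json", max_steps=10, token_budget=1_000)

print(finished["stop_reason"], finished["steps"], finished["tokens_used"])
print(starved["stop_reason"], starved["steps"], starved["tokens_used"])
```

The second run stops on budget before the task succeeds. That is the intended behavior: the loop reports why it stopped instead of quietly spending more.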

Verification Is the Steering Wheel

Verification is still the bottleneck. A loop is only as good as its stopping condition.

AI review can catch style issues, naming problems, obvious regressions, and some real bugs. It is weaker at subtle logic errors, race conditions, security boundaries, and architecture drift. Same-family maker/checker loops can share the same blind spots. Passing tests helps, but only if the tests encode the real contract.

“The agent approved it” is not proof. “The tests passed” is not proof either if the tests are incomplete.
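One hedged way to encode this is a merge gate that treats each signal as evidence rather than proof, and refuses to count a checker that shares a model family with the maker. The Evidence fields and the rules below are illustrative assumptions, not a complete verification policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    tests_passed: bool
    maker_family: str             # hypothetical label for the model that wrote the change
    checker_family: str           # hypothetical label for the model that reviewed it
    checker_approved: bool
    touches_sensitive_path: bool  # e.g. auth, billing, concurrency-heavy code
    human_approved: bool


def gate(e: Evidence) -> tuple[bool, list[str]]:
    """Return (allowed, reasons_blocked). Passing is necessary, not sufficient."""
    reasons = []
    if not e.tests_passed:
        reasons.append("tests failed")
    if not e.checker_approved:
        reasons.append("checker did not approve")
    elif e.checker_family == e.maker_family:
        reasons.append("checker shares a model family with the maker")
    if e.touches_sensitive_path and not e.human_approved:
        reasons.append("sensitive change needs human review")
    return (not reasons, reasons)


same_family = Evidence(True, "family_a", "family_a", True, False, False)
risky = Evidence(True, "family_a", "family_b", True, True, False)
clean = Evidence(True, "family_a", "family_b", True, False, False)

for name, evidence in [("same_family", same_family), ("risky", risky), ("clean", clean)]:
    print(name, gate(evidence))
```

Even the clean case only means no rule objected. If the tests do not encode the real contract, the gate inherits that gap.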

The Attack Surface Grows With the Loop

Security concerns also get sharper, not softer. An agent with shell access, repo access, credentials, browser access, and MCP tools is not just a chatbot. It is an actor inside your system.

Prompt injection is not solved. Tool poisoning is real. Credential sprawl gets easier when the instruction is “just make it work.” The more autonomy you give the loop, the more the boundaries matter.
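As one small example of a boundary, the sketch below checks a proposed shell command against an allowlist before anything runs. The allowed commands are hypothetical. This is not a sandbox and does not address prompt injection by itself; it only narrows what the loop is able to ask for.

```python
import shlex

# Hypothetical policy: program name -> allowed subcommands (None means any arguments).
ALLOWED_COMMANDS = {
    "git": {"status", "diff", "log"},
    "ls": None,
}
SHELL_METACHARS = set(";|&`$<>")


def is_allowed(command_line: str) -> bool:
    """Return True only for commands that match the allowlist exactly."""
    if any(ch in SHELL_METACHARS for ch in command_line):
        return False                      # no chaining, pipes, or substitution
    try:
        argv = shlex.split(command_line)
    except ValueError:                    # e.g. an unterminated quote
        return False
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        return False
    subcommands = ALLOWED_COMMANDS[argv[0]]
    if subcommands is None:
        return True
    return len(argv) > 1 and argv[1] in subcommands


for cmd in ["git status", "git push origin main", "ls; rm -rf /", "curl http://example.com"]:
    print(f"{cmd!r}: {'allow' if is_allowed(cmd) else 'deny'}")
```

The default is deny: anything the policy does not name is rejected, which matches the assumption that the agent will eventually try something strange.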

Build Loops Like Systems

So yes, build loops. But build them like engineering systems.

Short bursts. Narrow scope. Cheap iterations. Explicit state. Strong evals. Human review where judgment matters. Smaller models where the workflow is clear. Frontier models where the uncertainty actually justifies the cost.

The winners here will not simply have the biggest context windows or the flashiest demos. They will have tighter loops, cheaper specialized models, ruthless verification, and security boundaries that assume the agent will eventually do something strange.

Loop engineering is where the leverage is. It is also where the failures compound fastest.

The future is not one giant model thinking harder forever. It is smaller, sharper systems doing narrower jobs inside carefully designed loops.

The loop is the product now. Build it like production infrastructure, not a prompt experiment.

References

Originally published on X

Loop Engineering Is a Token Bonfire
