Understanding the Agent Loop: How Tool-Using LLM Systems Actually Work

If you are building with tool-calling models, the most important design decision is often not the prompt. It is the loop around the model. An LLM can decide it wants to use a tool, but it cannot execute that tool by itself. The surrounding application or SDK has to assemble context, inspect the model response, run tools, append results, and continue until a final answer is produced. That runtime cycle is the agent loop. This article explains what the agent loop actually is, where the model stops and the harness begins, how tool calling works step by step, and which engineering tradeoffs show up once you move beyond demos. An agent loop is the execution cycle that lets a model inspect context, request tools, observe results, and continue until it reaches a final answer. The model is only one part of the system. The harness or SDK owns orchestration: prompt assembly, tool execution, retries, approvals, and termination. State management matters as much as prompting. If you lose prior tool outputs or conversation continuity, the agent will behave like it forgot what just happened. Performance depends heavily on prompt growth control, stable prompt prefixes, caching, and bounded tool output. Safe agent design requires validation, approval gates for side effects, and clear rules for concurrency and history propagation. The core problem is simple: a one-shot model call cannot inspect the world, act on it, and adapt to the result unless something outside the model manages that cycle. That is the harness's job. OpenAI's Codex architecture describes a user interaction as a turn, but a single turn may contain multiple internal iterations of model inference and tool execution. The OpenAI Agents SDK describes the same idea directly: invoke the agent, check whether there is final output, handle handoffs if needed, otherwise execute tool calls and re-run. A practical mental model looks like this: Build the input state. Call the model. Inspect the response. If the model requested tools, validate and execute them. Append tool results back into context. Call the model again. Stop only when the model returns a final answer. That means the harness, not the model alone, is responsible for: Prompt assembly Message history management Tool schema registration Tool execution Validation and error handling Retry logic Approval workflows State persistence Loop termination This is why two systems using the same model can behave very differently. Their harnesses may make different decisions about context, tool ordering, truncation, approvals, and continuation. Before the loop can run, the system needs to define what the model sees. A typical turn includes: System or developer instructions Tool definitions or schemas Previous messages Previous tool-call results The current user request Sometimes environment state, session metadata, or hidden runtime instructions This matters because follow-up reasoning depends on prior observations being present. If the model requested a tool in one iteration and the result is not added back correctly, the next iteration cannot build on that work. There are really two loops to think about: Inner loop: model inference and tool execution inside a single user turn Outer loop: the broader multi-turn conversation across user follow-ups This distinction shows up clearly in Codex-style architectures. A user asks for something once, but the agent may internally perform several tool steps before replying. Then the next user message arrives, and the entire conversation thread continues from that accumulated state. That is why state continuity is not optional. Without it, the outer loop breaks and the inner loop starts reasoning from an incomplete view of reality. Once the harness provides the current turn state, the model has a decision boundary: answer directly, or request one or more tools. Tool calling works because the model is given structured tool definitions. Instead of producing only natural language, it can emit a structured request indicating which tool it wants and which arguments it wants to pass. At that point, the model is effectively yielding control back to the application. With custom tools, the client harness must take over, run the tool, and return the result. With hosted tools, more of that orchestration can happen inside the API itself. This is an important architectural choice: Tool type Who orchestrates execution? Main tradeoff Hosted tool API/runtime handles more of the loop Simpler orchestration, less direct control Custom function tool Client harness executes it More flexibility, more operational responsibility MCP tool Depends on integration and discovery flow Adds discovery and caching concerns The advantage of client-side orchestration is control. The cost is that you now own the failure modes. Once the model emits a tool request, the harness needs to do more than just run it. A safe harness should validate: Tool name Argument structure Argument types Permission rules Whether the tool is read-only or mutating This is not just a security concern. It is also a quality concern. If the model asks for a tool with invalid arguments, returning an explicit tool error often gives it enough signal to self-correct on the next loop iteration. The model needs a structured observation that closes the action-observation cycle. A minimal pattern looks like this: response = client.responses.create( input=initial_question, **MODEL_DEFAULTS, ) while True: function_responses = invoke_functions_from_response(response) if len(function_responses) == 0: print(response.output_text) break print("More reasoning required, continuing...") response = client.responses.create( input=function_responses, previous_response_id=response.id, **MODEL_DEFAULTS, ) The key detail is not just the loop itself. It is that the next request continues from the previous response and includes the tool outputs produced by the harness. A more explicit observation payload looks like this: context.append({ "type": "function_call_output", "call_id": tool_call.call_id, "output": str(result), }) response_2 = client.responses.create( model="o3", input=context, tools=tools, store=False, include=["reasoning.encrypted_content"], ) print(response_2.output_text) That function_call_output item is the observation that lets the model continue reasoning with the tool result now available in context. One of the easiest ways to break an agent is to lose state continuity. There are several patterns in current OpenAI tooling: Full history replay managed by the client previous_response_id for server-managed continuation conversation_id for conversation continuity SDK-managed session persistence Each approach has tradeoffs. With full replay, the client sends all prior messages and tool results every time. This is simple to reason about, but payload size grows quickly. With server-managed continuation, the client can send the new input along with a continuation identifier such as previous_response_id. That reduces payload size and offloads some history management. This example from the Agents SDK shows response chaining: from agents import Agent, Runner async def main(): agent = Agent(name="Assistant", instructions="Reply very concisely.") previous_response_id = None while True: user_input = input("You: ") # Setting auto_previous_response_id=True enables response chaining # automatically for the first turn, even when there is no actual # previous response ID yet. result = await Runner.run( agent, user_input, previous_response_id=previous_response_id, auto_previous_response_id=True, ) previous_response_id = result.last_response_id print(f"Assistant: {result.final_output}") This is convenient, but you still need to choose a consistent state strategy. The Agents SDK documentation explicitly warns against combining session persistence with conversation_id, previous_response_id, or auto_previous_response_id in the same run path. That is a practical design rule: pick one continuity model per call flow. If you mix them, debugging becomes much harder because it is no longer obvious which state the model is actually seeing. As the loop continues, context grows. Every new model call may include prior instructions, tool schemas, user messages, and tool outputs. If you simply keep appending everything forever, the number of bytes sent over the lifetime of a conversation can grow quickly. The Codex architecture discussion highlights a useful principle: keep old prompt content as an exact prefix of the new prompt whenever possible. That improves prompt-cache reuse. In practical terms, stable ordering matters for: System instructions Tool definitions Environment metadata Prior messages If these move around between calls, cacheability drops. The same issue affects reproducibility. Even tool-definition ordering bugs can introduce cache misses and inconsistent behavior. A production harness usually needs some combination of: Truncating verbose tool output Summarizing old history Keeping static instructions stable and early Bounding shell or retrieval output Preserving only the most relevant observations verbatim This matters even more for shell, retrieval, or computer-use tasks, where output can become noisy very quickly. The goal is not just lower cost. It is maintaining a usable reasoning substrate for the model. The more powerful the tools, the more important the harness becomes. Read-only tool calls are different from side-effectful operations. For example: Fetching documentation is relatively low risk Sending an email, editing a file, or executing a deployment is high risk Mutating actions should often be: Serialized instead of run concurrently Approval-gated Sandboxed when possible Logged with enough metadata for auditability This is one reason agent frameworks expose concurrency settings and approval workflows. You cannot safely assume that a tool request is correct just because it came from the model. Validate the arguments before execution, and return structured error feedback when something is wrong. That gives the loop a chance to recover without silently doing the wrong thing. OpenAI's function-calling guidance for reasoning models notes that you should not force extra "think more before every function call" prompting. Reasoning models already perform internal reasoning, and excessive prompting can degrade performance. That is a useful reminder that harness quality is often more important than prompt verbosity. Once a single-agent loop works, teams often add handoffs or agent-as-tool patterns. Conceptually, the loop stays the same: Invoke one agent. Detect whether it produced final output, a tool request, or a handoff. Route execution accordingly. Continue until termination. The Agents SDK summarizes the semantics clearly: The agent will run in a loop until a final output is generated. The loop runs like so: 1. The agent is invoked with the given input. 2. If there is a final output (i.e. the agent produces something of type `agent.output_type`), the loop terminates. 3. If there's a handoff, we run the loop again, with the new agent. 4. Else, we run tool calls (if any), and re-run the loop. The tricky part is not the idea of handoffs. It is history propagation. Recent community discussions show that when one agent is exposed as a tool to another, developers are often unsure how much history is forwarded automatically. In practice, this means you should not assume that all relevant context follows the handoff unless your framework explicitly guarantees it. For multi-agent systems, explicit context composition is often safer than implicit inheritance. Most agent bugs look obvious in hindsight. Symptoms: The agent repeats itself It forgets prior tool results MCP tool discovery keeps happening again Check whether you are correctly passing previous_response_id, conversation_id, or full message history. Symptoms: Long, low-quality responses Poor tool selection The model misses relevant facts Check whether tool output is too verbose. Cap output size, summarize logs, and keep only useful observations. Symptoms: Cache misses Inconsistent behavior across similar runs Higher token usage than expected Check the ordering of instructions, tool schemas, and environment metadata. Symptoms: Invalid API calls Accidental side effects Hard-to-reproduce failures Validate tool names and arguments before execution. Treat tool requests as proposals, not commands. Symptoms: Race conditions Conflicting writes Non-deterministic outcomes Run read-only operations concurrently only when safe. Serialize or approval-gate mutating operations. The recent OpenAI ecosystem changes make one thing clear: the important boundary is no longer just model prompting. It is orchestration design. The Responses API, Agents SDK, MCP integrations, and Codex harness examples all point to the same execution model: The model chooses actions The harness controls reality State continuity determines coherence Prompt discipline determines scalability Safety controls determine whether the system is usable in practice If you are building an agent today, the fastest path to a better system is often not a new prompt. It is a better loop. The agent loop is the action-observation cycle that makes tool-using LLM systems possible. The harness owns orchestration: context assembly, tool execution, validation, retries, approvals, and termination. State continuity is critical. Losing prior responses or tool outputs breaks reasoning quality quickly. Server-managed continuation can simplify history handling, but you should choose one state strategy consistently. Prompt growth is an engineering problem. Stable prefixes, truncation, compaction, and bounded tool output all matter. Hosted tools and custom tools shift the orchestration boundary in different ways. Multi-agent patterns introduce history propagation and control-flow complexity that should be designed explicitly. Safe execution requires argument validation, side-effect controls, and careful concurrency handling. If you want to go deeper, these resources are worth reading next: OpenAI function-calling guide OpenAI reasoning function-calls cookbook OpenAI Agents SDK running agents documentation OpenAI's Codex architecture write-up on the agent loop OpenAI MCP tool guide An agent loop is not a small implementation detail. It is the core runtime pattern that turns a model into a working system. Once you see the loop clearly, many design decisions make more sense: why history management matters, why tool output must be bounded, why prompt ordering affects cacheability, and why side effects need approval and validation. If you are building with tool-calling models, make the loop explicit first. Define how state is carried forward, how tools are validated, how observations are appended, and how the run terminates. In practice, that foundation will usually improve reliability more than any prompt tweak.

Understanding the Agent Loop: How Tool-Using LLM Systems Actually Work

Key Takeaways

Related Articles

Beyond SLSA: How to Stop Zero-Click CI/CD Worms with a 9-Step Plan

I built a free IDE extension to catch malicious npm packages before they wreck your project

What GLM-5.2 Changes for Long-Horizon Coding

Tinfoil (YC X25): Verifiable Privacy for Cloud AI

Discussion

Understanding the Agent Loop: How Tool-Using LLM Systems Actually Work

Key Takeaways

Related Articles

Beyond SLSA: How to Stop Zero-Click CI/CD Worms with a 9-Step Plan

I built a free IDE extension to catch malicious npm packages before they wreck your project

What GLM-5.2 Changes for Long-Horizon Coding

Tinfoil (YC X25): Verifiable Privacy for Cloud AI

Discussion

Related Articles

Dev.to
Beyond SLSA: How to Stop Zero-Click CI/CD Worms with a 9-Step Plan
The security perimeter of modern software development has officially collapsed. Historically, protecting your supply chain meant scanning static containers and blocking typosquatted packages. But between late 2025 and mid-2026, a terrifying paradigm shift occurred: adversaries abandoned passive repo

Dev.to
I built a free IDE extension to catch malicious npm packages before they wreck your project
Supply-chain attacks via npm are up year-over-year — packages like event-stream, after the fact, so I built NPM Safety Guard. It scans your package.json and lockfiles right inside your editor — no separate CLI step. Here's what it currently catches across 22 detection layers: Known malicious packag

Dev.to
What GLM-5.2 Changes for Long-Horizon Coding
GLM-5.2 is worth paying attention to because it is not just another large language model release. In the official Hugging Face announcement, the model is positioned around long-horizon tasks: a stable 1M-token context window, flexible effort levels, and an MIT license. That combination matters for d

Dev.to
Tinfoil (YC X25): Verifiable Privacy for Cloud AI
Tinfoil (YC X25) frames verifiable privacy as a cryptographic guarantee for cloud AI inference pipelines. The core thesis is that trust must move beyond marketing claims to mathematically auditable proofs for every token generated. While that architectural vision is sound, the implementation gap lie