How an AI Agent Thinks in a Loop

Think about a chef preparing a complex dish. They don’t just throw everything in a pot and walk away. They cook a little, taste, adjust the seasoning, cook some more, taste again. Each cycle refines the result. The dish emerges from iteration, not from a single act.

Claude Code’s thinking works the same way. When you type “fix this bug,” it doesn’t fire a single API call and return a response. It reads a file, runs a test, edits the code, runs the test again, checks the output, maybe edits once more. Each step requires calling the AI model, executing tools, feeding results back, and deciding what to do next.

This is the query engine — the brain of every AI agent. And understanding how it works is the key to understanding why AI coding assistants behave the way they do. I found the implementation fascinating — and also illuminating about the trade-offs inherent in building agentic systems.

The Problem: Multi-Turn Reasoning Is Hard

A simple chatbot takes a message and returns a response. One shot. Done. But an AI agent needs to act in the world. Consider what happens when you ask Claude Code to “find and fix the failing test in src/utils”:

The agent calls the API with your message
The model decides to read the test file — it emits a tool_use block
The agent executes that tool, captures the result
It sends the result back to the model
The model reads the file, decides to run the test suite — another tool_use
The agent runs the test, captures the failure output
Back to the model: now it understands the bug, decides to edit a source file
The agent performs the edit
The model wants to verify — runs the test again
Tests pass. The model emits a final text response

That is five round trips to the API in a single user interaction: the initial request, plus one more after each of the four tool results (read, test, edit, test). Each one requires streaming, tool detection, permission checks, error handling, context management, and state tracking. How do you manage this loop without it becoming an unmaintainable mess?

The answer in Claude Code is queryLoop — a while(true) generator function that orchestrates every turn of reasoning. Let’s trace exactly how it works.
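Before tracing the real thing, it helps to see the bare shape of such a loop. What follows is a minimal sketch, not Claude Code's implementation: the names agentLoop, Model, ToolRunner, and runToCompletion are hypothetical, the model is a scripted stub, and the generator is synchronous so the sketch runs standalone, whereas the loop described here is asynchronous and streams its output.

```typescript
// Hypothetical message shapes; not the types Claude Code uses.
type ToolCall = { name: string; input: string };
type ModelReply = { text: string; toolCalls: ToolCall[] };
type Message =
  | { role: "user"; content: string }
  | { role: "assistant"; content: string; toolCalls: ToolCall[] }
  | { role: "tool"; name: string; content: string };

type Model = (history: Message[]) => ModelReply;
type ToolRunner = (call: ToolCall) => string;

// One generator drives the whole interaction: call the model, run any tools it
// asked for, append the results, and repeat until a reply has no tool calls.
function* agentLoop(
  prompt: string,
  model: Model,
  runTool: ToolRunner,
  maxTurns = 10,
): Generator<Message, "completed" | "max_turns"> {
  const history: Message[] = [{ role: "user", content: prompt }];
  let turns = 0;
  while (true) {
    if (turns >= maxTurns) return "max_turns";
    turns += 1;
    const reply = model(history);
    const assistant: Message = { role: "assistant", content: reply.text, toolCalls: reply.toolCalls };
    history.push(assistant);
    yield assistant;
    if (reply.toolCalls.length === 0) return "completed";
    for (const call of reply.toolCalls) {
      const result: Message = { role: "tool", name: call.name, content: runTool(call) };
      history.push(result);
      yield result;
    }
  }
}

// Scripted stub standing in for the API: read a file, run tests, then finish.
const scriptedModel: Model = (history) => {
  const toolResults = history.filter((m) => m.role === "tool").length;
  if (toolResults === 0) return { text: "", toolCalls: [{ name: "Read", input: "src/utils/a.test.ts" }] };
  if (toolResults === 1) return { text: "", toolCalls: [{ name: "Bash", input: "npm test" }] };
  return { text: "Tests pass.", toolCalls: [] };
};

// Drains the generator, keeping both the yielded messages and the return reason.
function runToCompletion(gen: Generator<Message, string>): { messages: Message[]; reason: string } {
  const messages: Message[] = [];
  let step = gen.next();
  while (!step.done) {
    messages.push(step.value);
    step = gen.next();
  }
  return { messages, reason: step.value };
}
```

Even in this toy version, the caller sees every intermediate message as it is produced, and the loop's return value says why it stopped. Everything that follows is about what happens when that simple shape meets real failure modes.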

How Claude Code Solves It

Architecture Flow

See the diagram above for a visual overview of this flow.


The High-Level Flow

At the highest level, the query engine is a cycle. A user message enters, the model streams a response, tools are detected and executed, and results feed back into the next iteration until the model produces a final text response with no tool calls.

High-Level Query Flow

User sends message → QueryEngine.submitMessage → Build system prompt + context → queryLoop enters while(true) → Stream API response → Tool use detected?

Tool use detected, YES: Execute tools → Collect results → Append messages → Max turns?
    Max turns, NO: back to loop
    Max turns, YES: max_turns

Tool use detected, NO: Stop hooks pass?
    YES: Completed
    BLOCKED: Append errors → back to loop

The simplicity is deceptive. Inside that while(true), there are seven distinct reasons the loop might continue back to the top — and each one represents a different recovery or continuation strategy.

The Seven Continue Sites

The queryLoop function in query.ts (line 241) maintains a State object that is reassigned at each continue site. Its transition field records why the loop continued, which makes otherwise implicit states inspectable after the fact. Here are the seven continue sites, mapped as a state diagram:

Seven Continue Sites: State Diagram

queryLoop entry → Streaming

collapse_drain_retry: API returns 413, staged collapses exist → retry with drained messages
reactive_compact_retry: API returns 413 or media size error → retry with compacted context
max_output_escalate: Output token limit hit, escalate 8k to 64k → retry with higher max_output_tokens
max_output_recovery: Output token limit hit, injecting resume nudge → retry with resume message appended
model_fallback: FallbackTriggeredError, switch model → retry with fallback model
stop_hook_blocking: Stop hooks return blocking errors → retry with errors appended
token_budget_continue: Token budget says keep going → continue with nudge message

All seven states loop back to Streaming.

next_turn: Tool results collected, needsFollowUp = true → next iteration with tool results → Streaming

no tool use + stop hooks pass + budget complete → Completed

The seven continue sites, plus the ordinary 'next_turn' path, are each encoded as a string in the transition.reason field: 'collapse_drain_retry', 'reactive_compact_retry', 'max_output_tokens_escalate', 'max_output_tokens_recovery', 'model_fallback', 'stop_hook_blocking', 'token_budget_continuation', and 'next_turn' (the diagram labels shorten three of these names). But they are not declared as an enum or a union type. They exist only as string literals scattered across a very large orchestration file (src/query.ts, ~1.7K lines). More on that later.
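For contrast, here is roughly what declaring those reasons as a union type could look like. This is a sketch of the alternative, not code from the repository: the TransitionReason type and the describeTransition function are hypothetical, and only the reason strings themselves come from the list above.

```typescript
// The reason strings are the ones listed above; the type itself is hypothetical.
type TransitionReason =
  | "collapse_drain_retry"
  | "reactive_compact_retry"
  | "max_output_tokens_escalate"
  | "max_output_tokens_recovery"
  | "model_fallback"
  | "stop_hook_blocking"
  | "token_budget_continuation"
  | "next_turn";

// With a union, a misspelled reason is a compile error, and the `never`
// assignment below stops compiling if a new reason is added but not handled.
function describeTransition(reason: TransitionReason): string {
  switch (reason) {
    case "collapse_drain_retry":
      return "retry with drained messages";
    case "reactive_compact_retry":
      return "retry with compacted context";
    case "max_output_tokens_escalate":
      return "retry with higher max_output_tokens";
    case "max_output_tokens_recovery":
      return "retry with resume message appended";
    case "model_fallback":
      return "retry with fallback model";
    case "stop_hook_blocking":
      return "retry with errors appended";
    case "token_budget_continuation":
      return "continue with nudge message";
    case "next_turn":
      return "next iteration with tool results";
    default: {
      const unhandled: never = reason;
      throw new Error(`unhandled transition reason: ${String(unhandled)}`);
    }
  }
}
```

The practical difference is where mistakes surface: with scattered string literals a typo shows up at runtime, if at all, while with the union it fails at compile time.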

A Single Turn with Tool Use

Let’s zoom into a single iteration — what happens between one API call and the next when the model decides to use a tool.

Single Turn with Tool Use

User → QueryEngine: submitMessage("fix the bug"); build system prompt, context, messages
QueryEngine → Claude API (Stream): stream request with tools + messages
Claude API → QueryEngine → User: streaming response chunks (text delta / thinking delta), each yielded as a StreamEvent
QueryEngine: tool_use block detected (e.g., Read file); needsFollowUp = true
QueryEngine → Tool Orchestration: runTools([tool_use_block], assistantMessages, canUseTool, context); partitionToolCalls() separates read-only from mutating calls
Tool Orchestration → Tool (Read, Bash): execute (concurrent if read-only) → result content → yield MessageUpdate with tool_result
QueryEngine: append assistant message + tool_result to messages; state = next_turn; continue (back to the top of while(true))
QueryEngine → Claude API (Stream): stream request with updated messages
Claude API → QueryEngine: final text response (no tool_use) → yield final assistant message
Completed (reason: 'completed')

The key insight: each API call is a full streaming request. The model sees the entire conversation history (or the compacted version of it), including all previous tool results. Context management — autocompact, microcompact, snip, context collapse — runs at the top of each iteration, before the API call, to keep the conversation within token limits.

Streaming vs Orchestrated Tool Execution

Claude Code has two tool execution strategies, selectable via a feature gate. They represent fundamentally different trade-offs, and I found it instructive to compare them:

Streaming vs Orchestrated Execution

toolOrchestration.ts, runTools(): waits for full response before executing
Receive all tool_use blocks → partitionToolCalls → Is batch read-only?
    YES: Concurrent (10 max)
    NO: Serial (one at a time)
Then: Queue context modifiers → Apply modifiers in order → Yield MessageUpdates

StreamingToolExecutor, execute as parsed: starts tools during API streaming
Tool arrives during stream → addTool → Concurrent-safe?
    YES: Start immediately
    NO: Wait then execute
Then: Buffer results in order → Yield in submission order

toolOrchestration.ts (runTools): Waits until the API response is fully streamed, collects all tool_use blocks, then partitions them. Read-only tools (like Read, Glob, Grep) run concurrently up to a configurable limit (default 10). Mutating tools (like Write, Edit, Bash) run serially. This is safe and predictable.

StreamingToolExecutor: Starts executing tools the moment they are parsed from the stream, before the API response is complete. A tool_use block for “Read file X” can begin executing while the model is still generating the next tool_use block for “Read file Y.” This reduces latency significantly for multi-tool turns.

The partitioning logic is the same in both paths: isConcurrencySafe is determined by calling tool.isConcurrencySafe() with the parsed input. For Bash, concurrency-safety typically tracks whether the command is read-only (cat, ls) versus mutating (rm, mv). The concurrency limit is configurable via CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY (default 10).
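To make the partitioning idea concrete, here is one plausible policy, sketched under assumptions: consecutive concurrency-safe calls are grouped into one concurrent batch, and every other call runs alone, in order. The ToolCall and Batch shapes, the free-standing isConcurrencySafe function, and the tiny Bash allowlist are all hypothetical stand-ins; the text above only tells us that the real check is tool.isConcurrencySafe() on the parsed input.

```typescript
// Hypothetical shapes for illustration; the real tool interface is not shown in the text.
type ToolCall = { id: string; name: string; input: string };
type Batch = { concurrent: boolean; calls: ToolCall[] };

// Assumed stand-in for tool.isConcurrencySafe(input). Deliberately naive: it
// looks only at the first word of a Bash command, so it would misjudge
// something like `cat a > b`.
function isConcurrencySafe(call: ToolCall): boolean {
  if (["Read", "Glob", "Grep"].includes(call.name)) return true;
  if (call.name === "Bash") {
    const command = call.input.trim().split(/\s+/)[0];
    return ["cat", "ls"].includes(command);
  }
  return false;
}

// Groups runs of concurrency-safe calls into one batch and gives every other
// call a batch of its own, preserving the order the model asked for.
function partitionToolCalls(calls: ToolCall[]): Batch[] {
  const batches: Batch[] = [];
  for (const call of calls) {
    const safe = isConcurrencySafe(call);
    const last = batches[batches.length - 1];
    if (safe && last !== undefined && last.concurrent) {
      last.calls.push(call);
    } else {
      batches.push({ concurrent: safe, calls: [call] });
    }
  }
  return batches;
}
```

Under this policy a turn that asks for Read, Grep, Edit, then two Bash commands (ls and rm) becomes four batches: the two reads together, the edit alone, the ls as a concurrent batch of one, and the rm alone.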

Smart Pattern: The StreamingToolExecutor tracks tool status through a lifecycle ('queued' | 'executing' | 'completed' | 'yielded') and maintains a child abort controller. If a Bash tool errors, the child controller fires to kill sibling subprocesses immediately — without aborting the parent query loop. A tool-level failure should not terminate the entire turn.
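The child abort controller is worth a small illustration. The sketch below uses the standard AbortController API; the createChildAbortController helper is a hypothetical name, and this is only one way to get the one-directional behaviour described above, not necessarily how StreamingToolExecutor does it.

```typescript
// Hypothetical helper: the child aborts whenever the parent does, but aborting
// the child never propagates upward.
function createChildAbortController(parent: AbortController): AbortController {
  const child = new AbortController();
  if (parent.signal.aborted) {
    // The parent was already cancelled, so the child starts out aborted.
    child.abort();
  } else {
    parent.signal.addEventListener("abort", () => child.abort(), { once: true });
  }
  return child;
}

// A tool-level failure cancels sibling tools without touching the query loop.
const queryController = new AbortController();
const toolController = createChildAbortController(queryController);
toolController.abort();
```

After the last line, toolController.signal.aborted is true while queryController.signal.aborted is still false, which is exactly the asymmetry the pattern relies on.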

An Improved Design: Explicit State Machine

The current implementation works. It has shipped to millions of users. But the while(true) with seven continue sites and a transition field that is only set after the fact is a design that resists formal reasoning. You cannot draw a state diagram from the code without reading a very large orchestration function and mentally tracking every path.

Here is what an explicit typed state machine would look like:

Improved Explicit State Machine

Idle → submitMessage(prompt) → Preparing Context → context built, API call initiated → Streaming

From Streaming:
    tool_use detected: Tool Execution → Feeding Results → back to Streaming
    no tool use: Evaluating Stop Hooks
    API error: Recovering

From Evaluating Stop Hooks:
    PASS: Check Budget
    BLOCK: → Streaming

From Check Budget:
    DONE: Complete
    otherwise: → Streaming

From Recovering:
    retry: → Streaming
    FAIL: Complete

Each state would be a discriminated union:

type QueryState =
  | { kind: 'idle' }
  | { kind: 'preparing_context'; messages: Message[] }
  | { kind: 'streaming'; model: string; attempt: number }
  | { kind: 'tool_execution'; pendingTools: ToolUseBlock[] }
  | { kind: 'feeding_results'; toolResults: Message[] }
  | { kind: 'evaluating_stop_hooks'; assistantMessages: AssistantMessage[] }
  | { kind: 'checking_budget'; turnTokens: number }
  | { kind: 'recovering_from_error'; error: RecoveryError; attempt: number }
  | { kind: 'complete'; reason: TerminalReason }

Transitions would be explicit functions, not state = { ... }; continue. Each transition could be logged, tested, and visualized. The while(true) would become a switch on state.kind, and each case would return the next state.
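Here is a small sketch of what such a transition function might look like. It is an illustration under assumptions, not a proposal for the actual codebase: the state type is a trimmed-down version of QueryState (no context preparation or budget check), and the QueryEvent type, the transition function, and the replay helper are all hypothetical.

```typescript
// Trimmed-down, hypothetical version of the QueryState union above.
type QueryState =
  | { kind: "streaming"; attempt: number }
  | { kind: "tool_execution"; pendingTools: string[] }
  | { kind: "evaluating_stop_hooks" }
  | { kind: "recovering_from_error"; error: string; attempt: number }
  | { kind: "complete"; reason: "completed" | "error" };

// Hypothetical events that can move the machine from one state to the next.
type QueryEvent =
  | { type: "tool_use_detected"; tools: string[] }
  | { type: "no_tool_use" }
  | { type: "tools_finished" }
  | { type: "stop_hooks_passed" }
  | { type: "stop_hooks_blocked" }
  | { type: "api_error"; error: string }
  | { type: "recovered" }
  | { type: "recovery_failed" };

// Pure function: given a state and an event, return the next state or throw
// if that event is not legal in that state.
function transition(state: QueryState, event: QueryEvent): QueryState {
  switch (state.kind) {
    case "streaming":
      if (event.type === "tool_use_detected") return { kind: "tool_execution", pendingTools: event.tools };
      if (event.type === "no_tool_use") return { kind: "evaluating_stop_hooks" };
      if (event.type === "api_error") {
        return { kind: "recovering_from_error", error: event.error, attempt: state.attempt };
      }
      break;
    case "tool_execution":
      if (event.type === "tools_finished") return { kind: "streaming", attempt: 1 };
      break;
    case "evaluating_stop_hooks":
      if (event.type === "stop_hooks_passed") return { kind: "complete", reason: "completed" };
      if (event.type === "stop_hooks_blocked") return { kind: "streaming", attempt: 1 };
      break;
    case "recovering_from_error":
      if (event.type === "recovered") return { kind: "streaming", attempt: state.attempt + 1 };
      if (event.type === "recovery_failed") return { kind: "complete", reason: "error" };
      break;
    case "complete":
      break;
  }
  throw new Error(`invalid transition: ${state.kind} on ${event.type}`);
}

// Replays a sequence of events from the initial streaming state.
function replay(events: QueryEvent[]): QueryState {
  const start: QueryState = { kind: "streaming", attempt: 1 };
  return events.reduce<QueryState>((state, event) => transition(state, event), start);
}
```

Because transition is a pure function, a scenario such as "one tool call, then a clean finish" is just a list of events passed to replay, with no API client or tool runner to mock.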

The Smart Part: Generator-Based ask() Wrapper

One design decision in the query engine deserves particular praise. The ask() function is an AsyncGenerator that wraps QueryEngine.submitMessage(), which itself is an AsyncGenerator wrapping queryLoop().

This generator pipeline enables a single implementation to serve both streaming and batch consumption:

REPL (interactive mode): The Ink UI iterates the generator with for await, rendering each StreamEvent as it arrives. Text deltas update the screen in real-time. Tool results appear as they complete.
SDK / Headless mode: The same generator is consumed by ask(), which constructs a QueryEngine, calls submitMessage(), and yields SDKMessage objects. SDK callers can iterate for streaming or collect into an array for batch.
Subagents: When Claude Code spawns a sub-agent (e.g., for background compaction), it calls the same queryLoop with different parameters. The generator contract is identical.

This is a textbook application of the generator pattern: decouple production from consumption. The query engine does not know or care whether its output is rendered to a terminal, collected into an array, or piped to another process.
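A stripped-down sketch shows why this works. Everything here is hypothetical (the StreamEvent and SdkMessage shapes, and the produceEvents, askSketch, renderStreaming, and collect functions); it only demonstrates the general pattern of one async generator wrapping another, with a streaming consumer and a batch consumer sharing the same producer.

```typescript
// Hypothetical event and message shapes, for illustration only.
type StreamEvent = { type: "text_delta"; text: string } | { type: "turn_complete" };
type SdkMessage = { kind: "partial"; text: string } | { kind: "result"; fullText: string };

// Innermost producer, standing in for the query loop.
async function* produceEvents(chunks: string[]): AsyncGenerator<StreamEvent> {
  for (const text of chunks) yield { type: "text_delta", text };
  yield { type: "turn_complete" };
}

// Wrapper generator: re-shapes each event as it arrives, without buffering the stream.
async function* askSketch(chunks: string[]): AsyncGenerator<SdkMessage> {
  let fullText = "";
  for await (const event of produceEvents(chunks)) {
    if (event.type === "text_delta") {
      fullText += event.text;
      yield { kind: "partial", text: event.text };
    } else {
      yield { kind: "result", fullText };
    }
  }
}

// Streaming consumer: handles each message as soon as it is produced.
async function renderStreaming(
  source: AsyncIterable<SdkMessage>,
  write: (text: string) => void,
): Promise<void> {
  for await (const message of source) {
    if (message.kind === "partial") write(message.text);
  }
}

// Batch consumer: drains the same kind of generator into an array.
async function collect<T>(source: AsyncIterable<T>): Promise<T[]> {
  const items: T[] = [];
  for await (const item of source) items.push(item);
  return items;
}
```

The producer never learns which consumer it has. An interactive caller would pass askSketch(...) to renderStreaming, a batch caller would pass it to collect, and neither choice requires a change on the producing side.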

The QueryEngine class itself is designed as one-instance-per-conversation. State — messages, file cache, usage totals, permission denials — persists across turns within the same engine instance. Each call to submitMessage() starts a new turn, clearing turn-scoped state while preserving conversation-scoped state.
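The split between turn-scoped and conversation-scoped state can be sketched in a few lines. The class below is hypothetical and far simpler than QueryEngine; its field and method names (other than submitMessage) are invented for illustration.

```typescript
// Hypothetical sketch; these members are illustrative, not QueryEngine's real fields.
class ConversationEngineSketch {
  // Conversation-scoped: survives across submitMessage() calls.
  private readonly messages: string[] = [];
  private totalTokens = 0;
  // Turn-scoped: cleared at the start of every submitMessage() call.
  private turnToolCalls: string[] = [];

  // Starts a new turn: turn-scoped state is reset, conversation state is kept.
  submitMessage(prompt: string): void {
    this.turnToolCalls = [];
    this.messages.push(prompt);
  }

  // Records work done during the current turn.
  recordToolCall(name: string, tokens: number): void {
    this.turnToolCalls.push(name);
    this.totalTokens += tokens;
  }

  snapshot(): { messages: number; totalTokens: number; turnToolCalls: number } {
    return {
      messages: this.messages.length,
      totalTokens: this.totalTokens,
      turnToolCalls: this.turnToolCalls.length,
    };
  }
}
```

One instance per conversation means the usage totals keep accumulating across turns, while anything that only makes sense within a single turn starts from zero each time.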

What Should Change

The while(true) loop with seven continue sites is a liability at scale. Here is why:

Debugging is archaeology. When a user reports that Claude Code “got stuck in a loop,” the investigation requires reading the full queryLoop function, identifying which continue site fired, and mentally reconstructing the state at that point.

Testing is incomplete. You cannot write a unit test for “the loop enters reactive_compact_retry and then falls through to completed” without mocking the entire query infrastructure. An explicit state machine would let you test transitions in isolation.

New continue sites are risky. Every new recovery strategy means another state = { ... }; continue block, another 15 lines of boilerplate, another place where a typo in one of the nine state fields creates a subtle bug.

No formal verification. You cannot run a model checker on an implicit state machine. With an explicit one, you could verify properties like “reactive_compact_retry is attempted at most once” or “stop_hook_blocking never follows stop_hook_blocking” statically.
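Short of a model checker, even a recorded list of transition reasons lets you check properties like these after the fact. The sketch below is a runtime trace check, not static verification, and the checkTransitionLog function is a hypothetical helper; it assumes the reasons for one turn have been collected into an array in order.

```typescript
// Checks two example properties over the ordered transition reasons of one turn
// and returns a human-readable message for each violation found.
function checkTransitionLog(log: string[]): string[] {
  const violations: string[] = [];

  // Property 1: reactive_compact_retry is attempted at most once.
  const compactRetries = log.filter((reason) => reason === "reactive_compact_retry").length;
  if (compactRetries > 1) {
    violations.push(`reactive_compact_retry attempted ${compactRetries} times`);
  }

  // Property 2: stop_hook_blocking never directly follows stop_hook_blocking.
  for (let i = 1; i < log.length; i++) {
    if (log[i] === "stop_hook_blocking" && log[i - 1] === "stop_hook_blocking") {
      violations.push(`stop_hook_blocking repeated at index ${i}`);
    }
  }
  return violations;
}
```

A check like this only catches violations on paths that actually ran, which is the gap an explicit state machine would close.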

The transition field was the right first step. The next step is promoting it from a diagnostic annotation to the actual control flow mechanism.

Takeaway

Every AI agent needs a query loop. It is the fundamental abstraction: take a user message, call the model, detect tool use, execute tools, feed results back, repeat until done. Claude Code’s queryLoop is a mature, battle-tested implementation of this pattern, handling edge cases (token limits, context compaction, model fallback, stop hooks, budget tracking) that simpler implementations ignore.

But the lesson is clear: make the states explicit from day one. The while(true) pattern works when you have two or three continue sites. At seven — with context collapse, reactive compaction, output token escalation, and budget continuation all competing for control flow — you need a typed state machine.

This is Part 2 of the “Inside Claude Code” series.

← Part 1: Boot Sequence | Part 3: The Tool System →
