
Inside Claude Code: Why Your AI Should Talk While It Thinks

Claude Code's streaming-first API design — progressive token rendering, concurrent tool execution, and why batch responses are dead. Part 4 of 10.

Imagine texting a friend who goes silent for 30 seconds, then dumps an entire paragraph on you all at once. Now imagine a different friend who types “hmm… let me think… okay so the issue is…” — you can see them working through the problem in real time. You stay engaged. You feel heard.

That is the difference streaming makes in an AI application.

Claude Code chose the second friend. Streaming is not an optimization bolted on after the fact. It is the default path — the architectural foundation that every other subsystem assumes exists. Non-streaming is the fallback. Understanding why that inversion matters, and how the implementation actually works, is the subject of this post.

The Problem: Compounding Silence

Large language model API calls take anywhere from 2 to 30 seconds depending on prompt complexity, model load, and output length. If your application waits for the complete response before rendering anything, your user stares at a blank screen (or a spinner, which is barely better) for the entire duration.

But it gets worse. Modern AI applications do not just generate text — they use tools. Claude Code can read files, execute shell commands, search codebases, and write code. Each tool invocation takes additional seconds. A single turn might involve the model generating some text, calling a tool, waiting for the result, generating more text, calling another tool, and so on. Without streaming, these latencies compound. A turn that involves three tool calls might take 15-20 seconds of pure silence.

Users will not wait that long. They will close the tab, switch applications, or lose trust in the tool. The lesson I took from studying the codebase is that perceived performance has far less to do with total response time than with time to first visible token.

How Claude Code Solves It

Architecture Flow

See the diagram above for a visual overview of this flow.

Claude Code’s streaming implementation lives primarily in three files that form a pipeline from HTTP connection to rendered output:

  • claude.ts — The API client layer that initiates streaming connections and yields StreamEvent objects via async generators
  • withRetry.ts — The retry/resilience layer that wraps API calls with exponential backoff, 529/429 handling, and model fallback logic
  • StreamingToolExecutor.ts — The concurrent tool execution engine that begins running tools as they are parsed from the stream, not after the full response arrives

Let’s trace the complete flow.

The Streaming Flow

This sequence diagram shows what happens during a single streaming turn, including tool execution that begins mid-stream:

Streaming Sequence

Participants: Client, API, StreamingToolExecutor, Tool Runtime

  1. POST /messages (stream: true): model begins generating
  2. First token (~50ms): tokens streaming
  3. “Let me read that file…”: tool_use block detected
  4. addTool(block): check concurrency
  5. Execute tool: running (concurrent)
  6. More tokens streaming: second tool_use block
  7. addTool(block2): isConcurrencySafe? If so, run in parallel
  8. Tool 1 result ready: complete
  9. Tool 2 result ready: complete
  10. Yield results in order: results buffered
  11. Assemble final response: stream_end

The critical insight: the StreamingToolExecutor does not wait for the stream to finish. The moment a tool_use content block is fully parsed from the streaming response, addTool() is called, and the tool begins executing immediately. From the source:

addTool(block: ToolUseBlock, assistantMessage: AssistantMessage): void {
    const toolDefinition = findToolByName(this.toolDefinitions, block.name)
    // ... validation ...
    const isConcurrencySafe = parsedInput?.success
      ? Boolean(toolDefinition.isConcurrencySafe(parsedInput.data))
      : false
    this.tools.push({
      id: block.id,
      block,
      assistantMessage,
      status: 'queued',
      isConcurrencySafe,
      pendingProgress: [],
    })
    void this.processQueue()
}

The processQueue() call at the end is fire-and-forget (void). It evaluates concurrency conditions and starts execution immediately if the queue allows it. Tools that are marked concurrency-safe can run in parallel with each other. Non-concurrent tools get exclusive access — the queue blocks until they complete.
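The queueing rule described above can be sketched in a few lines. This is a simplified model, not Claude Code's implementation: the names MiniToolQueue, QueuedTool, run, startOrder, and drain are hypothetical, and the rule that later tools never jump ahead of a blocked tool is an assumption. Error handling is left out.

```typescript
// Hypothetical, simplified model of a streaming tool queue.
// None of these names come from Claude Code; failures are not handled.
type ToolStatus = "queued" | "running" | "done";

interface QueuedTool {
  id: string;
  isConcurrencySafe: boolean;
  status: ToolStatus;
  run: () => Promise<string>;
  result?: string;
}

class MiniToolQueue {
  private tools: QueuedTool[] = [];
  readonly startOrder: string[] = [];

  addTool(id: string, isConcurrencySafe: boolean, run: () => Promise<string>): void {
    this.tools.push({ id, isConcurrencySafe, status: "queued", run });
    void this.processQueue(); // fire-and-forget, mirroring the excerpt above
  }

  // A tool may start when nothing is running, or when it and every
  // currently running tool are all concurrency-safe.
  private canStart(tool: QueuedTool): boolean {
    const running = this.tools.filter((t) => t.status === "running");
    return (
      running.length === 0 ||
      (tool.isConcurrencySafe && running.every((t) => t.isConcurrencySafe))
    );
  }

  private async processQueue(): Promise<void> {
    for (const tool of this.tools) {
      if (tool.status !== "queued") continue;
      if (!this.canStart(tool)) break; // assumed: later tools never jump the queue
      tool.status = "running";
      this.startOrder.push(tool.id);
      void tool.run().then((result) => {
        tool.result = result;
        tool.status = "done";
        void this.processQueue(); // a finished tool may unblock the next one
      });
    }
  }

  // Resolves once every tool added so far has finished (simple polling).
  async drain(): Promise<void> {
    while (this.tools.some((t) => t.status !== "done")) {
      await new Promise<void>((resolve) => setTimeout(resolve, 1));
    }
  }
}
```

Adding two safe reads, one unsafe write, and a third read to this sketch starts the two reads together, holds the write until both finish, and holds the last read until the write finishes.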

Batch vs. Streaming: Perceived Latency

The difference in user experience between batch and streaming is dramatic, even when total response time is identical:

Batch vs Streaming Comparison

  • Batch mode: API call → wait 5s → full response → render 0.1s. Perceived latency: 5.1s.
  • Streaming mode: API call → first token at 50ms → progressive render → tokens keep flowing. Perceived latency: 50ms.

In batch mode, perceived latency equals total response time. In streaming mode, perceived latency equals time-to-first-token. For Claude’s API, that is typically 30-80ms. The total wall-clock time might be the same or even slightly longer (streaming has per-chunk overhead), but in the comparison above (5.1s versus 50ms) the felt experience is roughly 100x more responsive.
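To make the two numbers concrete, here is a small measurement helper. It is illustrative and not taken from Claude Code: measureStream and StreamTiming are hypothetical names, and the clock is injected so the result does not depend on real timing.

```typescript
// Illustrative helper (not from Claude Code): records time-to-first-chunk
// and total duration for any async stream. The clock is injectable so the
// measurement can be made deterministic.
interface StreamTiming {
  timeToFirstChunkMs: number | null; // null when the stream yields nothing
  totalMs: number;
  chunks: number;
}

async function measureStream<T>(
  stream: AsyncIterable<T>,
  now: () => number = Date.now,
): Promise<StreamTiming> {
  const start = now();
  let firstAt: number | null = null;
  let chunks = 0;
  for await (const _chunk of stream) {
    if (firstAt === null) firstAt = now(); // what a streaming UI "feels"
    chunks++;
  }
  return {
    timeToFirstChunkMs: firstAt === null ? null : firstAt - start,
    totalMs: now() - start, // what a batch UI "feels"
    chunks,
  };
}
```

A batch consumer experiences totalMs as its latency, while a streaming consumer experiences timeToFirstChunkMs, even though both wait the same wall-clock time for the final chunk.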

Claude Code implements this progressive delivery via queryModelWithStreaming(), an async generator function in claude.ts:

export async function* queryModelWithStreaming({
  messages, systemPrompt, thinkingConfig, tools, signal, options,
}: { /* ... */ }): AsyncGenerator<
  StreamEvent | AssistantMessage | SystemAPIErrorMessage, void
> {
  return yield* withStreamingVCR(messages, async function* () {
    yield* queryModel(messages, systemPrompt, thinkingConfig, tools, signal, options)
  })
}

The AsyncGenerator return type is the key architectural choice. Each yield hands a StreamEvent to the consumer (the UI layer), which can render it immediately. The generator protocol also provides backpressure naturally: a generator only advances when the consumer requests the next value, so if the UI cannot keep up, the generator pauses at its current yield.
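The pull-based behavior is easy to observe in isolation. The following sketch is illustrative only; produceTokens and renderSlowly are hypothetical names, and the slow UI is simulated with a timer.

```typescript
// Illustrative only: an async generator produces a value only when the
// consumer asks for the next one, so a slow consumer slows the producer.
async function* produceTokens(
  words: string[],
  log: string[],
): AsyncGenerator<string, void> {
  for (const word of words) {
    log.push(`produce:${word}`);
    yield word; // execution pauses here until the consumer calls next()
  }
}

async function renderSlowly(words: string[], delayMs: number): Promise<string[]> {
  const log: string[] = [];
  for await (const word of produceTokens(words, log)) {
    log.push(`render:${word}`);
    // Simulate a slow UI; the producer does not run during this wait.
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  return log;
}
```

The returned log strictly alternates between produce and render entries, which shows the producer never runs ahead of the consumer. This covers only the generator itself; whether that pressure reaches the network depends on the underlying HTTP client.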

Retry Flow with 529/429 Handling

Network requests fail. APIs get overloaded. Claude Code’s retry layer in withRetry.ts handles this with a nuanced strategy that differentiates between foreground (user-blocking) and background (invisible) requests:

Participants: Client, withRetry, Anthropic API

  1. Request (attempt 1): POST /messages, processing
  2. 429 Rate Limited: rejected
  3. Exponential backoff, BASE_DELAY_MS = 500ms: sleep 500ms (attempt 1)
  4. POST /messages (attempt 2): processing
  5. 529 Overloaded: the flow branches on the kind of request

  • Foreground (user waiting): retry up to MAX_529_RETRIES=3, sleep 1000ms (attempt 2), POST /messages (attempt 3), 200 OK and response returned
  • Background (summary, title): bail immediately on 529 with CannotRetryError; no retry, to reduce gateway amplification during cascades

What I found particularly smart is the foreground vs. background distinction:

const FOREGROUND_529_RETRY_SOURCES = new Set<QuerySource>([
  'repl_main_thread',
  'sdk',
  'agent:custom',
  'agent:default',
  'compact',
  'hook_agent',
  'verification_agent',
  'side_question',
  'auto_mode',
  // ...
])

Background queries — generating conversation titles, computing summaries, running classifiers — bail immediately on 529 errors. The reasoning is explicit in the source comments: “during a capacity cascade each retry is 3-10x gateway amplification, and the user never sees those fail anyway.” This is infrastructure-aware design. The retry policy considers not just the individual request’s success but its impact on the overall system during degraded conditions.

The base delay is 500ms with exponential backoff. For unattended/persistent sessions, there is a separate mode (CLAUDE_CODE_UNATTENDED_RETRY) that retries indefinitely with a 5-minute max backoff ceiling and periodic heartbeat yields every 30 seconds to prevent the host environment from killing idle sessions.
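The policy above can be sketched as two small functions. This is a sketch under stated assumptions, not the contents of withRetry.ts: BASE_DELAY_MS and MAX_529_RETRIES use the values quoted in this post, while the doubling formula, the maxDelayMs cap, the absence of jitter, the shortened source list, and the names backoffDelayMs and shouldRetry529 are all assumed for illustration.

```typescript
// Sketch only. The constants match values quoted in the post; everything
// else (formula, cap, function names, counting convention) is assumed.
const BASE_DELAY_MS = 500;
const MAX_529_RETRIES = 3;

type QuerySource = string;

// Shortened, illustrative subset of the foreground sources listed above.
const FOREGROUND_SOURCES = new Set<QuerySource>(["repl_main_thread", "sdk"]);

// Delay before the next attempt: 500ms, 1000ms, 2000ms, ... capped at
// maxDelayMs. `attempt` is 1-based. Real retry layers usually add jitter.
function backoffDelayMs(attempt: number, maxDelayMs: number = 32000): number {
  return Math.min(BASE_DELAY_MS * Math.pow(2, attempt - 1), maxDelayMs);
}

// Decide whether a 529 (overloaded) response should be retried.
// `consecutive529s` counts 529 responses seen so far, including this one.
function shouldRetry529(source: QuerySource, consecutive529s: number): boolean {
  if (!FOREGROUND_SOURCES.has(source)) {
    return false; // background work bails at once to avoid amplification
  }
  return consecutive529s <= MAX_529_RETRIES;
}
```

Keeping the decision in a pure function like this makes the foreground and background behavior easy to test without a network.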

The Decomposed Streaming Pipeline

Currently, claude.ts is a 3,419-line, 125KB file that handles HTTP client creation, streaming pipeline management, retry orchestration, beta header injection, cost tracking, model fallback, and non-streaming fallback logic. This diagram shows how these responsibilities could be decomposed:

Incoming Request → Streaming supported?

  • YES → StreamProcessor, with these components:
      • TokenRenderer: progressive output, AsyncGenerator yields
      • ToolDetector: parse tool_use blocks, validate schemas
      • StreamingToolExecutor: concurrency control, parallel safe tools, ordered results
      • CostTracker: token counting, usage analytics
  • NO → NonStreamingFallback

Both paths use:

  • HttpClient: connection management, headers, beta features
  • RetryPolicy: exponential backoff, 529/429 handling, foreground vs. background

StreamingToolExecutor is already properly separated — it lives in its own file with clear boundaries. But the rest of the pipeline is entangled within claude.ts.

What Makes This Smart

Four design decisions stand out as particularly well-considered:

Streaming is the default, not the optimization. The function is called queryModelWithStreaming. The non-streaming path is called executeNonStreamingRequest and is explicitly a “fallback.” The entire system assumes tokens arrive progressively. Most applications start with batch requests and add streaming later, resulting in streaming being a parallel code path with different semantics, different error handling, and different testing requirements. Claude Code avoids this by making streaming primary.

Tool execution overlaps with generation. The StreamingToolExecutor does not wait for the model to finish generating before beginning tool execution. Tools start running the moment their tool_use block is fully parsed from the stream. For a response that generates text, calls a file-read tool, generates more text, and calls a search tool, this means the file read can already be complete by the time the model finishes generating the second tool call. Much of the tool execution latency is hidden behind generation latency.

Consider a typical coding turn: the model generates “Let me check the test file,” then emits a tool_use to read tests/app.test.ts, then continues generating “Now let me look at the implementation.” While those few tokens of text are streaming to the user’s terminal, the file read is already running in the background. By the time the model emits its second tool call, the first result is often already buffered and ready.

Concurrency control is per-tool, not global. Not all tools are safe to run in parallel. A bash command that writes to a file should not run concurrently with another bash command that reads the same file. The isConcurrencySafe check per tool definition allows safe tools (file reads, searches) to execute in parallel while unsafe tools get exclusive access. Results are always emitted in the order tools were received, not the order they completed — preserving deterministic output even with parallel execution.

Graceful degradation under stream failure. When a streaming connection fails mid-response, the StreamingToolExecutor has a discard() method that cleanly abandons all in-progress and queued tools. The system then falls back to executeNonStreamingRequest, carrying forward the consecutive 529 error count from the streaming attempt so the total retry budget is shared across both paths.
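The ordering guarantee from the concurrency point above (results emitted in the order tools were received, not the order they completed) can be sketched with a plain async generator. This is an illustration of the idea rather than the StreamingToolExecutor's code; yieldInReceivedOrder is a hypothetical name, and failure handling is omitted.

```typescript
// Illustrative sketch: run tasks concurrently, but yield their results in
// the order the tasks were received rather than the order they finished.
// Rejections are not handled here; a real executor would need to.
async function* yieldInReceivedOrder<T>(
  tasks: Array<() => Promise<T>>,
): AsyncGenerator<T, void> {
  // Start every task immediately so their execution overlaps.
  const inFlight = tasks.map((task) => task());
  // Await them one at a time in submission order. A task that finished
  // early simply sits in its settled promise until its turn comes.
  for (const pending of inFlight) {
    yield await pending;
  }
}
```

A slow first task followed by a fast second task still yields the first task's result first, which keeps the output deterministic regardless of completion timing.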

What Could Be Better

The 125KB monolith. claude.ts at 3,419 lines mixes HTTP client creation, streaming pipeline logic, retry/backoff strategy, beta header management, cost tracking, model selection, provider configuration, and non-streaming fallback — all in a single file. These are distinct responsibilities with different rates of change. The StreamingToolExecutor shows what good separation looks like: a focused class in its own file, with a clear interface.

Error recovery boundaries are implicit. When a streaming request fails mid-stream, the coordination between the streaming layer detecting the failure, calling discard() on the StreamingToolExecutor, and falling back to a non-streaming request spans multiple functions across multiple files. Making these transitions explicit — perhaps as a finite state machine — would improve debuggability.

The retry policy deserves a single home. The retry configuration (which query sources are foreground, what the backoff parameters are, how persistent retry mode works) is spread between withRetry.ts and claude.ts. A unified retry configuration that lives in one place would make it easier to reason about system behavior during capacity events.
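As one way to picture the finite state machine suggested for error recovery, here is a minimal transition table. Everything in it is hypothetical: the state names, the event names, and the transition function are invented for illustration and do not appear in Claude Code.

```typescript
// Hypothetical state machine for the stream-failure path. None of these
// state or event names come from Claude Code.
type RecoveryState =
  | "streaming"
  | "discarding_tools"
  | "non_streaming_fallback"
  | "done"
  | "failed";

type RecoveryEvent =
  | "stream_completed"
  | "stream_failed"
  | "tools_discarded"
  | "fallback_succeeded"
  | "fallback_failed";

// Every legal transition is listed in one place; anything else is an error.
const TRANSITIONS: Record<RecoveryState, Partial<Record<RecoveryEvent, RecoveryState>>> = {
  streaming: { stream_completed: "done", stream_failed: "discarding_tools" },
  discarding_tools: { tools_discarded: "non_streaming_fallback" },
  non_streaming_fallback: { fallback_succeeded: "done", fallback_failed: "failed" },
  done: {},
  failed: {},
};

function transition(state: RecoveryState, event: RecoveryEvent): RecoveryState {
  const next = TRANSITIONS[state][event];
  if (next === undefined) {
    throw new Error(`Illegal transition: ${event} while ${state}`);
  }
  return next;
}
```

With a table like this, a log line per transition is enough to reconstruct exactly how a failed turn moved from streaming to fallback.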

The Takeaway

Make streaming your default path, not an optimization you add later. The architecture difference is fundamental.

When you start with batch responses and retrofit streaming, you end up with two parallel code paths that diverge over time: different error handling, different tool execution timing, different UX guarantees. When you start with streaming and treat batch as the fallback, every component in your system naturally handles progressive data. Your tool executor starts tools immediately. Your retry logic can yield partial results before retrying. Your UI rarely shows a “loading” state for longer than the time to first token.

The key patterns from Claude Code’s implementation:

  1. Use async generators as your streaming primitive. They compose naturally, handle backpressure, and integrate with TypeScript’s type system. AsyncGenerator<StreamEvent | AssistantMessage> is both the transport mechanism and the API contract.

  2. Execute tools as they are parsed, not after the response completes. The StreamingToolExecutor pattern of queuing tools from the stream and processing them concurrently is reusable in any tool-using AI application.

  3. Differentiate retry behavior by query importance. Not all requests deserve the same retry budget. Background requests that fail invisibly should bail fast during capacity events to reduce system-wide amplification.

  4. Keep your streaming pipeline decomposed. A single file that handles connection management, event parsing, tool detection, retry logic, and cost tracking will eventually collapse under its own weight. Separate these by rate of change.


This is Part 4 of the “Inside Claude Code” series.
