
Inside Claude Code: Remembering What Matters

Context window management, compaction via forked agents, session memory extraction, and the art of forgetting gracefully. Part 7 of 10.

Your AI Has Amnesia Every 200,000 Tokens

Every conversation starts fresh, and every long conversation eventually hits a wall. The model cannot see beyond its context window. When the window fills up, something has to go. The question is: what do you keep, and what do you forget?

This is the context management problem, and it is one of the hardest design challenges in production AI tooling. Get it wrong, and your agent forgets the file it just edited, the bug it already diagnosed, or the architectural decision the user made forty messages ago. Get it right, and your agent feels like a colleague with a good notebook — someone who tracks the important things, lets go of the noise, and picks up exactly where they left off.

I found that Claude Code solves this problem with a layered system: carefully assembled context at the start of every conversation, automatic compaction when the window gets full, persistent session memory that survives across compactions, and a background “dreaming” process that consolidates learnings across sessions. Let me walk through each layer.

The Problem: Linear Growth, Finite Window

A conversation with Claude Code is not just chat messages. Every turn adds system information, user messages, assistant responses, tool calls, and tool results. A single file read might add thousands of tokens. A bash command with verbose output can add tens of thousands. The context window is large — 200K tokens for most models — but it is not infinite, and conversations that involve real engineering work (debugging, refactoring, exploring a codebase) can burn through it faster than you might expect.

The core tension: early messages often contain the most important context. The user’s original request, the architectural decisions, the files that were identified as relevant — these tend to appear in the first few turns. But as the conversation grows, these early messages are the ones furthest from the “head” of the context. Without intervention, the model loses access to the foundational decisions that should be guiding every subsequent action.

How Claude Code Solves It

Layer 1: Context Assembly

Every API call to Claude begins with the same question: what goes into the prompt? Claude Code assembles context from multiple sources before the conversation history even starts.

The assembly happens in src/context.ts, where two memoized functions — getSystemContext() and getUserContext() — gather the environment:

  • System context includes git status (current branch, recent commits, working tree state), truncated to 2,000 characters to prevent large repos from dominating the window
  • User context includes CLAUDE.md files (global from ~/.claude/CLAUDE.md and project-level), the current date, and any injected memory files
  • Conversation history is the actual back-and-forth between user and assistant, including all tool calls and their results

[Figure: Context Assembly. System context (OS / shell / env, git status) and user context (global CLAUDE.md, project CLAUDE.md, current date, memory files) are assembled together with the system prompt, tools, and conversation history into the API prompt sent to Claude.]

What’s interesting is that both getSystemContext and getUserContext are wrapped in lodash memoize. They run once and cache the result for the lifetime of the session. This means git status reflects the state at conversation start, not the current state. The memoization prevents redundant expensive shell calls, but it also means context can become stale. The setSystemPromptInjection function explicitly calls cache.clear() on both memoized functions when it needs to force a refresh — a pattern that works but is fragile.
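
To make the staleness concrete, here is a minimal sketch of the pattern. It uses a hand-rolled `memoizeOnce` helper instead of lodash so that it stands alone, and `readGitStatus` is a hypothetical stand-in for the real shell call. None of these names come from src/context.ts; they are assumptions for illustration only.

```typescript
// Hypothetical helper: caches the first result until clear() is called.
// It plays the same role as a memoized zero-argument function with a clearable cache.
function memoizeOnce<T>(compute: () => T): { get: () => T; clear: () => void } {
  let cached: { value: T } | undefined;
  return {
    get: () => {
      if (cached === undefined) {
        cached = { value: compute() };
      }
      return cached.value;
    },
    clear: () => {
      cached = undefined;
    },
  };
}

// Hypothetical stand-in for an expensive `git status` call.
let currentBranch = "main";
let shellCalls = 0;
function readGitStatus(): string {
  shellCalls += 1;
  return `On branch ${currentBranch}`;
}

const systemContext = memoizeOnce(readGitStatus);

const atStart = systemContext.get();       // runs the "shell call" once
currentBranch = "feature/compaction";      // the repo changes mid-session
const afterCheckout = systemContext.get(); // still the cached, stale value
systemContext.clear();                     // the manual refresh step
const afterClear = systemContext.get();    // recomputed after the clear
```

The second read returns the branch from the start of the session even though the repository has moved on, which is exactly the trade the article describes: fewer shell calls in exchange for a snapshot that only refreshes when someone remembers to clear it.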

Layer 2: Compaction — The Sawtooth Pattern

As the conversation grows, token usage increases roughly linearly. At some point, it crosses a threshold, and Claude Code triggers compaction: a process that summarizes older messages into a shorter form, freeing up space for new ones.

The result is a sawtooth pattern:

[Figure: Compaction sawtooth pattern. In each cycle, tokens grow linearly until the threshold is reached, compaction is triggered, and the token count drops. The next cycle then repeats the same shape.]

The threshold logic lives in src/services/compact/autoCompact.ts. The getAutoCompactThreshold() function calculates when to trigger based on the effective context window minus a buffer (AUTOCOMPACT_BUFFER_TOKENS = 13,000). The effective window itself is the model’s context window minus reserved tokens for the summary output (MAX_OUTPUT_TOKENS_FOR_SUMMARY = 20,000). This double subtraction ensures there is always room for both the compaction summary and a few more turns of conversation after compaction completes.
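
The double subtraction is easy to check by hand. The sketch below uses the two constants quoted above; the function names `effectiveContextWindow` and `autoCompactThresholdFor` are hypothetical and are not the signatures in autoCompact.ts.

```typescript
// Constants as quoted in the article.
const AUTOCOMPACT_BUFFER_TOKENS = 13000;
const MAX_OUTPUT_TOKENS_FOR_SUMMARY = 20000;

// Hypothetical helper: the window left once room for the summary output is reserved.
function effectiveContextWindow(modelContextWindow: number): number {
  return modelContextWindow - MAX_OUTPUT_TOKENS_FOR_SUMMARY;
}

// Hypothetical helper: the token count at which auto-compaction would fire.
function autoCompactThresholdFor(modelContextWindow: number): number {
  return effectiveContextWindow(modelContextWindow) - AUTOCOMPACT_BUFFER_TOKENS;
}

// For a 200K window: 200000 - 20000 - 13000 = 167000.
const thresholdFor200k = autoCompactThresholdFor(200000);
```

Under these numbers, a 200K-token model would begin compacting at 167,000 tokens, leaving 33,000 tokens of headroom split between the summary output and the turns that follow.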

The calculateTokenWarningState() function provides graduated pressure levels:

  • Warning threshold: context is getting full, UI shows a yellow indicator
  • Error threshold: context is dangerously full, UI shows a red indicator
  • Auto-compact threshold: compaction triggers automatically
  • Blocking limit: conversation is blocked until compaction completes

Smart Pattern: This graduated approach means the system does not jump from “everything is fine” to “emergency compaction.” The user sees pressure building and can manually compact (via /compact) before the automatic trigger fires.
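
A graduated check of this kind can be sketched as a simple ladder of comparisons. The article does not give the warning, error, or blocking values, so every number in `exampleLimits` is an invented placeholder, and the assumption that the four levels are strictly ascending is mine. `classifyPressure` is a hypothetical name, not calculateTokenWarningState().

```typescript
type PressureLevel = "ok" | "warning" | "error" | "autoCompact" | "blocked";

interface PressureLimits {
  warningAt: number;
  errorAt: number;
  autoCompactAt: number;
  blockingAt: number;
}

// Illustrative values only; the real thresholds are not stated in the article.
const exampleLimits: PressureLimits = {
  warningAt: 140000,
  errorAt: 155000,
  autoCompactAt: 167000,
  blockingAt: 180000,
};

// Checks the most severe level first so each token count maps to one level.
function classifyPressure(tokensUsed: number, limits: PressureLimits): PressureLevel {
  if (tokensUsed >= limits.blockingAt) return "blocked";
  if (tokensUsed >= limits.autoCompactAt) return "autoCompact";
  if (tokensUsed >= limits.errorAt) return "error";
  if (tokensUsed >= limits.warningAt) return "warning";
  return "ok";
}
```

Returning a single level rather than four independent booleans keeps the UI logic simple: the indicator color and the decision to compact both derive from one value.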

Layer 3: How Compaction Actually Works

When auto-compact triggers, the process is surprisingly sophisticated. The key insight: compaction runs through a forked agent context with isolated history and tool permissions.

Compaction sequence:

  1. The user sends a message to the main agent.
  2. The main agent checks the token count against the threshold and finds it above.
  3. A forked agent context is spawned. Images and attachments are stripped, and the remaining history is sent to the API with a request to summarize it.
  4. The API returns a compressed summary.
  5. The fork replaces the old messages with the summary, then re-injects files (top 5) and skills.
  6. The main agent swaps in the compacted history, and the conversation continues with a smaller context.

The compactConversation() function orchestrates the flow. Before summarization, it strips images (stripImagesFromMessages) and re-injected attachments — these are not needed for the summary and would waste tokens. After compaction, it re-injects the most recently read files (up to POST_COMPACT_MAX_FILES_TO_RESTORE = 5, budgeted at POST_COMPACT_TOKEN_BUDGET = 50,000 tokens) and skill definitions (budgeted separately at POST_COMPACT_SKILLS_TOKEN_BUDGET = 25,000 tokens). This ensures the model still knows about the files it was actively working with, even though the conversation history that originally contained those file reads has been summarized away.
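
The re-injection step amounts to a budgeted selection over recently read files. The sketch below shows one way to do it, using the file count and token budget quoted above. The `ReadFileRecord` shape, the `selectFilesToRestore` name, and the choice to skip a file that does not fit (rather than truncate it) are all assumptions, not details confirmed by the source.

```typescript
// Limits as quoted in the article.
const POST_COMPACT_MAX_FILES_TO_RESTORE = 5;
const POST_COMPACT_TOKEN_BUDGET = 50000;

// Hypothetical record of a file the agent has read during the session.
interface ReadFileRecord {
  path: string;
  tokens: number;
  lastReadAt: number; // larger means more recent
}

// Picks the most recently read files that fit within the count and token limits.
function selectFilesToRestore(records: ReadFileRecord[]): ReadFileRecord[] {
  const newestFirst = records.slice().sort((a, b) => b.lastReadAt - a.lastReadAt);
  const selected: ReadFileRecord[] = [];
  let remaining = POST_COMPACT_TOKEN_BUDGET;
  for (const record of newestFirst) {
    if (selected.length >= POST_COMPACT_MAX_FILES_TO_RESTORE) break;
    if (record.tokens > remaining) continue; // assumption: skip files that do not fit
    selected.push(record);
    remaining -= record.tokens;
  }
  return selected;
}
```

Sorting by recency first means the budget is spent on the files the model was most likely still working with when compaction fired.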

⚠️ Watch Out: The circuit breaker is worth highlighting. The AutoCompactTrackingState type includes a consecutiveFailures counter. After MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3 failures, auto-compact stops trying. A comment in the source explains: “1,279 sessions had 50+ consecutive failures (up to 3,272) in a single session, wasting ~250K API calls/day globally.”
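
A circuit breaker of this shape needs very little code. The sketch below keeps the `consecutiveFailures` counter and the limit of 3 from the text above; the helper names and the reset-to-zero on success are assumptions (the word "consecutive" suggests the reset, but the article does not spell it out).

```typescript
// Limit as quoted in the article.
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3;

// Simplified stand-in for the tracking state; the real type likely has more fields.
interface AutoCompactTracking {
  consecutiveFailures: number;
}

// Hypothetical helper: auto-compact is attempted only while the breaker is closed.
function shouldAttemptAutoCompact(state: AutoCompactTracking): boolean {
  return state.consecutiveFailures < MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES;
}

// Hypothetical helper: a success resets the counter, a failure increments it.
function recordCompactionResult(
  state: AutoCompactTracking,
  succeeded: boolean
): AutoCompactTracking {
  return { consecutiveFailures: succeeded ? 0 : state.consecutiveFailures + 1 };
}
```

The value of the breaker is that a session stuck in a failure loop costs at most three wasted API calls instead of thousands.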

Layer 4: Session Memory — Surviving Compaction

Compaction solves the space problem but creates an information problem. The summary is lossy. Details about specific files, exact error messages, user preferences expressed early in the conversation — these can be compressed away. Session memory is the answer.

src/services/SessionMemory/sessionMemory.ts implements a background process that periodically extracts key information from the conversation into a structured markdown file. The extraction runs as (yet another) forked agent, triggered when two conditions are met:

  1. Token threshold: at least minimumMessageTokensToInit = 10,000 tokens have been used
  2. Growth threshold: at least minimumTokensBetweenUpdate = 5,000 tokens of new context added since last extraction, AND at least toolCallsBetweenUpdates = 3 tool calls have occurred
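
The two conditions above can be sketched as a single predicate. The config field names and values are taken from the list; the `ExtractionProgress` shape and the `shouldExtractSessionMemory` name are hypothetical and not taken from sessionMemory.ts.

```typescript
interface ExtractionConfig {
  minimumMessageTokensToInit: number;
  minimumTokensBetweenUpdate: number;
  toolCallsBetweenUpdates: number;
}

// Values as quoted in the article.
const articleExtractionConfig: ExtractionConfig = {
  minimumMessageTokensToInit: 10000,
  minimumTokensBetweenUpdate: 5000,
  toolCallsBetweenUpdates: 3,
};

// Hypothetical bookkeeping the caller would maintain between extractions.
interface ExtractionProgress {
  totalTokens: number;
  tokensAtLastExtraction: number;
  toolCallsSinceLastExtraction: number;
}

function shouldExtractSessionMemory(
  progress: ExtractionProgress,
  config: ExtractionConfig
): boolean {
  // Condition 1: the conversation must be large enough to be worth summarizing.
  if (progress.totalTokens < config.minimumMessageTokensToInit) return false;
  // Condition 2: enough new tokens AND enough tool calls since the last extraction.
  const grownBy = progress.totalTokens - progress.tokensAtLastExtraction;
  return (
    grownBy >= config.minimumTokensBetweenUpdate &&
    progress.toolCallsSinceLastExtraction >= config.toolCallsBetweenUpdates
  );
}
```

Requiring tool calls as well as token growth avoids re-extracting after a long stretch of pure discussion in which nothing in the workspace changed.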

The template structure is deliberate:

  • Session Title — a distinctive 5-10 word description
  • Current State — what is actively being worked on right now
  • Task Specification — what the user asked to build
  • Files and Functions — important files and why they matter
  • Workflow — bash commands and their typical order
  • Errors and Corrections — what failed and what the user corrected
  • Codebase and System Documentation — how components fit together
  • Learnings — what worked, what did not, what to avoid
  • Key Results — exact outputs the user requested
  • Worklog — step-by-step summary of what was done

Each section is capped at roughly 2,000 tokens, with the total session memory budget at MAX_TOTAL_SESSION_MEMORY_TOKENS = 12,000. The “Current State” section is particularly clever: it is the first thing a post-compaction model reads to re-orient itself.

Layer 5: Dreaming — Cross-Session Consolidation

Session memory operates within a single conversation. But what about knowledge that spans sessions? If you worked on a codebase yesterday and come back today, how does the agent know what it learned?

The “dream” system handles cross-session memory consolidation:

[Figure: Dream lifecycle. Idle → Gate Check → Starting → Updating → Complete → Idle. Fail path: from Starting or Updating → Failed → Idle, with the lock rolled back. Gate Check conditions: hours since last dream >= 24, transcripts since last dream >= 5, and no other process dreaming.]

The gating is designed to be cheap. The check runs on every post-sampling hook (after every model response), but it short-circuits early. First it checks a time gate: has it been at least minHours = 24 hours since the last consolidation? Then a session gate: have at least minSessions = 5 new session transcripts been created since then? Only if both pass does it attempt to acquire a filesystem lock (preventing concurrent dreams across terminal tabs).
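
The ordering of the gates is what keeps the check cheap, and it can be sketched as follows. The `minHours` and `minSessions` values come from the text above; the input shape, the result labels, and the injected `tryAcquireLock` callback (standing in for the filesystem lock) are assumptions made for illustration.

```typescript
// Gate values as quoted in the article.
const DREAM_MIN_HOURS = 24;
const DREAM_MIN_SESSIONS = 5;

// Hypothetical input; the lock is injected so the sketch needs no filesystem.
interface DreamGateInput {
  hoursSinceLastDream: number;
  sessionsSinceLastDream: number;
  tryAcquireLock: () => boolean;
}

type DreamGateResult = "too-soon" | "too-few-sessions" | "lock-held" | "start";

// Cheapest checks run first; the lock is only touched if both gates pass.
function checkDreamGate(input: DreamGateInput): DreamGateResult {
  if (input.hoursSinceLastDream < DREAM_MIN_HOURS) return "too-soon";
  if (input.sessionsSinceLastDream < DREAM_MIN_SESSIONS) return "too-few-sessions";
  return input.tryAcquireLock() ? "start" : "lock-held";
}
```

Because the check runs after every model response, the common case has to cost almost nothing: two numeric comparisons and an early return, with no lock traffic at all.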

The dream prompt itself is a four-phase process:

  1. Orient: Read the existing memory directory and index to understand what is already known
  2. Gather: Look for new information worth persisting — from daily logs, existing memories that have drifted from reality, and narrow transcript searches
  3. Consolidate: Write or update memory files, merging new signal into existing topics rather than creating duplicates
  4. Prune and Index: Keep the entrypoint index file concise, remove stale pointers, resolve contradictions

The Memory Retention Strategy

What I found most impressive is how these layers form a coherent retention strategy:

Memory retention tiers:

  • Hot: Active conversation. Full detail, most recent. Compaction and extraction move content down from this tier, and compaction re-injects files back into it.
  • Warm: Session memory. Structured notes, 12K tokens. Feeds into compaction.
  • Cool: Compaction summary. Lossy compression of old turns.
  • Cold: Dream memories. Cross-session, on disk. Loaded at startup into the hot tier.
  • Archive: Session transcripts. JSONL on disk, searchable. Dreams read these to produce cold memories.

Forked agents for isolated memory work. Compaction, session memory extraction, and dreaming all use forked-agent contexts with constrained tools and separate histories. This isolation prevents memory operations from polluting the main conversation.

Structured session memory template. The session memory is not a freeform dump — it is a structured document with defined sections. This means the extraction agent produces consistent, scannable output.

Image stripping before compaction. Sending images to the summarization model is wasteful — they add significant token cost but contribute nothing to a text summary. The stripImagesFromMessages function replaces them with [image] markers.
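
As a rough illustration of the idea, the sketch below replaces image blocks with a text marker. The `ContentBlock` and `ChatMessage` types are simplified assumptions, and `stripImages` is a hypothetical function; it is not the actual signature or body of stripImagesFromMessages.

```typescript
// Simplified, assumed message shapes for illustration.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "image"; mediaType: string; data: string };

interface ChatMessage {
  role: "user" | "assistant";
  content: ContentBlock[];
}

// Returns new messages in which every image block becomes a "[image]" text block.
// The input messages are left unmodified.
function stripImages(messages: ChatMessage[]): ChatMessage[] {
  return messages.map((message) => ({
    role: message.role,
    content: message.content.map((block): ContentBlock =>
      block.type === "image" ? { type: "text", text: "[image]" } : block
    ),
  }));
}
```

Keeping a marker rather than deleting the block outright lets the summary still record that an image was present at that point in the conversation.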

What Could Be Better

Memoization fragility. Both getSystemContext and getUserContext use lodash memoize with manual cache.clear() calls. There is no TTL, no automatic invalidation, no staleness detection. If someone adds a new reason to refresh context and forgets to add a cache.clear() call, the system silently serves stale data.
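
One possible mitigation, sketched here as an assumption rather than anything present in Claude Code, is a memoizer whose entries expire on their own. `memoizeWithTtl` is a hypothetical helper, and the clock is injected so that the expiry behavior can be exercised without waiting.

```typescript
// Hypothetical TTL wrapper: recomputes once the cached entry is older than ttlMs.
// `now` returns the current time in milliseconds and is injected for testability.
function memoizeWithTtl<T>(
  compute: () => T,
  ttlMs: number,
  now: () => number
): () => T {
  let entry: { value: T; storedAt: number } | undefined;
  return () => {
    const currentTime = now();
    if (entry === undefined || currentTime - entry.storedAt >= ttlMs) {
      entry = { value: compute(), storedAt: currentTime };
    }
    return entry.value;
  };
}
```

In real use the third argument would typically be `() => Date.now()`. A TTL does not remove the need for explicit invalidation on known events, but it bounds how stale the context can get when someone forgets one.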

Threshold-based, not priority-based compaction. Compaction triggers when the token count crosses a threshold, and it summarizes the oldest messages first. But “oldest” and “least important” are not the same thing. The user’s original task description, stated in message 1, might be more important than a file read in message 50. A priority-aware compaction system could tag messages with importance scores and preserve high-priority messages longer, even if they are older.

Session memory extraction is volume-based, not event-based. The extraction triggers on token growth and tool call counts, not on the semantic importance of what just happened. A user could make a critical architectural decision that goes unextracted for several more turns simply because the token threshold has not been met.

Dream frequency is wall-clock-based. The 24-hour gate means that a developer who has five intensive sessions in one afternoon will not get dream consolidation until the next day.

The Takeaway

Treat context as a precious resource — have explicit strategies for retention priority, not just size-based eviction.

Claude Code’s context management is a case study in pragmatic engineering. The system works remarkably well in practice: conversations can run for hours, compaction happens transparently, session memory preserves critical state, and dreams consolidate learnings across sessions. The architecture is layered, each layer addressing a different time horizon (current turn, current session, cross-session, permanent).

But the design also reveals the fundamental tension in all context management systems: you are making lossy compression decisions about information whose future relevance you cannot predict. The current system uses position (oldest first) and structure (predefined memory sections) as proxies for importance. The next generation will likely use learned importance signals — attention patterns, user corrections, task completion rates — to make smarter retention decisions.

For your own AI applications, the lesson is clear. Do not wait until the context window is full to think about memory management. Design your retention strategy upfront: what categories of information matter most? How will you detect when important context is about to be evicted? What survives compaction, and what is allowed to fade? The answers to these questions will determine whether your AI agent feels like a thoughtful colleague or a goldfish with a keyboard.


This is Part 7 of the “Inside Claude Code” series.

← Part 6: Plugging Into Everything with MCP | Part 8: React in Your Terminal →