Inside Claude Code: Remembering What Matters

Your AI Has Amnesia Every 200,000 Tokens

Every conversation starts fresh, and every long conversation eventually hits a wall. The model cannot see beyond its context window. When the window fills up, something has to go. The question is: what do you keep, and what do you forget?

This is the context management problem, and it is one of the hardest design challenges in production AI tooling. Get it wrong, and your agent forgets the file it just edited, the bug it already diagnosed, or the architectural decision the user made forty messages ago. Get it right, and your agent feels like a colleague with a good notebook — someone who tracks the important things, lets go of the noise, and picks up exactly where they left off.

I found that Claude Code solves this problem with a layered system: carefully assembled context at the start of every conversation, automatic compaction when the window gets full, persistent session memory that survives across compactions, and a background “dreaming” process that consolidates learnings across sessions. Let me walk through each layer.

The Problem: Linear Growth, Finite Window

A conversation with Claude Code is not just chat messages. Every turn adds system information, user messages, assistant responses, tool calls, and tool results. A single file read might add thousands of tokens. A bash command with verbose output can add tens of thousands. The context window is large — 200K tokens for most models — but it is not infinite, and conversations that involve real engineering work (debugging, refactoring, exploring a codebase) can burn through it faster than you might expect.

The core tension: early messages often contain the most important context. The user’s original request, the architectural decisions, the files that were identified as relevant — these tend to appear in the first few turns. But as the conversation grows, these early messages are the ones furthest from the “head” of the context. Without intervention, the model loses access to the foundational decisions that should be guiding every subsequent action.

How Claude Code Solves It

Layer 1: Context Assembly

Every API call to Claude begins with the same question: what goes into the prompt? Claude Code assembles context from multiple sources before the conversation history even starts.

The assembly happens in src/context.ts, where two memoized functions — getSystemContext() and getUserContext() — gather the environment:

System context includes git status (current branch, recent commits, working tree state), truncated to 2,000 characters to prevent large repos from dominating the window
User context includes CLAUDE.md files (global from ~/.claude/CLAUDE.md and project-level), the current date, and any injected memory files
Conversation history is the actual back-and-forth between user and assistant, including all tool calls and their results

Figure: Context Assembly. System context (OS / shell / env, git status) and user context (global CLAUDE.md, project CLAUDE.md, current date, memory files) feed into assembly, where they are combined with the system prompt, tools, and conversation history to form the API prompt sent to Claude.

What’s interesting is that both getSystemContext and getUserContext are wrapped in lodash memoize. They run once and cache the result for the lifetime of the session. This means git status reflects the state at conversation start, not the current state. The memoization prevents redundant expensive shell calls, but it also means context can become stale. The setSystemPromptInjection function explicitly calls cache.clear() on both memoized functions when it needs to force a refresh — a pattern that works but is fragile.
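To make the staleness concrete, here is a minimal sketch using a hand-rolled stand-in for lodash memoize so the example stays self-contained. The memoizeOnce helper, the fake readGitBranch function, and the branch names are all hypothetical; only the run-once-then-cache behavior and the manual cache clear mirror what is described above.

```typescript
// Hypothetical stand-in for a zero-argument memoized function with a
// clearable cache. This is not lodash, just the same run-once behavior.
function memoizeOnce<T>(fn: () => T): { get: () => T; clear: () => void } {
  let cached: { value: T } | undefined;
  return {
    get: () => {
      if (cached === undefined) {
        cached = { value: fn() };
      }
      return cached.value;
    },
    clear: () => {
      cached = undefined;
    },
  };
}

// Simulated environment: the "current branch" changes mid-session.
let currentBranch = "main";
let shellCalls = 0;
const readGitBranch = (): string => {
  shellCalls += 1; // stands in for an expensive shell call
  return currentBranch;
};

const systemContext = memoizeOnce(readGitBranch);

const atStart = systemContext.get(); // runs the "shell call" once
currentBranch = "feature/login"; // the user switches branches
const stale = systemContext.get(); // still "main": served from the cache

systemContext.clear(); // explicit invalidation, as with cache.clear()
const fresh = systemContext.get(); // "feature/login": recomputed
```

The second read costs nothing but is wrong about the branch, and nothing signals that. Only the code path that remembers to call clear() gets a fresh value.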

Layer 2: Compaction — The Sawtooth Pattern

As the conversation grows, token usage increases roughly linearly. At some point, it crosses a threshold, and Claude Code triggers compaction: a process that summarizes older messages into a shorter form, freeing up space for new ones.

The result is a sawtooth pattern:

Figure: Compaction Sawtooth Pattern. In cycle 1, tokens grow linearly until the threshold is reached, compaction is triggered, and tokens drop. In cycle 2, tokens grow again and the same sequence repeats.

The threshold logic lives in src/services/compact/autoCompact.ts. The getAutoCompactThreshold() function calculates when to trigger based on the effective context window minus a buffer (AUTOCOMPACT_BUFFER_TOKENS = 13,000). The effective window itself is the model’s context window minus reserved tokens for the summary output (MAX_OUTPUT_TOKENS_FOR_SUMMARY = 20,000). This double subtraction ensures there is always room for both the compaction summary and a few more turns of conversation after compaction completes.
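The double subtraction is easy to check by hand. The sketch below uses the two constants quoted above and a 200K-token window. The helper names are hypothetical, and the real getAutoCompactThreshold() may apply adjustments not described here.

```typescript
// Constants as quoted in the article.
const AUTOCOMPACT_BUFFER_TOKENS = 13_000;
const MAX_OUTPUT_TOKENS_FOR_SUMMARY = 20_000;

// Effective window: what is left after reserving room for the summary output.
function effectiveContextWindow(contextWindow: number): number {
  return contextWindow - MAX_OUTPUT_TOKENS_FOR_SUMMARY;
}

// Auto-compact threshold: the effective window minus the safety buffer.
function autoCompactThreshold(contextWindow: number): number {
  return effectiveContextWindow(contextWindow) - AUTOCOMPACT_BUFFER_TOKENS;
}

const threshold = autoCompactThreshold(200_000); // 167_000
```

Under these assumptions, a 200K-token model would start compacting at roughly 167K tokens, leaving 33K of headroom split between the summary and the turns that follow it.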

The calculateTokenWarningState() function provides graduated pressure levels:

Warning threshold: context is getting full, UI shows a yellow indicator
Error threshold: context is dangerously full, UI shows a red indicator
Auto-compact threshold: compaction triggers automatically
Blocking limit: conversation is blocked until compaction completes

✅ Smart Pattern: This graduated approach means the system does not jump from “everything is fine” to “emergency compaction.” The user sees pressure building and can manually compact (via /compact) before the automatic trigger fires.

Layer 3: How Compaction Actually Works

When auto-compact triggers, the process is surprisingly sophisticated. The key insight: compaction runs through a forked agent context with isolated history and tool permissions.

Figure: Compaction Sequence.

User: sends a message to the main agent
Main: checks the token count against the threshold and finds it above the threshold
Fork: spawns a forked agent context, strips images, strips attachments, and sends the history to the API with a “summarize this” request
API: returns the compressed summary
Fork: replaces the old messages with the summary, re-injects files (top 5), and re-injects skills
Main: swaps in the compacted history, and the conversation continues with a smaller context

The compactConversation() function orchestrates the flow. Before summarization, it strips images (stripImagesFromMessages) and re-injected attachments — these are not needed for the summary and would waste tokens. After compaction, it re-injects the most recently read files (up to POST_COMPACT_MAX_FILES_TO_RESTORE = 5, budgeted at POST_COMPACT_TOKEN_BUDGET = 50,000 tokens) and skill definitions (budgeted separately at POST_COMPACT_SKILLS_TOKEN_BUDGET = 25,000 tokens). This ensures the model still knows about the files it was actively working with, even though the conversation history that originally contained those file reads has been summarized away.
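One way the file re-injection step could work is sketched below. The two constants come from the text above; the ReadFile shape, the selectFilesToRestore name, and the choice to skip a file that does not fit (rather than stop at it) are assumptions made for illustration, not the actual implementation.

```typescript
// Constants as quoted in the article.
const POST_COMPACT_MAX_FILES_TO_RESTORE = 5;
const POST_COMPACT_TOKEN_BUDGET = 50_000;

// Hypothetical record of a file the agent has read during the session.
interface ReadFile {
  path: string;
  tokens: number;
  lastReadAt: number; // larger means more recent
}

// Pick the most recently read files that fit both the count cap and the budget.
function selectFilesToRestore(files: ReadFile[]): ReadFile[] {
  const byRecency = [...files].sort((a, b) => b.lastReadAt - a.lastReadAt);
  const selected: ReadFile[] = [];
  let used = 0;
  for (const file of byRecency) {
    if (selected.length >= POST_COMPACT_MAX_FILES_TO_RESTORE) break;
    // Assumption: a file that would exceed the budget is skipped, not fatal.
    if (used + file.tokens > POST_COMPACT_TOKEN_BUDGET) continue;
    selected.push(file);
    used += file.tokens;
  }
  return selected;
}

const restored = selectFilesToRestore([
  { path: "a.ts", tokens: 10_000, lastReadAt: 1 },
  { path: "b.ts", tokens: 45_000, lastReadAt: 2 },
  { path: "c.ts", tokens: 8_000, lastReadAt: 3 },
  { path: "d.ts", tokens: 30_000, lastReadAt: 4 },
]);
// restored paths: d.ts, c.ts, a.ts (b.ts would push the total past 50,000)
```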

⚠️ Watch Out: The circuit breaker is worth highlighting. The AutoCompactTrackingState type includes a consecutiveFailures counter. After MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3 failures, auto-compact stops trying. A comment in the source explains: “1,279 sessions had 50+ consecutive failures (up to 3,272) in a single session, wasting ~250K API calls/day globally.”
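The circuit breaker can be sketched in a few lines. The constant and the consecutiveFailures field are named in the text above; the two helper functions and the reset-on-success behavior are assumptions, and the real AutoCompactTrackingState type may carry more fields than shown.

```typescript
// Constant as quoted in the article.
const MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3;

// Only the field mentioned in the article is modeled here.
interface AutoCompactTrackingState {
  consecutiveFailures: number;
}

// Hypothetical helper: the breaker is open once the failure cap is reached.
function shouldAttemptAutoCompact(state: AutoCompactTrackingState): boolean {
  return state.consecutiveFailures < MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES;
}

// Hypothetical helper: a success resets the counter (assumed), a failure increments it.
function recordResult(
  state: AutoCompactTrackingState,
  succeeded: boolean
): AutoCompactTrackingState {
  return { consecutiveFailures: succeeded ? 0 : state.consecutiveFailures + 1 };
}

let state: AutoCompactTrackingState = { consecutiveFailures: 0 };
for (let i = 0; i < 3; i++) {
  state = recordResult(state, false);
}
const allowedAfterThreeFailures = shouldAttemptAutoCompact(state); // false
```

Without a cap like this, a session whose compaction always fails would retry on every turn, which is the failure mode the quoted source comment describes.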

Layer 4: Session Memory — Surviving Compaction

Compaction solves the space problem but creates an information problem. The summary is lossy. Details about specific files, exact error messages, user preferences expressed early in the conversation — these can be compressed away. Session memory is the answer.

src/services/SessionMemory/sessionMemory.ts implements a background process that periodically extracts key information from the conversation into a structured markdown file. The extraction runs as (yet another) forked agent, triggered when two conditions are met:

Token threshold: at least minimumMessageTokensToInit = 10,000 tokens have been used
Growth threshold: at least minimumTokensBetweenUpdate = 5,000 tokens of new context added since last extraction, AND at least toolCallsBetweenUpdates = 3 tool calls have occurred
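The two conditions above combine into a single predicate. This is a sketch under stated assumptions: the three constants are the values quoted above, while the ExtractionProgress shape and the function name are hypothetical.

```typescript
// Thresholds as quoted in the article.
const minimumMessageTokensToInit = 10_000;
const minimumTokensBetweenUpdate = 5_000;
const toolCallsBetweenUpdates = 3;

// Hypothetical bookkeeping for one conversation.
interface ExtractionProgress {
  totalTokens: number;
  tokensAtLastExtraction: number;
  toolCallsSinceLastExtraction: number;
}

function shouldExtractSessionMemory(p: ExtractionProgress): boolean {
  // Token threshold: the conversation must be large enough to be worth summarizing.
  const initialized = p.totalTokens >= minimumMessageTokensToInit;
  // Growth threshold: enough new tokens AND enough tool calls since the last run.
  const grewEnough =
    p.totalTokens - p.tokensAtLastExtraction >= minimumTokensBetweenUpdate;
  const enoughToolCalls =
    p.toolCallsSinceLastExtraction >= toolCallsBetweenUpdates;
  return initialized && grewEnough && enoughToolCalls;
}

const firstRun = shouldExtractSessionMemory({
  totalTokens: 12_000,
  tokensAtLastExtraction: 0,
  toolCallsSinceLastExtraction: 3,
}); // true
```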

The template structure is deliberate:

Session Title — a distinctive 5-10 word description
Current State — what is actively being worked on right now
Task Specification — what the user asked to build
Files and Functions — important files and why they matter
Workflow — bash commands and their typical order
Errors and Corrections — what failed and what the user corrected
Codebase and System Documentation — how components fit together
Learnings — what worked, what did not, what to avoid
Key Results — exact outputs the user requested
Worklog — step-by-step summary of what was done

Each section is capped at roughly 2,000 tokens, with the total session memory budget at MAX_TOTAL_SESSION_MEMORY_TOKENS = 12,000. The “Current State” section is particularly clever: it is the first thing a post-compaction model reads to re-orient itself.

Layer 5: Dreaming — Cross-Session Consolidation

Session memory operates within a single conversation. But what about knowledge that spans sessions? If you worked on a codebase yesterday and come back today, how does the agent know what it learned?

The “dream” system handles cross-session memory consolidation:

Figure: Dream Lifecycle. Idle → Gate Check → Starting → Updating → Complete → Idle. On the fail path, Starting or Updating → Failed → Idle, with the lock rolled back.

Gate Check conditions: at least 24 hours since the last consolidation, at least 5 session transcripts since the last consolidation, and no other process dreaming (lock).

The gating is designed to be cheap. The check runs on every post-sampling hook (after every model response), but it short-circuits early. First it checks a time gate: has it been at least minHours = 24 hours since the last consolidation? Then a session gate: have at least minSessions = 5 new session transcripts been created since then? Only if both pass does it attempt to acquire a filesystem lock (preventing concurrent dreams across terminal tabs).
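The ordering of the checks is what keeps the gate cheap, and a short sketch shows it. The minHours and minSessions values are the ones quoted above; the DreamGateInput shape, the result labels, and the tryAcquireLock callback are hypothetical stand-ins for whatever the real code uses.

```typescript
// Gate values as quoted in the article.
const minHours = 24;
const minSessions = 5;

interface DreamGateInput {
  hoursSinceLastConsolidation: number;
  sessionsSinceLastConsolidation: number;
  // Hypothetical: returns false if another process already holds the lock.
  tryAcquireLock: () => boolean;
}

type GateResult = "too-soon" | "too-few-sessions" | "locked" | "start";

function checkDreamGate(input: DreamGateInput): GateResult {
  // Cheapest checks first; each early return avoids the work below it.
  if (input.hoursSinceLastConsolidation < minHours) return "too-soon";
  if (input.sessionsSinceLastConsolidation < minSessions) return "too-few-sessions";
  // Only now touch the filesystem lock.
  if (!input.tryAcquireLock()) return "locked";
  return "start";
}

let lockAttempts = 0;
const tryAcquireLock = (): boolean => {
  lockAttempts += 1;
  return true;
};

const early = checkDreamGate({
  hoursSinceLastConsolidation: 3,
  sessionsSinceLastConsolidation: 9,
  tryAcquireLock,
}); // "too-soon", and the lock is never attempted

const ready = checkDreamGate({
  hoursSinceLastConsolidation: 30,
  sessionsSinceLastConsolidation: 6,
  tryAcquireLock,
}); // "start"
```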

The dream prompt itself is a four-phase process:

Orient: Read the existing memory directory and index to understand what is already known
Gather: Look for new information worth persisting — from daily logs, existing memories that have drifted from reality, and narrow transcript searches
Consolidate: Write or update memory files, merging new signal into existing topics rather than creating duplicates
Prune and Index: Keep the entrypoint index file concise, remove stale pointers, resolve contradictions

The Memory Retention Strategy

What I found most impressive is how these layers form a coherent retention strategy:

Figure: Memory Retention Strategy.

Hot: Active conversation. Full detail, most recent. Compaction and extraction flow down from here; re-injected files flow back up.
Warm: Session memory. Structured notes, 12K tokens. Feeds into compaction.
Cool: Compaction summary. Lossy compression of old turns.
Cold: Dream memories. Cross-session, on disk. Loaded at startup into the hot tier.
Archive: Session transcripts. JSONL on disk, searchable. Read by the dream process to produce cold memories.

Forked agents for isolated memory work. Compaction, session memory extraction, and dreaming all use forked-agent contexts with constrained tools and separate histories. This isolation prevents memory operations from polluting the main conversation.

Structured session memory template. The session memory is not a freeform dump — it is a structured document with defined sections. This means the extraction agent produces consistent, scannable output.

Image stripping before compaction. Sending images to the summarization model is wasteful — they add significant token cost but contribute nothing to a text summary. The stripImagesFromMessages function replaces them with [image] markers.
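A rough sketch of the image-stripping idea follows. The message and content-block shapes are simplified assumptions, and stripImages is a hypothetical name; the real stripImagesFromMessages operates on the tool's own message types.

```typescript
// Simplified, hypothetical message shapes for illustration only.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "image"; data: string };

interface Message {
  role: "user" | "assistant";
  content: ContentBlock[];
}

// Return a copy of the history with every image replaced by an [image] marker.
function stripImages(messages: Message[]): Message[] {
  return messages.map((message) => ({
    ...message,
    content: message.content.map((block): ContentBlock =>
      block.type === "image" ? { type: "text", text: "[image]" } : block
    ),
  }));
}

const original: Message[] = [
  {
    role: "user",
    content: [
      { type: "text", text: "Why does this fail?" },
      { type: "image", data: "<base64 screenshot>" },
    ],
  },
];
const stripped = stripImages(original);
```

The marker keeps the fact that an image was present, so the summary can still mention it, while the payload itself never reaches the summarization call. The input array is left untouched, so the main conversation keeps its images.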

What Could Be Better

Memoization fragility. Both getSystemContext and getUserContext use lodash memoize with manual cache.clear() calls. There is no TTL, no automatic invalidation, no staleness detection. If someone adds a new reason to refresh context and forgets to add a cache.clear() call, the system silently serves stale data.

Threshold-based, not priority-based compaction. Compaction triggers when the token count crosses a threshold, and it summarizes the oldest messages first. But “oldest” and “least important” are not the same thing. The user’s original task description, stated in message 1, might be more important than a file read in message 50. A priority-aware compaction system could tag messages with importance scores and preserve high-priority messages longer, even if they are older.
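To illustrate what a priority-aware alternative might look like, here is a purely hypothetical sketch. Nothing in it reflects existing Claude Code behavior: the importance score, the ScoredMessage shape, and the selection rule are all invented for the example.

```typescript
// Hypothetical: each message carries an importance score between 0 and 1.
interface ScoredMessage {
  id: number; // position in the conversation, 1 = oldest
  tokens: number;
  importance: number;
}

// Choose which messages to summarize: lowest importance first,
// oldest first on ties, until enough tokens would be freed.
function selectForSummarization(
  messages: ScoredMessage[],
  tokensToFree: number
): number[] {
  const candidates = [...messages].sort(
    (a, b) => a.importance - b.importance || a.id - b.id
  );
  const chosen: number[] = [];
  let freed = 0;
  for (const message of candidates) {
    if (freed >= tokensToFree) break;
    chosen.push(message.id);
    freed += message.tokens;
  }
  return chosen.sort((a, b) => a - b);
}

const toSummarize = selectForSummarization(
  [
    { id: 1, tokens: 500, importance: 0.9 }, // original task description
    { id: 2, tokens: 8_000, importance: 0.2 }, // large file read
    { id: 3, tokens: 6_000, importance: 0.1 }, // verbose bash output
    { id: 4, tokens: 300, importance: 0.8 }, // architectural decision
  ],
  10_000
);
// toSummarize is [2, 3]: the task description in message 1 survives
```

The hard part, which this sketch skips entirely, is producing the importance scores in the first place.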

Session memory extraction is volume-based, not event-based. The extraction triggers on token growth and tool call counts, not on the semantic importance of what just happened. A user could make a critical architectural decision that is not extracted for several more turns simply because the token threshold has not been met.

Dream frequency is wall-clock-based. Because of the 24-hour gate, a developer who runs five intensive sessions in one afternoon shortly after a consolidation will not get another one until the 24 hours have elapsed.

The Takeaway

Treat context as a precious resource — have explicit strategies for retention priority, not just size-based eviction.

Claude Code’s context management is a case study in pragmatic engineering. The system works remarkably well in practice: conversations can run for hours, compaction happens transparently, session memory preserves critical state, and dreams consolidate learnings across sessions. The architecture is layered, each layer addressing a different time horizon (current turn, current session, cross-session, permanent).

But the design also reveals the fundamental tension in all context management systems: you are making lossy compression decisions about information whose future relevance you cannot predict. The current system uses position (oldest first) and structure (predefined memory sections) as proxies for importance. The next generation will likely use learned importance signals — attention patterns, user corrections, task completion rates — to make smarter retention decisions.

For your own AI applications, the lesson is clear. Do not wait until the context window is full to think about memory management. Design your retention strategy upfront: what categories of information matter most? How will you detect when important context is about to be evicted? What survives compaction, and what is allowed to fade? The answers to these questions will determine whether your AI agent feels like a thoughtful colleague or a goldfish with a keyboard.

This is Part 7 of the “Inside Claude Code” series.
