Agent Daily
tutorial • intermediate

Session memory compaction
Jan 2026 • Agent Patterns
Manage long-running Claude conversations with instant session memory compaction using background threading and prompt caching.


This cookbook teaches developers how to manage long-running Claude conversations by implementing session memory compaction using background threading and prompt caching. Rather than waiting for context limits to be exceeded (reactive approach), the pattern enables instant compaction by proactively building summaries in the background. The guide covers writing effective session memory prompts, implementing background threading for zero-latency compaction, and applying prompt caching to reduce costs by ~80%. It includes Python code examples demonstrating both traditional (slow) and instant (fast) compaction strategies for conversational applications.

Key Points

  • Implement proactive background memory compaction instead of reactive approaches—generate summaries before context limits are hit to eliminate user wait time
  • Use prompt caching on session memory summaries to reduce API costs by approximately 80% through message prefix reuse
  • Structure session memory prompts with analysis instructions, summary format sections (User Intent, Completed Work, Errors & Corrections, Active Work, Pending Tasks, Key References), and preservation rules
  • Always preserve exact identifiers (IDs, paths, URLs, keys), error messages verbatim, user corrections, specific values, and the precise state of in-progress work during compression
  • Weight recent messages more heavily during compression—omit pleasantries and filler while prioritizing user corrections > errors > active work > completed work
  • Apply cache_control with ephemeral type to the last user message in the message list to enable prompt caching for background memory updates
  • Choose between traditional compaction (single summary when context full) and instant compaction (continuous background updates) based on use case requirements
  • Distinguish between explicitly requested tasks and implied/assumed tasks in pending items to maintain clarity on user expectations
  • Use background threading to generate summaries asynchronously so compaction is transparent to the user experience
  • Estimate token usage and implement context limit thresholds (e.g., 10,000 tokens) to trigger compaction before hitting hard API limits



Artifacts (3)

session_memory_prompttemplate
Compress the conversation into a structured summary that preserves all information needed to continue work seamlessly. Optimize for the assistant's ability to continue working, not human readability.

<analysis-instructions>
Before generating your summary, analyze the transcript in <think>...</think> tags:
1. What did the user originally request? (Exact phrasing)
2. What actions succeeded? What failed and why?
3. Did the user correct or redirect the assistant at any point?
4. What was actively being worked on at the end?
5. What tasks remain incomplete or pending?
6. What specific details (IDs, paths, values, names) must survive compression?
</analysis-instructions>

<summary-format>
## User Intent
The user's original request and any refinements. Use direct quotes for key requirements.

## Completed Work
Actions successfully performed with exact identifiers and values.

## Errors & Corrections
Problems encountered, failed approaches, and user corrections verbatim.

## Active Work
What was in progress with direct quotes showing exactly where work left off.

## Pending Tasks
Remaining items distinguished between explicitly requested and implied.

## Key References
Identifiers, values, context, and citations needed to continue.
</summary-format>

<preserve-rules>
Always preserve: exact identifiers, error messages verbatim, user corrections, specific values, technical constraints, precise state of in-progress work.
</preserve-rules>

<compression-rules>
Weight recent messages heavily, omit pleasantries, keep sections under 500 words, prioritize: user corrections > errors > active work > completed work.
</compression-rules>
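
Applying the template might look like the following sketch. The model id, `max_tokens`, the transcript rendering, and both function names are illustrative assumptions; `client` is assumed to be an `anthropic.Anthropic` instance.

```python
# Placeholder standing in for the full template text above.
SESSION_MEMORY_PROMPT = "Compress the conversation into a structured summary..."

def render_transcript(transcript: list[dict]) -> str:
    """Flatten a message list into a plain-text transcript for summarization."""
    return "\n\n".join(f"[{m['role']}]: {m['content']}" for m in transcript)

def generate_session_summary(client, transcript: list[dict]) -> str:
    """Ask the model to compress a transcript with the session memory prompt.

    `client` is assumed to be an anthropic.Anthropic instance; model id and
    max_tokens are placeholder choices, not cookbook values.
    """
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        system=SESSION_MEMORY_PROMPT,
        messages=[{
            "role": "user",
            "content": f"<transcript>\n{render_transcript(transcript)}\n</transcript>",
        }],
    )
    return response.content[0].text
```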

helper_functions (python script)
import anthropic
from anthropic.types import MessageParam, TextBlockParam
import re

def truncate_response(text: str, max_lines: int = 15) -> str:
    """Truncate long responses for cleaner output display."""
    lines = text.strip().split("\n")
    if len(lines) <= max_lines:
        return text
    return "\n".join(lines[:max_lines]) + f"\n... ({len(lines) - max_lines} more lines)"

def remove_thinking_blocks(text: str) -> tuple[str, str]:
    """Remove <think>...</think> blocks from the text."""
    matches = re.findall(r"<think>.*?</think>", text, flags=re.DOTALL)
    cleaned = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
    return cleaned, "".join(matches)

def add_cache_control(messages: list[dict]) -> list[MessageParam]:
    """Add cache_control to the last user message for prompt caching."""
    cached_messages: list[MessageParam] = []
    last_user_idx = None
    
    for i, msg in enumerate(messages):
        if msg["role"] == "user":
            last_user_idx = i
    
    for i, msg in enumerate(messages):
        content = msg["content"]
        text = content if isinstance(content, str) else content[0]["text"]
        content_block: TextBlockParam = {"type": "text", "text": text}
        
        if i == last_user_idx:
            content_block["cache_control"] = {"type": "ephemeral"}
        
        cached_messages.append({
            "role": msg["role"],
            "content": [content_block]
        })
    
    return cached_messages

def estimate_tokens(text: str) -> int:
    """Rudimentary token estimation: 1 token per 4 characters."""
    return len(text) // 4
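
Applied to a short exchange, `add_cache_control` produces message lists of the following shape (hand-written here as a self-contained illustration; the message texts are invented). The ephemeral marker on the last user message is what lets the next background summary call reuse the cached prefix.

```python
# Every message becomes a list of text content blocks, and the final
# user block carries an ephemeral cache_control marker.
cached_messages = [
    {"role": "user",
     "content": [{"type": "text", "text": "Refactor utils.py into modules."}]},
    {"role": "assistant",
     "content": [{"type": "text", "text": "Done: split into two modules."}]},
    {"role": "user",
     "content": [{"type": "text",
                  "text": "Now summarize the session so far.",
                  "cache_control": {"type": "ephemeral"}}]},
]
```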

traditional_compacting_chat_session (python script)
import time

# Relies on estimate_tokens from helper_functions above.

class TraditionalCompactingChatSession:
    """Traditional chat session with compaction after the fact."""
    
    def __init__(self, system_message="You are a helpful assistant", context_limit: int = 10000):
        self.system_message = system_message
        self.context_limit = context_limit
        self.messages = []
        self.token_count = 0
    
    def add_message(self, role: str, content: str):
        """Add a message and check if compaction is needed."""
        self.messages.append({"role": role, "content": content})
        self.token_count += estimate_tokens(content)
        
        if self.token_count >= self.context_limit:
            self._compact_memory()
    
    def _compact_memory(self):
        """Generate summary when context limit is reached (USER WAITS)."""
        print("Context limit reached. Generating summary...")
        start_time = time.time()
        
        # Simulate API call to generate summary
        summary = self._generate_summary()
        
        elapsed = time.time() - start_time
        print(f"Summary generated in {elapsed:.2f}s (user experienced wait)")
        
        # Replace old messages with summary
        self.messages = [
            {"role": "user", "content": f"Previous context summary:\n{summary}"},
            {"role": "assistant", "content": "Context loaded."}
        ]
        self.token_count = estimate_tokens(summary) + 50
    
    def _generate_summary(self) -> str:
        """Generate a summary of the conversation."""
        # Placeholder for actual summary generation
        return "Conversation summary placeholder"