Context Management¶
Context management is one of the most important practical challenges in building agentic systems. Every LLM call has a context window — a maximum number of tokens it can process — and managing what goes into that window directly affects agent quality, cost, and reliability.
The Core Problem¶
graph TD
A[Conversation starts] --> B[Messages accumulate]
B --> C{Token count vs.<br/>context window?}
C -->|Under limit| D[All good — send everything]
C -->|Approaching limit| E[Need a strategy]
E --> E1[Compact/summarize]
E --> E2[Sliding window]
E --> E3[Structured state]
E --> E4[External storage]
The Chat Completions API is stateless — you send the full conversation on every call. As conversations grow, you hit limits:
- GPT-4o-mini: 128K tokens context window
- GPT-4o: 128K tokens context window
- Cost: You pay for all input tokens on every call
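To make the cost point concrete, here is a rough back-of-the-envelope sketch of how billed input tokens add up when the full history is resent on every call. The function name and the per-message token counts are illustrative assumptions, not measurements from a real model.

```python
def cumulative_input_tokens(turns: int, user_tokens: int, reply_tokens: int,
                            system_tokens: int = 0) -> int:
    """Total input tokens sent across `turns` calls when the whole history is resent each time.

    Assumes every user message and every reply has a fixed size (a simplification).
    """
    total = 0
    history = system_tokens           # the system prompt is part of every call
    for _ in range(turns):
        history += user_tokens        # the new user message joins the history
        total += history              # the entire history is sent as input
        history += reply_tokens       # the assistant reply joins the history
    return total


# 10 turns, 100-token questions, 300-token answers, 500-token system prompt
print(cumulative_input_tokens(10, 100, 300, 500))  # 24000
```

The final call in this example sends only 4,200 input tokens, but the ten calls together send 24,000, because earlier messages are paid for again on every later call.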
Four Context Strategies¶
1. Full Message History¶
Pass the entire conversation. Simple and lossless.
Used in: Single Agent, Group Chat (brainstorm)
Trade-offs: Simple but expensive and can hit token limits on long conversations.
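A minimal sketch of the full-history strategy might look like the following. The `ask` helper and the `complete` callable are hypothetical stand-ins for whatever function actually calls the model; the point is only that one list keeps growing and is sent whole each time.

```python
from typing import Callable

Message = dict[str, str]


def ask(history: list[Message], user_text: str,
        complete: Callable[[list[Message]], str]) -> str:
    """Append the user turn, send the ENTIRE history, then append the reply."""
    history.append({"role": "user", "content": user_text})
    reply = complete(history)  # hypothetical: wraps the real chat completion call
    history.append({"role": "assistant", "content": reply})
    return reply


# Usage with a stub in place of a real model call
history = [{"role": "system", "content": "You are a helpful assistant."}]
ask(history, "Hello", lambda msgs: f"(saw {len(msgs)} messages)")
ask(history, "And again", lambda msgs: f"(saw {len(msgs)} messages)")
```

Nothing is ever dropped, which is why this approach is lossless and also why its input size grows with every turn.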
2. Fresh Context Per Stage¶
Each agent gets a clean conversation with only the previous agent's output.
# Agent B gets only Agent A's output, not Agent A's full conversation
messages = [
    {"role": "system", "content": agent_b_prompt},
    {"role": "user", "content": agent_a_output},  # Just the output text
]
Used in: Sequential pattern
Trade-offs: Token-efficient but earlier context is lost. Must explicitly pass anything a later agent needs.
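Extended to a whole pipeline, the same idea could be sketched as a loop in which each stage starts from a two-message conversation. `run_pipeline` and the `complete` callable are hypothetical names used for illustration; `complete` stands in for the real model call.

```python
from typing import Callable


def run_pipeline(stage_prompts: list[str], initial_input: str,
                 complete: Callable[[list[dict]], str]) -> str:
    """Run stages in order; each stage sees only its prompt and the previous output."""
    current = initial_input
    for prompt in stage_prompts:
        messages = [
            {"role": "system", "content": prompt},
            {"role": "user", "content": current},  # previous stage's output only
        ]
        current = complete(messages)  # hypothetical: wraps the real chat completion call
    return current
```

Because `messages` is rebuilt from scratch on every iteration, anything a later stage needs from an earlier one has to be carried inside `current`.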
3. Structured State Objects¶
Pass a dataclass or Pydantic model between agents instead of raw text.
from dataclasses import dataclass

@dataclass
class HandoffContext:
    customer_query: str
    category: str
    priority: str
    extracted_info: dict
Used in: Handoff pattern, Magentic pattern (TaskLedger)
Trade-offs: Explicit and type-safe. Requires upfront design of the state schema. Most robust for complex workflows.
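One way the receiving agent might consume such an object is to render its fields into a short, labeled user message. This is a sketch under assumptions: `build_specialist_messages` and the message layout are illustrative, and the dataclass is repeated here only so the example is self-contained.

```python
import json
from dataclasses import dataclass


@dataclass
class HandoffContext:
    customer_query: str
    category: str
    priority: str
    extracted_info: dict


def build_specialist_messages(ctx: HandoffContext, specialist_prompt: str) -> list[dict]:
    """Turn the structured handoff into a fresh two-message conversation."""
    details = json.dumps(ctx.extracted_info, sort_keys=True)  # stable, readable rendering
    user_content = (
        f"Category: {ctx.category}\n"
        f"Priority: {ctx.priority}\n"
        f"Extracted info: {details}\n"
        f"Customer query: {ctx.customer_query}"
    )
    return [
        {"role": "system", "content": specialist_prompt},
        {"role": "user", "content": user_content},
    ]


ctx = HandoffContext(
    customer_query="I was charged twice this month.",
    category="billing",
    priority="high",
    extracted_info={"invoice_count": 2},
)
messages = build_specialist_messages(ctx, "You are a billing specialist.")
```

The specialist never sees the triage conversation itself, only the fields the schema chose to carry over.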
4. Task-Specific Context¶
The manager curates what each worker sees — only their specific task, not the full state.
# Worker sees only incident description + their specific task
worker_messages = [
    {"role": "system", "content": worker_prompt},
    {"role": "user", "content": f"Incident: {description}\nYour task: {task.description}"},
]
Used in: Magentic pattern
Trade-offs: Prevents information overload but requires the manager to be a good information curator.
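The curation step could be sketched as a function that builds one isolated context per worker from the manager's task list. The `Task` shape, the `worker` field, and `build_worker_contexts` are assumptions made for illustration, and the sketch assumes at most one task per worker.

```python
from dataclasses import dataclass


@dataclass
class Task:
    worker: str        # hypothetical: name of the worker assigned to this task
    description: str


def build_worker_contexts(incident: str, tasks: list[Task],
                          worker_prompts: dict[str, str]) -> dict[str, list[dict]]:
    """Give each worker the incident plus its own task, and nothing about other tasks."""
    contexts: dict[str, list[dict]] = {}
    for task in tasks:
        contexts[task.worker] = [
            {"role": "system", "content": worker_prompts[task.worker]},
            {"role": "user",
             "content": f"Incident: {incident}\nYour task: {task.description}"},
        ]
    return contexts


contexts = build_worker_contexts(
    "Checkout API returning 500s",
    [Task("logs", "Check error logs"), Task("network", "Check DNS resolution")],
    {"logs": "You analyze logs.", "network": "You debug networking."},
)
```

Each worker's messages contain the shared incident description but none of the other workers' assignments.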
Strategy by Pattern¶
| Pattern | Strategy | What Each Agent Sees |
|---|---|---|
| Single Agent | Full history | Everything (one conversation thread) |
| Sequential | Fresh per stage | Only previous agent's output |
| Concurrent | Independent | Same initial input (no sharing) |
| Group Chat | Shared accumulating | Full conversation (all agents' messages) |
| Handoff | Structured object | HandoffContext dataclass |
| Magentic | Task-specific | Incident context + specific task only |
Compaction: When History Gets Too Long¶
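When a conversation approaches your token budget (for example, 75% of the context window), you can compact the history by summarizing the older messages and keeping the recent ones verbatim: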
def compact_history(messages: list[dict], client, model: str) -> list[dict]:
    """Summarize older messages to save tokens."""
    # Keep the system prompt and the most recent messages verbatim
    system = messages[0]
    recent = messages[-6:]  # Last 3 exchanges
    # Summarize everything in between
    middle = messages[1:-6]
    if not middle:
        return messages  # Nothing old enough to compact
    # Label each message with its role and separate messages with newlines
    # so the summarizer can tell who said what
    summary_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in middle if m.get("content")
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize this conversation concisely."},
            {"role": "user", "content": summary_text},
        ],
        max_tokens=200,
    )
    summary = response.choices[0].message.content
    return [system, {"role": "assistant", "content": f"[Summary of earlier conversation]: {summary}"}] + recent
Compaction is a trade-off
Summarization loses detail. Use it when you're approaching token limits, not proactively. For short exercises in this workshop, you won't need it.
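The sliding-window option named in the diagram at the top of this page is a cheaper alternative that needs no extra model call: keep the system prompt and drop everything except the most recent messages. The following is a minimal sketch; `sliding_window` and its `max_recent` default are illustrative assumptions.

```python
def sliding_window(messages: list[dict], max_recent: int = 6) -> list[dict]:
    """Keep the leading system message (if any) plus the last `max_recent` messages."""
    if not messages:
        return []
    if messages[0].get("role") == "system":
        head, rest = messages[:1], messages[1:]
    else:
        head, rest = [], messages
    # Guard against max_recent <= 0, since rest[-0:] would return everything
    recent = rest[-max_recent:] if max_recent > 0 else []
    return head + recent
```

Unlike summarization, older messages are discarded outright, so this suits conversations where only the latest exchanges matter.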
Token Budgeting¶
A practical approach to token management:
- Reserve tokens for the system prompt (~500-1000 tokens)
- Reserve tokens for the model's response (max_tokens parameter)
- Budget remaining tokens for conversation history
- Monitor usage via response.usage.prompt_tokens
- Compact when approaching 75% of context window
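The budgeting steps above can be sketched as a small helper. The `TokenBudget` class and its default numbers are assumptions chosen to match the figures on this page (a 128K window, a reserve of about 1,000 tokens for the system prompt, and a 75% threshold); adjust them for your model.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    context_window: int = 128_000     # model limit, e.g. GPT-4o
    system_reserve: int = 1_000       # room kept for the system prompt
    response_reserve: int = 1_000     # should match the max_tokens you request
    compact_threshold: float = 0.75   # fraction of the window that triggers compaction

    @property
    def history_budget(self) -> int:
        """Tokens left for conversation history after the two reserves."""
        return self.context_window - self.system_reserve - self.response_reserve

    def should_compact(self, prompt_tokens: int) -> bool:
        """True once the last call's input reached the threshold share of the window."""
        return prompt_tokens >= self.compact_threshold * self.context_window


budget = TokenBudget()
print(budget.history_budget)          # 126000
print(budget.should_compact(96_000))  # True
```

In practice you would pass `response.usage.prompt_tokens` from the most recent call to `should_compact` and run compaction before the next call when it returns `True`.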
Key Takeaways¶
- Context management determines agent quality — too little context and agents forget, too much and they get confused
- Choose the strategy that matches your pattern (fresh per stage, shared, structured, task-specific)
- Structured state objects (dataclasses) are the most robust inter-agent communication mechanism
- Compaction (summarization) is available when history gets too long
- Monitor token usage — it affects both cost and reliability