Context Engineering for AI Agents: The Complete Developer's Guide
Context engineering has emerged as the defining skill for building production-ready AI agents in 2026. While prompt engineering focused on crafting the right instructions, context engineering tackles a more fundamental challenge: dynamically curating what information goes into an LLM's context window at each step of an agent's execution.
This guide covers everything you need to know about context engineering—from foundational concepts to advanced implementation patterns—with practical code examples and real-world architecture decisions.
What is Context Engineering?
Context engineering is the art and science of filling an LLM's context window with precisely the right information at each step of an agent's trajectory. As Andrej Karpathy put it: "The term 'prompt engineering' focused on the art of providing the right instructions. Context engineering puts more focus on filling the context window with the most relevant information, wherever that information may come from."
The distinction matters because modern AI agents operate over multiple inference turns across extended time horizons. Each turn generates new data that could be relevant for the next decision. Context engineering is about cyclically refining what gets passed to the model from that constantly evolving universe of possible information.
The Evolution from Prompt Engineering
In the early days of LLM engineering, prompting was the primary focus. Most use cases required prompts optimized for one-shot classification or text generation tasks. The work centered on how to write effective system prompts.
But agents are different. An agent running in a loop generates progressively more data—tool outputs, retrieved documents, conversation history, intermediate reasoning. Context engineering addresses the question: which of these tokens should make it into the next inference call?
Consider a coding agent navigating a large codebase. A prompt engineer might craft instructions like "analyze this code carefully." A context engineer designs systems that:
- Index the codebase for efficient retrieval
- Maintain lightweight file references rather than full contents
- Progressively load relevant files as the agent explores
- Summarize or compress older context to make room for new information
- Decide when to retrieve versus when to explore autonomously
This shift from static prompts to dynamic context curation is what separates toy demos from production agents.
Why Context Engineering Matters
The Attention Budget Problem
LLMs have finite attention. As context length increases, a model's ability to capture pairwise relationships between tokens gets stretched thin. Research on "context rot" shows that as tokens accumulate, the model's ability to accurately recall information from that context decreases.
This isn't a bug—it's an architectural reality. The transformer architecture enables every token to attend to every other token, creating n² pairwise relationships. This quadratic scaling means attention becomes a scarce resource that must be carefully allocated.
Anthropic's research frames this as an "attention budget." Every new token depletes this budget, increasing the need to curate carefully. The practical implication: you can't just dump everything into context and hope for the best.
Context Window Limits
Modern LLMs have impressive context windows—Claude supports 200K tokens, GPT-4 Turbo handles 128K—but these limits still constrain agent architectures. A single large codebase can easily exceed them, and long-running conversations accumulate tokens quickly.
Effective context engineering treats the context window as a finite resource with diminishing returns. The goal is finding the minimal set of high-signal tokens that maximize the probability of desired outcomes.
The Quality-Quantity Tradeoff
More context isn't always better. Studies show that LLMs perform worse when given irrelevant context, even if the total token count is within limits. The model's attention gets diluted across irrelevant information.
This creates a quality-quantity tradeoff: you want enough context to inform the agent's decisions, but not so much that important signals get lost in noise.
Components of Effective Context
Context engineering involves curating multiple sources of information. Let's examine each component and best practices for managing it.
System Prompts and Instructions
System prompts set the behavioral foundation for agents. The key is finding the "Goldilocks zone"—specific enough to guide behavior, flexible enough to handle variation.
Common failure modes:
- Over-specification: Hardcoding complex if-else logic that creates brittleness
- Under-specification: Vague instructions that assume shared context the model doesn't have
Best practices:
- Organize prompts into distinct sections using XML tags or Markdown headers
- Start minimal with the best model and add instructions based on observed failures
- Write for the task altitude—detailed enough to be actionable, general enough to be robust
<system>
<background>
You are a code review agent for Python repositories.
You have access to the codebase via file reading tools.
</background>
<instructions>
1. Start by understanding the PR scope from the diff
2. Load relevant test files and documentation
3. Check for security issues, performance problems, and maintainability
4. Provide specific, actionable feedback
</instructions>
<output_format>
Structure your review as:
- Summary (2-3 sentences)
- Critical Issues (blocking)
- Suggestions (non-blocking improvements)
- Questions (clarifications needed)
</output_format>
</system>
Tools and Their Definitions
Tools define the contract between agents and their information/action space. Tool descriptions become part of the context and directly affect agent behavior.
Best practices:
- Keep tool sets minimal—if a human engineer can't definitively say which tool applies, the agent won't do better
- Make tools self-contained and robust to errors
- Use descriptive parameter names that play to model strengths
from typing import Annotated

def search_codebase(
    query: Annotated[str, "Natural language description of what to find"],
    file_pattern: Annotated[str, "Glob pattern to filter files, e.g. '*.py'"] = "*",
    max_results: Annotated[int, "Maximum files to return"] = 10,
) -> str:
    """
    Search the codebase for files matching a semantic query.

    Use this when you need to find code related to a concept, function,
    or pattern. Returns file paths with relevant snippets.

    Examples:
    - "authentication logic" → finds auth-related files
    - "database connection handling" → finds DB connection code
    """
    # implementation
Conversation History and Memory
For multi-turn interactions, conversation history provides crucial context. But raw history grows quickly and often contains redundant information.
Strategies for managing conversation context:
- Sliding window: Keep only the last N messages
- Summarization: Compress older messages into summaries
- Selective retrieval: Index history and retrieve relevant portions
- Tiered storage: Recent messages in full, older in summarized form
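A minimal sketch combining the sliding-window and tiered-storage strategies above (the `summarize` hook is a placeholder for what would be an LLM call in practice):

```python
def tiered_history(messages: list, keep_recent: int = 5, summarize=None) -> list:
    """Recent messages stay verbatim; older ones collapse into one summary message.

    `summarize` is a placeholder hook — swap in an LLM-backed summarizer.
    """
    if summarize is None:
        summarize = lambda msgs: f"[{len(msgs)} earlier messages summarized]"
    if len(messages) <= keep_recent:
        return list(messages)
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    # One synthetic message carries the compressed tail of the conversation
    return [{"role": "system", "content": summarize(older)}] + recent
```

The same shape extends naturally to selective retrieval: instead of one summary, index `older` and retrieve from it per query.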
Retrieved Information (RAG Context)
Retrieval-augmented generation remains essential for grounding agents in external knowledge. But retrieval strategies need to evolve for agentic use cases.
Pre-retrieval (traditional RAG):
- Compute embeddings upfront
- Retrieve relevant chunks before inference
- Fast but can miss relevant context
Just-in-time retrieval (agentic RAG):
- Agent holds lightweight references (file paths, URLs, queries)
- Loads data into context as needed during execution
- Slower but more targeted
The hybrid approach often works best: pre-retrieve essential context, let the agent explore further as needed.
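One sketch of that hybrid: pre-retrieved essentials travel in full, while everything else is carried as a lightweight reference the agent can expand just-in-time. The class and `loader` callable here are illustrative, not a specific library's API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class HybridContext:
    essentials: list                                 # chunks retrieved before the first call
    references: dict = field(default_factory=dict)   # ref_id -> locator (path, URL, query)
    _expanded: dict = field(default_factory=dict)    # ref_id -> content loaded on demand

    def expand(self, ref_id: str, loader: Callable[[str], str]) -> str:
        """Just-in-time retrieval: resolve a reference into full content, once."""
        if ref_id not in self._expanded:
            self._expanded[ref_id] = loader(self.references[ref_id])
        return self._expanded[ref_id]

    def render(self) -> str:
        """Context string: essentials always present, references only if expanded."""
        return "\n\n".join(list(self.essentials) + list(self._expanded.values()))
```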
User Context and Personalization
For applications serving individual users, personal context dramatically improves relevance. This includes:
- User preferences and settings
- Historical interactions and feedback
- Demographic and behavioral patterns
- Current session state
This is where a personal context layer becomes essential. Rather than rebuilding user modeling for every application, developers can rely on infrastructure that maintains and retrieves user context across sessions.
Example with Dytto's Context API:
import requests
# Retrieve current user context
response = requests.get(
"https://dytto.app/api/context",
headers={"Authorization": f"Bearer {api_key}"}
)
context = response.json()
# Inject relevant context into system prompt
user_context = f"""
<user_context>
Name: {context['user']['name']}
Preferences: {context['preferences']}
Recent topics: {context['recent_topics']}
Communication style: {context['style_preferences']}
</user_context>
"""
system_prompt = base_instructions + user_context
The key insight: user context should be first-class in your context engineering strategy, not an afterthought. Tools like Dytto provide APIs specifically designed for injecting personal context into AI applications.
Context Engineering Patterns
Let's examine specific patterns that work in production agent architectures.
Pattern 1: Progressive Disclosure
Rather than loading everything upfront, let agents discover relevant context through exploration. Each interaction yields signals that inform the next decision.
Example: Codebase navigation
from pathlib import Path

class CodebaseAgent:
    def __init__(self, repo_path: str):
        self.repo_path = repo_path
        # Only store file paths, not contents
        self.file_index = self._build_file_index()
        self.loaded_files = {}

    def _build_file_index(self) -> dict:
        """Build lightweight index of file paths and metadata."""
        index = {}
        for path in Path(self.repo_path).rglob("*.py"):
            relative = path.relative_to(self.repo_path)
            index[str(relative)] = {
                "size": path.stat().st_size,
                "modified": path.stat().st_mtime,
            }
        return index

    def load_file(self, path: str) -> str:
        """Load file contents into context on demand."""
        if path not in self.loaded_files:
            full_path = Path(self.repo_path) / path
            self.loaded_files[path] = full_path.read_text()
        return self.loaded_files[path]

    def get_context_for_llm(self) -> str:
        """Generate context string with loaded files."""
        context_parts = ["<codebase_context>"]
        context_parts.append(f"Available files: {len(self.file_index)}")
        # Only include loaded files
        for path, content in self.loaded_files.items():
            context_parts.append(f"\n<file path='{path}'>\n{content}\n</file>")
        context_parts.append("</codebase_context>")
        return "\n".join(context_parts)
Pattern 2: Context Compression
When context grows too large, compress older or less relevant portions rather than discarding them entirely.
def compress_conversation_history(messages: list, llm_client) -> str:
    """Compress older messages while preserving recent ones."""
    if len(messages) <= 10:
        return format_messages(messages)

    # Keep last 5 messages in full
    recent = messages[-5:]
    older = messages[:-5]

    # Summarize older messages
    summary_prompt = f"""
Summarize the key points from this conversation history:

{format_messages(older)}

Focus on: decisions made, information gathered, tasks completed.
Keep it under 200 words.
"""
    summary = llm_client.complete(summary_prompt)

    return f"""
<conversation_summary>
{summary}
</conversation_summary>

<recent_messages>
{format_messages(recent)}
</recent_messages>
"""
Pattern 3: Tiered Context Architecture
Design explicit tiers for different types of context with different update frequencies and retrieval strategies.
class TieredContextManager:
    def __init__(self):
        self.tiers = {
            "system": {  # Static, set once
                "instructions": None,
                "tool_definitions": None,
            },
            "session": {  # Changes per session
                "user_context": None,
                "session_goals": None,
            },
            "working": {  # Changes frequently
                "recent_messages": [],
                "tool_outputs": [],
                "retrieved_docs": [],
            },
            "reference": {  # Retrieval-based
                "knowledge_base": None,
                "codebase_index": None,
            },
        }

    def build_context(self, query: str) -> str:
        """Assemble context from all tiers."""
        parts = []

        # System tier (always included)
        parts.append(self.tiers["system"]["instructions"])
        parts.append(self.format_tools(self.tiers["system"]["tool_definitions"]))

        # Session tier (always included)
        if self.tiers["session"]["user_context"]:
            parts.append(self.tiers["session"]["user_context"])

        # Working tier (last N items)
        working = self.tiers["working"]
        parts.append(self.format_messages(working["recent_messages"][-10:]))
        parts.append(self.format_tool_outputs(working["tool_outputs"][-5:]))

        # Reference tier (retrieved based on query)
        relevant_docs = self.retrieve_relevant(query)
        parts.append(self.format_documents(relevant_docs))

        return "\n\n".join(filter(None, parts))
Pattern 4: Context State Machine
For complex agents, model context management as a state machine with explicit transitions.
from enum import Enum
from dataclasses import dataclass

class ContextState(Enum):
    EXPLORATION = "exploration"  # Broad context, many references
    FOCUSED = "focused"          # Narrow context, deep content
    EXECUTION = "execution"      # Minimal context, action-focused
    REVIEW = "review"            # Summary context, verification

@dataclass
class ContextConfig:
    max_tokens: int
    include_history: bool
    include_references: bool
    compression_level: str  # "none", "light", "aggressive"

CONTEXT_CONFIGS = {
    ContextState.EXPLORATION: ContextConfig(
        max_tokens=50000,
        include_history=True,
        include_references=True,
        compression_level="none",
    ),
    ContextState.FOCUSED: ContextConfig(
        max_tokens=100000,
        include_history=False,
        include_references=False,
        compression_level="none",
    ),
    ContextState.EXECUTION: ContextConfig(
        max_tokens=20000,
        include_history=False,
        include_references=False,
        compression_level="aggressive",
    ),
    ContextState.REVIEW: ContextConfig(
        max_tokens=30000,
        include_history=True,
        include_references=False,
        compression_level="light",
    ),
}
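To show how a config actually shapes a prompt, here is a standalone sketch (it re-declares a minimal `ContextConfig` so it runs on its own) that drops whole sections per the active state, then truncates to budget. The chars/4 token estimate is a deliberate simplification; use a real tokenizer in production:

```python
from dataclasses import dataclass

@dataclass
class ContextConfig:
    max_tokens: int
    include_history: bool
    include_references: bool

def assemble(parts: dict, config: ContextConfig) -> str:
    """Keep only the sections the active state allows, then enforce the budget."""
    sections = [parts.get("instructions", "")]
    if config.include_history:
        sections.append(parts.get("history", ""))
    if config.include_references:
        sections.append(parts.get("references", ""))
    text = "\n\n".join(s for s in sections if s)
    budget_chars = config.max_tokens * 4  # crude ~4 chars/token heuristic
    return text[:budget_chars]
```

State transitions then reduce to swapping configs: the same `parts` dict produces a lean prompt in execution mode and a fuller one in review mode.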
Measuring Context Engineering Effectiveness
How do you know if your context engineering is working? Here are key metrics to track.
Operational Metrics
- Output variance: Lower variance in output quality across runs indicates stable context
- Rule adherence: Track how often agents follow specified constraints
- Human intervention rate: Fewer corrections needed = better context
- Token efficiency: Desired outcomes achieved with fewer tokens
Quality Metrics
- Task completion rate: Percentage of tasks completed successfully
- First-attempt success: How often the agent succeeds without retries
- Context utilization: Whether retrieved documents are actually used in responses
- Relevance scores: User ratings of response relevance
Implementation
from datetime import datetime, timedelta

class ContextMetrics:
    def __init__(self):
        self.runs = []

    def log_run(self, run_data: dict):
        self.runs.append({
            "timestamp": datetime.now(),
            "context_tokens": run_data["context_tokens"],
            "output_tokens": run_data["output_tokens"],
            "task_completed": run_data["task_completed"],
            "human_intervention": run_data["human_intervention"],
            "retrieved_docs_used": run_data["docs_used"] / run_data["docs_retrieved"],
        })

    def get_summary(self, window_days: int = 7) -> dict:
        recent = [r for r in self.runs
                  if r["timestamp"] > datetime.now() - timedelta(days=window_days)]
        if not recent:  # avoid division by zero on an empty window
            return {}
        return {
            "completion_rate": sum(r["task_completed"] for r in recent) / len(recent),
            "avg_context_tokens": sum(r["context_tokens"] for r in recent) / len(recent),
            "intervention_rate": sum(r["human_intervention"] for r in recent) / len(recent),
            "doc_utilization": sum(r["retrieved_docs_used"] for r in recent) / len(recent),
        }
Common Pitfalls and How to Avoid Them
Pitfall 1: Context Stuffing
Problem: Dumping everything into context assuming more information is always better.
Solution: Treat context as a scarce resource. Implement explicit budgeting:
def budget_context(components: list, max_tokens: int) -> list:
    """Prioritize context components within a token budget."""
    # Priority order (highest first)
    priority = ["instructions", "user_context", "recent_messages",
                "tool_outputs", "retrieved_docs"]

    budget_remaining = max_tokens
    included = []

    for component_type in priority:
        for component in components:
            if component["type"] == component_type:
                if component["tokens"] <= budget_remaining:
                    included.append(component)
                    budget_remaining -= component["tokens"]

    return included
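`budget_context` assumes each component arrives with a token count already attached. Absent a real tokenizer, a rough chars/4 heuristic is a common placeholder (the helper names here are my own; swap in your model's actual tokenizer when precision matters):

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

def annotate_tokens(components: list) -> list:
    """Attach the token estimate that budget_context expects to each component."""
    for c in components:
        c["tokens"] = estimate_tokens(c["text"])
    return components
```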
Pitfall 2: Stale Context
Problem: Context that was relevant earlier becomes outdated as the task evolves.
Solution: Implement context expiration and refresh mechanisms:
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ContextItem:
    content: str
    created_at: datetime
    relevance_score: float
    ttl_seconds: int = 3600  # 1 hour default

def filter_stale_context(items: list[ContextItem]) -> list[ContextItem]:
    now = datetime.now()
    return [
        item for item in items
        # total_seconds(), not .seconds, which wraps around at one day
        if (now - item.created_at).total_seconds() < item.ttl_seconds
    ]
Pitfall 3: Ignoring User Context
Problem: Building agents that treat every user interaction as stateless.
Solution: Integrate personal context management into your architecture from the start. This is where tools like Dytto shine—they handle the complexity of maintaining, updating, and retrieving user context so you can focus on your core agent logic.
import os
from datetime import datetime

from dytto import DyttoClient

# Initialize once per application
dytto = DyttoClient(api_key=os.environ["DYTTO_API_KEY"])

async def handle_user_request(user_id: str, message: str):
    # Fetch current user context
    user_context = await dytto.get_context(user_id)

    # Build prompt with user context
    prompt = build_prompt(
        system_instructions=SYSTEM_PROMPT,
        user_context=user_context,
        message=message,
    )

    # Get response
    response = await llm.complete(prompt)

    # Update context with new information
    await dytto.update_context(
        user_id=user_id,
        interaction={
            "message": message,
            "response": response,
            "timestamp": datetime.now().isoformat(),
        },
    )

    return response
Pitfall 4: Over-Engineering Retrieval
Problem: Building complex retrieval pipelines when simple approaches would work.
Solution: Start simple and add complexity only when needed. Claude Code's approach is instructive: drop CLAUDE.md files directly into context upfront, then use grep and glob for exploration. No embeddings required.
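A sketch of that simple end of the spectrum: plain regex search over files, roughly what `grep -rn` gives you. The helper name and defaults here are illustrative, not Claude Code's actual implementation:

```python
import re
from pathlib import Path

def grep_repo(root: str, pattern: str, glob: str = "*.py", max_hits: int = 20) -> list:
    """Plain-text search over a repo: no embeddings, no index to maintain."""
    rx = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob(glob)):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if rx.search(line):
                # grep -n style output: path:lineno: matching line
                hits.append(f"{path}:{lineno}: {line.strip()}")
                if len(hits) >= max_hits:
                    return hits
    return hits
```

Exposed as a tool, this lets the agent iterate on queries itself, which is often more robust than a one-shot embedding lookup.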
Building Context-Aware Agents with Dytto
Dytto provides infrastructure specifically designed for context engineering challenges. Here's how to integrate it into your agent architecture.
Setting Up User Context
import os

from dytto import DyttoClient, ContextSchema

# Define what context you want to track
schema = ContextSchema(
    track_preferences=True,
    track_history=True,
    track_patterns=True,
    custom_fields={
        "projects": "list",
        "expertise_areas": "list",
        "communication_style": "string",
    },
)

dytto = DyttoClient(
    api_key=os.environ["DYTTO_API_KEY"],
    schema=schema,
)

# Store new context about a user (inside an async function)
await dytto.store_context(
    user_id="user_123",
    context={
        "projects": ["mobile-app-redesign", "api-migration"],
        "expertise_areas": ["Python", "React", "system-design"],
        "communication_style": "concise, technical",
    },
)
Retrieving Context for Agent Use
async def build_agent_context(user_id: str, task: str) -> str:
    # Get full user context
    user_data = await dytto.get_context(user_id)

    # Get task-relevant context
    relevant_history = await dytto.search_context(
        user_id=user_id,
        query=task,
        max_results=5,
    )

    return f"""
<user_context>
Name: {user_data.name}
Expertise: {', '.join(user_data.expertise_areas)}
Style: {user_data.communication_style}

Recent relevant interactions:
{format_history(relevant_history)}
</user_context>
"""
Automatic Context Updates
# Dytto can automatically extract and store context from conversations
await dytto.observe(
    user_id="user_123",
    interaction={
        "role": "user",
        "content": "I prefer TypeScript over JavaScript for new projects",
    },
)

# Later retrieval will include this preference
context = await dytto.get_context("user_123")
# context.preferences includes {"languages": {"typescript": "preferred"}}
Workflow Engineering: The Bigger Picture
Context engineering doesn't exist in isolation. It's part of a broader discipline: workflow engineering. While context engineering optimizes what goes into each LLM call, workflow engineering designs the sequence of calls and non-LLM steps needed to complete complex work.
Effective workflows:
- Define explicit step sequences: Map the progression of tasks
- Control context strategically: Decide when to use LLM vs. deterministic logic
- Ensure reliability: Build in validation and error handling
- Optimize for outcomes: Create specialized workflows for specific results
From a context engineering perspective, workflows prevent context overload. Instead of cramming everything into a single call, you break complex tasks into focused steps, each with its own optimized context window.
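As an illustration (the step shapes and `llm` callable are assumptions, not a specific framework's API), a workflow can be as simple as a list of named prompt builders, where each builder sees only the original task and the previous step's output rather than the full accumulated history:

```python
def run_workflow(task: str, llm, steps: list) -> str:
    """Run a fixed sequence of focused LLM calls.

    Each call gets a small, purpose-built prompt instead of the entire
    transcript, keeping every context window lean.
    """
    carry = task
    for name, build_prompt in steps:
        carry = llm(build_prompt(task, carry))
    return carry

steps = [
    ("plan",  lambda task, prev: f"Plan the work: {task}"),
    ("draft", lambda task, prev: f"Draft using this plan: {prev}"),
]
```

Deterministic steps (validation, formatting, database writes) slot into the same list as plain functions, no LLM call required.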
Real-World Case Study: Building a Context-Aware Customer Support Agent
Let's walk through a concrete example of applying context engineering principles to a production customer support agent.
The Challenge
A SaaS company wanted to build an AI agent that could handle tier-1 support tickets. Requirements included:
- Access to product documentation (500+ pages)
- Knowledge of each customer's account status and history
- Understanding of common issues and their resolutions
- Ability to escalate appropriately
Initial Approach (What Didn't Work)
The first attempt used a naive RAG approach: embed all documentation, retrieve top-k chunks for each query, stuff everything into context.
Problems emerged quickly:
- Context pollution: Irrelevant documentation chunks diluted attention from actual customer issues
- Missing personalization: The agent treated every customer identically, missing account-specific context
- No conversation continuity: Each message was processed independently, losing thread context
- Inconsistent escalation: Without historical patterns, the agent couldn't learn when to escalate
The Context Engineering Solution
The team redesigned using tiered context architecture:
Tier 1 - Always Present (System Context)
- Core agent instructions and persona
- Tool definitions for account lookup, ticket creation, escalation
- Output format requirements
Tier 2 - Per-Session (User Context via Dytto)
- Customer account status (plan, tenure, recent tickets)
- Interaction history patterns (communication style, common issues)
- Sentiment trends and escalation history
Tier 3 - Per-Message (Dynamic Retrieval)
- Relevant documentation chunks (semantic search on current query)
- Similar resolved tickets from knowledge base
- Current conversation thread (compressed if long)
Implementation Details
class SupportAgent:
    def __init__(self):
        self.dytto = DyttoClient(api_key=DYTTO_API_KEY)
        self.doc_retriever = DocumentRetriever(index_path="./support_docs")
        self.ticket_retriever = TicketRetriever(connection=db_conn)

    async def handle_message(self, customer_id: str, message: str, thread: list):
        # Tier 2: User context (cached, refreshed every 5 min)
        user_context = await self.dytto.get_context(customer_id)

        # Tier 3: Dynamic retrieval
        relevant_docs = self.doc_retriever.search(message, top_k=3)
        similar_tickets = self.ticket_retriever.search(message, top_k=2)

        # Compress thread if over 10 messages
        thread_context = self.format_thread(thread, compress_after=10)

        # Build prompt with budget awareness
        prompt = self.build_prompt(
            user_context=user_context,
            docs=relevant_docs,
            tickets=similar_tickets,
            thread=thread_context,
            current_message=message,
            max_tokens=50000,
        )

        response = await self.llm.complete(prompt)

        # Update context with this interaction
        await self.dytto.observe(
            user_id=customer_id,
            interaction={"message": message, "response": response},
        )

        return response
Results
After implementing proper context engineering:
- Resolution rate: Increased from 45% to 78% (agent could solve more issues without escalation)
- Customer satisfaction: +23% improvement in post-chat ratings
- Context utilization: Retrieved documents were used in 89% of responses (vs. 34% before)
- Escalation accuracy: False escalations dropped by 67%
The key insight: the same model (Claude) performed dramatically differently with thoughtful context engineering versus naive context stuffing.
Advanced Techniques: Context Scheduling
For long-running agents, context scheduling becomes important. Not all context needs to be present at all times—some can be loaded on-demand, some can be preemptively cached.
Lazy Loading Patterns
from typing import Callable

class LazyContextLoader:
    """Load context only when explicitly requested by the agent."""

    def __init__(self):
        self.loaded = {}
        self.references = {}

    def register(self, key: str, loader: Callable[[], str]):
        """Register a lazy loader for a context type."""
        self.references[key] = loader

    def get(self, key: str) -> str:
        """Load and cache context on first access."""
        if key not in self.loaded:
            if key not in self.references:
                raise KeyError(f"Unknown context key: {key}")
            self.loaded[key] = self.references[key]()
        return self.loaded[key]

    def invalidate(self, key: str):
        """Force reload on next access."""
        self.loaded.pop(key, None)

# Usage
loader = LazyContextLoader()
loader.register("user_profile", lambda: fetch_user_profile(user_id))
loader.register("account_history", lambda: fetch_account_history(user_id))
loader.register("product_docs", lambda: fetch_relevant_docs(query))

# In tool definition exposed to agent
def load_context(context_type: str) -> str:
    """Load additional context into the conversation."""
    return loader.get(context_type)
Preemptive Caching
For predictable access patterns, preemptively load context that will likely be needed:
async def preload_context(user_id: str, task_type: str):
    """Preload context based on task type predictions."""
    cache = ContextCache(ttl_seconds=300)

    # Always preload user context
    cache.set(f"{user_id}:profile", await fetch_user_profile(user_id))

    # Task-specific preloading
    if task_type == "support":
        cache.set(f"{user_id}:tickets", await fetch_recent_tickets(user_id))
        cache.set(f"{user_id}:account", await fetch_account_status(user_id))
    elif task_type == "coding":
        cache.set(f"{user_id}:repos", await fetch_repo_structure(user_id))
        cache.set(f"{user_id}:recent_files", await fetch_recent_edits(user_id))

    return cache
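`ContextCache` above is not a library class; a minimal in-process stand-in with TTL expiry might look like this (production systems would typically reach for Redis or similar instead):

```python
import time

class ContextCache:
    """Minimal TTL cache with the set/get interface used above."""

    def __init__(self, ttl_seconds: int = 300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def set(self, key: str, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key: str, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires = entry
        if time.monotonic() > expires:
            del self._store[key]  # lazy eviction on read
            return default
        return value
```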
Conclusion: The Future of Context Engineering
Context engineering is becoming the critical infrastructure challenge for AI agents in 2026. As models become more capable, the bottleneck shifts from model intelligence to context curation.
Key takeaways:
- Treat context as a finite resource: Every token has a cost against your attention budget
- Design for dynamic retrieval: Let agents explore and load context just-in-time
- Invest in user context: Personal context dramatically improves agent effectiveness
- Start simple: Sophisticated retrieval systems aren't always necessary
- Measure and iterate: Track operational metrics to improve context strategies
The best agents aren't just using the smartest models—they're the ones with the most thoughtful context engineering. Tools like Dytto make user context management tractable, letting you focus on the unique challenges of your application while standing on solid context infrastructure.
Ready to build context-aware agents? Explore Dytto's API documentation to see how personal context layers can enhance your agent architecture.
This guide is part of Dytto's series on building production AI agents. For more technical deep-dives, check out our articles on AI Memory for Agents and Persistent Memory for LLMs.