Persistent User Context for LLMs: The Complete Developer Guide to Building Stateful AI Applications
Every time you start a conversation with an LLM, it has no idea who you are. The model that helped you refactor authentication code yesterday doesn't remember that conversation today. The assistant that learned your coding style last week starts fresh this morning, asking the same clarifying questions all over again.
This is the fundamental challenge of building AI applications with large language models: they're stateless by design. Every API call exists in isolation. No memory of previous interactions persists unless you explicitly engineer it.
In this comprehensive guide, we'll explore how to implement persistent user context for LLMs—the architectures, patterns, and practical code that allow AI applications to remember users across sessions, learn from interactions over time, and deliver genuinely personalized experiences. Whether you're building a coding assistant, customer support agent, or personal AI companion, understanding persistent context is essential for creating applications that feel intelligent rather than amnesiac.
Why LLMs Don't Remember: The Stateless Foundation
Before diving into solutions, let's understand why this problem exists in the first place.
Large language models are designed as stateless functions. Given an input prompt, they produce an output. There's no internal state that carries over between requests. This design choice makes scaling straightforward—any server in a cluster can handle any request without coordination—but it creates a fundamental disconnect between user expectations and technical reality.
Consider what happens without persistent context:
# Session 1 - Monday
response = llm.chat("I'm a Python developer working on a Django project")
# LLM acknowledges your context
# Session 2 - Tuesday
response = llm.chat("What's the best way to structure my models?")
# LLM has no idea you're working with Django or that you prefer Python
# Gives generic advice that could apply to any framework
Users intuitively expect AI assistants to remember them. When they say "remember, I prefer TypeScript" or "I told you I'm vegetarian," they expect that information to persist. Without explicit engineering, it doesn't.
The challenge breaks down into several distinct problems:
Session continuity: How do you maintain context within a single conversation that might span multiple API calls?
Cross-session memory: How do you remember information from previous conversations that happened days or weeks ago?
Selective retrieval: How do you surface only the relevant context for each interaction without overwhelming the model's context window?
Information updates: How do you handle changes when a user says "actually, I moved to Berlin" after previously telling you they live in New York?
The Three Layers of Context Persistence
Production systems that solve persistent context typically implement three distinct layers, each handling different temporal scopes:
Layer 1: Short-Term Memory (Session Context)
Short-term memory handles context within a single conversation session. This is the simplest layer—you're essentially passing the conversation history as part of each API call.
class SessionMemory:
    def __init__(self):
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    def get_context(self) -> list:
        return self.messages

    def chat(self, user_message: str) -> str:
        self.add_message("user", user_message)
        response = llm.chat(
            messages=self.messages,
            system="You are a helpful assistant."
        )
        self.add_message("assistant", response)
        return response
Many chat frameworks handle this bookkeeping for you (raw completion APIs do not—you must resend the history yourself). The challenge emerges when conversations grow long enough to exceed context window limits, requiring truncation or summarization strategies:
def manage_context_window(messages: list, max_tokens: int = 4000) -> list:
    """Keep conversation within token budget."""
    total_tokens = sum(count_tokens(m["content"]) for m in messages)
    if total_tokens <= max_tokens:
        return messages
    # Strategy 1: Sliding window - keep recent messages
    while total_tokens > max_tokens and len(messages) > 2:
        removed = messages.pop(1)  # Keep system message, remove oldest
        total_tokens -= count_tokens(removed["content"])
    return messages
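The snippets above lean on a `count_tokens` helper that the guide doesn't define. A minimal stand-in, assuming a rough four-characters-per-token heuristic; production code would use a real tokenizer (for OpenAI models, tiktoken's `encoding_for_model`):

```python
def count_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.

    This heuristic is an assumption for illustration only; swap in a
    real tokenizer for accurate context-window budgeting.
    """
    return max(1, len(text) // 4)
```

The exact ratio varies by model and language, but a cheap estimate like this is often good enough for deciding when to truncate.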
A more sophisticated approach uses summarization to compress older context:
async def summarize_and_compress(messages: list, threshold: int = 3000) -> list:
    """Summarize older messages when approaching context limit."""
    total_tokens = sum(count_tokens(m["content"]) for m in messages)
    if total_tokens < threshold:
        return messages
    # Find messages to summarize (older half of conversation)
    midpoint = len(messages) // 2
    to_summarize = messages[1:midpoint]  # Skip system message
    summary = await llm.summarize(
        content="\n".join(m["content"] for m in to_summarize),
        instruction="Summarize the key points and decisions from this conversation"
    )
    # Replace old messages with summary
    return [
        messages[0],  # System message
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *messages[midpoint:]  # Recent messages
    ]
Layer 2: Long-Term Memory (Persistent Facts)
Long-term memory stores information that persists across sessions—user preferences, learned facts, historical decisions. This requires external storage and a retrieval mechanism.
The architecture typically involves:
- Fact extraction: Analyzing conversations to identify information worth storing
- Structured storage: Persisting facts with metadata (user ID, timestamp, category, importance)
- Retrieval: Finding relevant facts when building context for new conversations
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
import uuid

@dataclass
class Memory:
    id: str
    user_id: str
    content: str
    category: str  # preference, fact, decision, etc.
    importance: float  # 0.0 to 1.0
    created_at: datetime
    supersedes: Optional[str] = None  # ID of memory this replaces

class LongTermMemory:
    def __init__(self, storage, embedding_model):
        self.storage = storage
        self.embedding_model = embedding_model

    async def extract_and_store(self, user_id: str, conversation: list):
        """Extract facts from conversation and store as memories."""
        # Use LLM to extract facts. Note the doubled braces: the literal JSON
        # example must be escaped so str.format doesn't treat it as a placeholder.
        extraction_prompt = """
        Analyze this conversation and extract any facts about the user that should be remembered.
        Return JSON with format: [{{"fact": "...", "category": "preference|fact|decision", "importance": 0.0-1.0}}]

        Conversation:
        {conversation}
        """
        facts = await llm.extract(
            prompt=extraction_prompt.format(
                conversation=format_conversation(conversation)
            )
        )
        for fact in facts:
            memory = Memory(
                id=str(uuid.uuid4()),
                user_id=user_id,
                content=fact["fact"],
                category=fact["category"],
                importance=fact["importance"],
                created_at=datetime.utcnow()
            )
            # Check for conflicts with existing memories
            conflicts = await self.find_conflicts(user_id, memory)
            if conflicts:
                # Mark old memory as superseded
                memory.supersedes = conflicts[0].id
                await self.storage.mark_superseded(conflicts[0].id)
            embedding = self.embedding_model.encode(memory.content)
            await self.storage.insert(memory, embedding)
    async def retrieve(self, user_id: str, query: str, limit: int = 10) -> list:
        """Retrieve relevant memories for a query."""
        query_embedding = self.embedding_model.encode(query)
        # Retrieve candidates by semantic similarity
        candidates = await self.storage.search(
            user_id=user_id,
            embedding=query_embedding,
            limit=limit * 2  # Over-fetch for re-ranking
        )
        # Re-rank with recency and importance
        scored = []
        now = datetime.utcnow()
        for memory in candidates:
            age_days = (now - memory.created_at).days
            recency_score = 1.0 / (1.0 + age_days / 30)  # Decay over ~30 days
            final_score = (
                0.4 * memory.similarity +
                0.35 * recency_score +
                0.25 * memory.importance
            )
            scored.append((memory, final_score))
        scored.sort(key=lambda x: x[1], reverse=True)
        return [m for m, _ in scored[:limit]]
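The extraction step above calls a `format_conversation` helper that isn't defined in the guide. A minimal version, assuming OpenAI-style message dicts with `role` and `content` keys:

```python
def format_conversation(messages: list) -> str:
    """Render OpenAI-style message dicts as plain text for the extraction prompt."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)
```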
Layer 3: Working Memory (Task State)
Working memory tracks intermediate state during complex, multi-step tasks. Unlike session context (which is conversation history) or long-term memory (which is persistent facts), working memory is task-specific scratch space.
from datetime import datetime, timedelta  # timedelta wasn't imported above

@dataclass
class WorkingMemory:
    task_id: str
    user_id: str
    state: dict
    created_at: datetime
    expires_at: datetime

class TaskStateManager:
    def __init__(self, storage):
        self.storage = storage

    async def create_task(self, user_id: str, task_type: str, initial_state: dict) -> str:
        """Initialize working memory for a new task."""
        task_id = str(uuid.uuid4())
        working_memory = WorkingMemory(
            task_id=task_id,
            user_id=user_id,
            state={
                "type": task_type,
                "status": "in_progress",
                **initial_state
            },
            created_at=datetime.utcnow(),
            expires_at=datetime.utcnow() + timedelta(hours=24)
        )
        await self.storage.save(working_memory)
        return task_id

    async def update_state(self, task_id: str, updates: dict):
        """Update working memory state."""
        memory = await self.storage.get(task_id)
        memory.state.update(updates)
        await self.storage.save(memory)

    async def get_active_tasks(self, user_id: str) -> list:
        """Get all active tasks for a user."""
        return await self.storage.query(
            user_id=user_id,
            status="in_progress",
            not_expired=True
        )
Storage Backends for Persistent Context
Choosing the right storage backend depends on your scale, latency requirements, and existing infrastructure.
PostgreSQL with pgvector
For teams already running PostgreSQL, pgvector provides vector similarity search without adding new infrastructure:
from sqlalchemy import create_engine, Column, String, Float, DateTime, JSON
from sqlalchemy.orm import declarative_base, Session
from pgvector.sqlalchemy import Vector

Base = declarative_base()

class MemoryRecord(Base):
    __tablename__ = "memories"

    id = Column(String, primary_key=True)
    user_id = Column(String, index=True)
    content = Column(String)
    category = Column(String)
    importance = Column(Float)
    embedding = Column(Vector(1536))  # OpenAI ada-002 dimension
    created_at = Column(DateTime)
    superseded_by = Column(String, nullable=True)
    # "metadata" is reserved on SQLAlchemy declarative classes, so map the
    # attribute under a different name while keeping the column name.
    meta = Column("metadata", JSON, nullable=True)

class PostgresMemoryStore:
    def __init__(self, connection_string: str):
        self.engine = create_engine(connection_string)
        Base.metadata.create_all(self.engine)

    def search(self, user_id: str, embedding: list, limit: int = 10) -> list:
        with Session(self.engine) as session:
            results = session.query(MemoryRecord)\
                .filter(MemoryRecord.user_id == user_id)\
                .filter(MemoryRecord.superseded_by.is_(None))\
                .order_by(MemoryRecord.embedding.cosine_distance(embedding))\
                .limit(limit)\
                .all()
            return results
Redis for High-Throughput Workloads
Redis offers sub-millisecond latency, and the RediSearch module (bundled with Redis Stack) adds vector similarity search:
import numpy as np
import redis
from redis.commands.search.query import Query
from redis.commands.search.field import TagField, TextField, NumericField, VectorField

class RedisMemoryStore:
    def __init__(self, redis_url: str):
        self.client = redis.from_url(redis_url)
        self._create_index()

    def _create_index(self):
        try:
            self.client.ft("memory_idx").create_index([
                TagField("user_id"),  # TAG field supports exact-match filtering
                TextField("content"),
                TextField("category"),
                NumericField("importance"),
                NumericField("created_at"),
                VectorField(
                    "embedding",
                    "FLAT",
                    {"TYPE": "FLOAT32", "DIM": 1536, "DISTANCE_METRIC": "COSINE"}
                )
            ])
        except redis.exceptions.ResponseError:
            pass  # Index already exists

    def insert(self, memory_id: str, data: dict, embedding: list):
        key = f"memory:{memory_id}"
        self.client.hset(key, mapping={
            **data,
            "embedding": np.array(embedding, dtype=np.float32).tobytes()
        })

    def search(self, user_id: str, embedding: list, limit: int = 10) -> list:
        query_vector = np.array(embedding, dtype=np.float32).tobytes()
        # TAG syntax: @user_id:{value} (braces tripled to survive the f-string)
        q = Query(f"@user_id:{{{user_id}}}=>[KNN {limit} @embedding $vec AS score]")\
            .return_fields("content", "category", "importance", "score")\
            .sort_by("score")\
            .dialect(2)
        results = self.client.ft("memory_idx").search(
            q, query_params={"vec": query_vector}
        )
        return results.docs
LangChain Checkpointers
If you're already using LangChain or LangGraph, their checkpointer abstractions provide plug-and-play persistence:
from langchain_core.messages import HumanMessage
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import StateGraph

# Configure persistent checkpointing
checkpointer = PostgresSaver.from_conn_string(
    "postgresql://user:pass@localhost/db"
)

# Build graph with persistence enabled
graph = StateGraph(AgentState)
graph.add_node("agent", agent_node)
graph.add_node("tools", tool_node)
# ... configure edges
app = graph.compile(checkpointer=checkpointer)

# Each invocation persists state
result = app.invoke(
    {"messages": [HumanMessage(content="Hello")]},
    config={"configurable": {"thread_id": "user-123-session-456"}}
)

# Later, resume the same thread
result = app.invoke(
    {"messages": [HumanMessage(content="Continue from before")]},
    config={"configurable": {"thread_id": "user-123-session-456"}}
)
Context Injection Patterns
Once you've stored persistent context, you need to inject it effectively into your LLM prompts. Here are the primary patterns:
System Prompt Injection
The simplest approach: include user context in the system prompt.
def build_system_prompt(user_context: dict) -> str:
    base_prompt = """You are a helpful AI assistant. Use the context below to personalize your responses."""
    context_sections = []
    if user_context.get("preferences"):
        context_sections.append(
            "User Preferences:\n" +
            "\n".join(f"- {p}" for p in user_context["preferences"])
        )
    if user_context.get("facts"):
        context_sections.append(
            "Known Facts About User:\n" +
            "\n".join(f"- {f}" for f in user_context["facts"])
        )
    if user_context.get("recent_topics"):
        context_sections.append(
            "Recent Conversation Topics:\n" +
            "\n".join(f"- {t}" for t in user_context["recent_topics"])
        )
    if context_sections:
        return base_prompt + "\n\n" + "\n\n".join(context_sections)
    return base_prompt
Retrieval-Augmented Context
For systems with extensive context, retrieve only what's relevant to the current query:
async def build_contextual_prompt(
    user_id: str,
    query: str,
    memory_store: LongTermMemory,
    token_budget: int = 1500
) -> str:
    # Retrieve relevant memories
    memories = await memory_store.retrieve(user_id, query, limit=20)
    # Build context within token budget
    context_parts = []
    tokens_used = 0
    for memory in memories:
        memory_tokens = count_tokens(memory.content)
        if tokens_used + memory_tokens > token_budget:
            break
        context_parts.append(f"- {memory.content}")
        tokens_used += memory_tokens
    if not context_parts:
        return ""
    return "Relevant context about this user:\n" + "\n".join(context_parts)
User Profile Objects
Structure context as a well-defined profile that gets injected consistently:
from dataclasses import dataclass, field  # field is needed for default_factory

@dataclass
class UserProfile:
    user_id: str
    name: Optional[str] = None
    preferences: list = field(default_factory=list)
    facts: list = field(default_factory=list)
    recent_interactions: list = field(default_factory=list)

    def to_context_string(self) -> str:
        parts = []
        if self.name:
            parts.append(f"User: {self.name}")
        if self.preferences:
            parts.append("Preferences: " + ", ".join(self.preferences))
        if self.facts:
            parts.append("Known facts: " + "; ".join(self.facts))
        return "\n".join(parts)

async def get_user_profile(user_id: str, memory_store) -> UserProfile:
    """Construct user profile from stored memories."""
    all_memories = await memory_store.get_all(user_id)
    profile = UserProfile(user_id=user_id)
    for memory in all_memories:
        if memory.category == "preference":
            profile.preferences.append(memory.content)
        elif memory.category == "fact":
            profile.facts.append(memory.content)
        elif memory.category == "identity" and "name" in memory.content.lower():
            # Extract name from identity facts
            profile.name = extract_name(memory.content)
    return profile
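`extract_name` above is left undefined. A minimal placeholder, assuming identity facts follow simple phrasings like "my name is Alice"; a production system would instead extract the name as a structured field at fact-extraction time:

```python
import re
from typing import Optional

def extract_name(content: str) -> Optional[str]:
    """Pull a name out of phrasings like "my name is Alice" or "call me Bob".

    Purely illustrative: the regex and the supported phrasings are
    assumptions, not an exhaustive name extractor.
    """
    match = re.search(
        r"(?:name is|i am|call me)\s+([A-Z][a-z]+)",
        content,
        re.IGNORECASE,
    )
    return match.group(1) if match else None
```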
Handling Context Updates and Conflicts
Real users change their minds. They move cities, switch jobs, update preferences. A robust context system must handle updates gracefully.
Supersession Pattern
When new information conflicts with old, mark the old memory as superseded:
async def update_memory(
    user_id: str,
    new_fact: str,
    category: str,
    memory_store: LongTermMemory
):
    # Find potentially conflicting memories
    existing = await memory_store.search_by_category(
        user_id=user_id,
        category=category,
        query=new_fact,
        limit=5
    )
    # Use LLM to check for conflicts
    for existing_memory in existing:
        conflict_check = await llm.check_conflict(
            fact_1=existing_memory.content,
            fact_2=new_fact
        )
        if conflict_check.is_conflict:
            # Mark old memory as superseded
            new_memory = Memory(
                id=str(uuid.uuid4()),
                user_id=user_id,
                content=new_fact,
                category=category,
                importance=max(0.7, existing_memory.importance),
                created_at=datetime.utcnow(),
                supersedes=existing_memory.id
            )
            await memory_store.mark_superseded(existing_memory.id)
            await memory_store.insert(new_memory)
            return
    # No conflict - just add new memory
    await memory_store.insert(Memory(
        id=str(uuid.uuid4()),
        user_id=user_id,
        content=new_fact,
        category=category,
        importance=0.5,
        created_at=datetime.utcnow()
    ))
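The `llm.check_conflict` call above is pseudocode. One way to back it: ask the model for a JSON verdict and parse it into the structure `update_memory` expects. The prompt wording and the `ConflictCheck` shape here are illustrative assumptions, and the actual LLM call is left to whatever client you use:

```python
import json
from dataclasses import dataclass

@dataclass
class ConflictCheck:
    is_conflict: bool
    reason: str

def build_conflict_prompt(fact_1: str, fact_2: str) -> str:
    """Prompt asking the model whether two stored facts contradict each other."""
    return (
        "Do these two facts about the same user contradict each other?\n"
        f"Fact 1: {fact_1}\n"
        f"Fact 2: {fact_2}\n"
        'Answer with JSON only: {"is_conflict": true|false, "reason": "..."}'
    )

def parse_conflict_response(raw: str) -> ConflictCheck:
    """Parse the model's JSON reply into the object update_memory consumes."""
    data = json.loads(raw)
    return ConflictCheck(
        is_conflict=bool(data["is_conflict"]),
        reason=data.get("reason", ""),
    )
```

In practice you would also guard against malformed JSON from the model (retry, or fall back to treating the pair as non-conflicting).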
Temporal Versioning
For applications requiring audit trails, maintain full history with timestamps:
class VersionedMemoryStore:
    # Note: this store uses a key/value-versioned record, not the
    # embedding-based Memory dataclass from earlier sections.
    async def update(self, user_id: str, key: str, value: str):
        """Update a memory while preserving history."""
        current = await self.get_current(user_id, key)
        if current:
            # Archive current version
            await self.archive(current)
        # Insert new version
        await self.insert(Memory(
            user_id=user_id,
            key=key,
            value=value,
            version=current.version + 1 if current else 1,
            created_at=datetime.utcnow(),
            is_current=True
        ))

    async def get_history(self, user_id: str, key: str) -> list:
        """Get all versions of a memory."""
        return await self.query(
            user_id=user_id,
            key=key,
            order_by="version DESC"
        )
Building with Dytto: A Production Context Layer
Implementing persistent context from scratch requires significant engineering: storage infrastructure, embedding pipelines, conflict resolution, privacy controls. Dytto provides this as a managed service, letting developers add persistent personalization with a few API calls.
Core Integration
import dytto

client = dytto.Client(api_key="your_api_key")

# Store context from any conversation
client.observe(
    user_id="user_123",
    content="I'm a backend engineer who primarily uses Python and Go. Currently working on a microservices migration at a fintech startup."
)

# Retrieve relevant context for any query
context = client.get_context(
    user_id="user_123",
    query="What testing framework should I use?"
)
# Returns: user's language preferences (Python, Go), work context (microservices, fintech)
Automatic Fact Extraction
Dytto automatically extracts and structures facts from raw conversation content:
# You observe raw conversation
client.observe(
    user_id="user_123",
    content="Actually, I just switched from Python to Rust for the performance-critical services. Still using Python for the API layer though."
)
# Dytto extracts structured facts:
# - Uses Rust for performance-critical services
# - Uses Python for API layer
# - Previously used Python more extensively (superseded)
Conflict Resolution
When users update information, Dytto handles supersession automatically:
# First observation
client.observe(user_id="user_123", content="I live in San Francisco")
# Later observation
client.observe(user_id="user_123", content="I moved to Austin last month")
# Context retrieval returns Austin, not San Francisco
context = client.get_context(user_id="user_123", query="local recommendations")
Privacy Controls
Users can view and delete their stored context:
# Export all stored context
export = client.export_context(user_id="user_123")
# Delete specific memories
client.delete_context(user_id="user_123", category="location")
# Delete all user data
client.delete_user(user_id="user_123")
Performance Optimization
Persistent context adds latency and cost. Here's how to optimize:
Aggressive Caching
Cache context retrievals for short periods—user context rarely changes mid-conversation:
import time

class CachedContextStore:
    def __init__(self, memory_store, ttl_seconds=300):
        self.memory_store = memory_store
        self.ttl = ttl_seconds
        self.cache = {}

    async def get_context(self, user_id: str, query: str) -> list:
        cache_key = f"{user_id}:{hash(query)}"
        if cache_key in self.cache:
            cached, timestamp = self.cache[cache_key]
            if time.time() - timestamp < self.ttl:
                return cached
        result = await self.memory_store.retrieve(user_id, query)
        self.cache[cache_key] = (result, time.time())
        return result
Batch Memory Operations
Don't write to memory on every message—batch and process asynchronously:
import asyncio
import time

class AsyncMemoryProcessor:
    def __init__(self, memory_store, batch_size=10, flush_interval=60):
        self.memory_store = memory_store
        self.pending = []
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.last_flush = time.time()

    def queue(self, user_id: str, content: str):
        self.pending.append((user_id, content))
        if len(self.pending) >= self.batch_size:
            asyncio.create_task(self.flush())
        elif time.time() - self.last_flush > self.flush_interval:
            asyncio.create_task(self.flush())

    async def flush(self):
        if not self.pending:
            return
        to_process = self.pending
        self.pending = []
        self.last_flush = time.time()
        # Process batch
        await self.memory_store.batch_extract_and_store(to_process)
Hierarchical Retrieval
For users with extensive history, implement tiered retrieval:
async def hierarchical_retrieve(
    user_id: str,
    query: str,
    memory_store
) -> list:
    # Tier 1: Recent memories (fast, in hot storage)
    recent = await memory_store.search(
        user_id=user_id,
        query=query,
        time_range="last_30_days",
        limit=10
    )
    if len(recent) >= 5 and recent[0].score > 0.8:
        return recent  # Good enough, skip deeper search
    # Tier 2: Full history search (slower, but comprehensive)
    all_results = await memory_store.search(
        user_id=user_id,
        query=query,
        time_range="all_time",
        limit=20
    )
    # Merge, preferring recent if scores are close
    return merge_with_recency_bias(recent, all_results)
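`merge_with_recency_bias` is referenced but never defined. One plausible sketch, assuming each result carries an `id` and a similarity `score`; the `margin` parameter and the `ScoredMemory` shape are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class ScoredMemory:
    id: str
    content: str
    score: float  # similarity score, higher is better

def merge_with_recency_bias(recent, all_results, margin: float = 0.05):
    """Merge two result tiers, deduplicating by id.

    A recent memory wins over its full-history duplicate whenever its
    score comes within `margin` of the competitor, so near-ties resolve
    in favor of the hot tier.
    """
    best = {}
    for mem in all_results:
        best[mem.id] = (mem, mem.score)
    for mem in recent:
        _, existing_score = best.get(mem.id, (None, float("-inf")))
        # Bias: recent entries get a margin boost when competing
        if mem.score + margin >= existing_score:
            best[mem.id] = (mem, mem.score + margin)
    merged = sorted(best.values(), key=lambda pair: pair[1], reverse=True)
    return [mem for mem, _ in merged]
```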
Testing Persistent Context Systems
Memory systems require specific testing strategies:
Extraction Accuracy Tests
Verify that fact extraction captures what matters:
def test_extracts_preferences():
    conversation = [
        {"role": "user", "content": "I prefer dark mode in all my apps"},
        {"role": "assistant", "content": "I'll note that preference."}
    ]
    facts = extract_facts(conversation)
    assert any(
        "dark mode" in f.content.lower() and
        f.category == "preference"
        for f in facts
    )

def test_extracts_technical_context():
    conversation = [
        {"role": "user", "content": "I'm building a FastAPI backend with PostgreSQL"}
    ]
    facts = extract_facts(conversation)
    assert any("FastAPI" in f.content for f in facts)
    assert any("PostgreSQL" in f.content for f in facts)
Retrieval Relevance Tests
Test that the right context surfaces for queries:
async def test_retrieves_relevant_context():
    # Setup
    await memory_store.insert(user_id="test", content="allergic to shellfish")
    await memory_store.insert(user_id="test", content="loves Italian food")
    await memory_store.insert(user_id="test", content="works at Google")
    # Test food query gets food-related context
    results = await memory_store.retrieve("test", "recommend a restaurant")
    contents = [r.content for r in results]
    assert any("shellfish" in c or "Italian" in c for c in contents)
    assert not any("Google" in c for c in contents[:3])  # Work context less relevant
Supersession Tests
Verify that updates replace old information:
async def test_supersession():
    await memory_store.insert(user_id="test", content="lives in NYC")
    await memory_store.insert(user_id="test", content="moved to Austin")
    results = await memory_store.retrieve("test", "local recommendations")
    # Austin should appear, NYC should not
    contents = " ".join(r.content for r in results)
    assert "Austin" in contents
    assert "NYC" not in contents or "moved from NYC" in contents
Isolation Tests
Critical for multi-tenant systems—verify user data doesn't leak:
async def test_user_isolation():
    await memory_store.insert(user_id="alice", content="SSN 123-45-6789")
    await memory_store.insert(user_id="bob", content="prefers morning meetings")
    bob_results = await memory_store.retrieve("bob", "personal information")
    # Bob should NEVER see Alice's SSN
    all_content = " ".join(r.content for r in bob_results)
    assert "123-45-6789" not in all_content
    assert "SSN" not in all_content
Conclusion: From Stateless to Stateful AI
The gap between user expectations and LLM capabilities creates a fundamental product problem. Users expect AI assistants to remember them, learn from interactions, and provide personalized experiences. Out-of-the-box LLMs do none of this.
Persistent user context bridges this gap. By implementing the patterns in this guide—session memory, long-term fact storage, intelligent retrieval, and proper conflict handling—you can build AI applications that genuinely understand their users over time.
The key architectural decisions:
- Layer your memory: Short-term (session), long-term (persistent facts), and working (task state) serve different purposes
- Store facts, not transcripts: Extract and structure information rather than persisting raw conversation logs
- Retrieval is the hard part: Combining semantic similarity, recency, and importance scoring determines whether the right context surfaces
- Handle updates explicitly: Users change—your system needs supersession logic, not just append-only storage
- Partition by user: Multi-tenant isolation isn't optional—it's a security requirement
For teams that want persistent context without building infrastructure from scratch, services like Dytto provide these capabilities as managed APIs. But whether you build or buy, the principles remain the same: intelligent AI applications need memory, and memory needs thoughtful engineering.
The stateless LLM is a foundation. Persistent context is what transforms it into an assistant that actually knows you.
Ready to add persistent context to your AI application? Explore Dytto's context API and see how user-aware AI can transform your product.