
Persistent User Context for LLMs: The Complete Developer Guide to Building Stateful AI Applications

Dytto Team
dytto · ai-memory · llm · context-engineering · developer-guide · personalization

Every time you start a conversation with an LLM, it has no idea who you are. The model that helped you refactor authentication code yesterday doesn't remember that conversation today. The assistant that learned your coding style last week starts fresh this morning, asking the same clarifying questions all over again.

This is the fundamental challenge of building AI applications with large language models: they're stateless by design. Every API call exists in isolation. No memory of previous interactions persists unless you explicitly engineer it.

In this comprehensive guide, we'll explore how to implement persistent user context for LLMs—the architectures, patterns, and practical code that allow AI applications to remember users across sessions, learn from interactions over time, and deliver genuinely personalized experiences. Whether you're building a coding assistant, customer support agent, or personal AI companion, understanding persistent context is essential for creating applications that feel intelligent rather than amnesiac.

Why LLMs Don't Remember: The Stateless Foundation

Before diving into solutions, let's understand why this problem exists in the first place.

Large language models are designed as stateless functions. Given an input prompt, they produce an output. There's no internal state that carries over between requests. This design choice makes scaling straightforward—any server in a cluster can handle any request without coordination—but it creates a fundamental disconnect between user expectations and technical reality.

Consider what happens without persistent context:

# Session 1 - Monday
response = llm.chat("I'm a Python developer working on a Django project")
# LLM acknowledges your context

# Session 2 - Tuesday
response = llm.chat("What's the best way to structure my models?")
# LLM has no idea you're working with Django or that you prefer Python
# Gives generic advice that could apply to any framework

Users intuitively expect AI assistants to remember them. When they say "remember, I prefer TypeScript" or "I told you I'm vegetarian," they expect that information to persist. Without explicit engineering, it doesn't.

The challenge breaks down into several distinct problems:

Session continuity: How do you maintain context within a single conversation that might span multiple API calls?

Cross-session memory: How do you remember information from previous conversations that happened days or weeks ago?

Selective retrieval: How do you surface only the relevant context for each interaction without overwhelming the model's context window?

Information updates: How do you handle changes when a user says "actually, I moved to Berlin" after previously telling you they live in New York?
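Taken together, these problems imply a per-request pipeline: retrieve stored context, inject it into the prompt, call the model, then extract new facts from the exchange. A toy in-memory sketch makes the shape concrete — here `fake_llm` and the keyword-based extractor are hypothetical stand-ins for real model calls:

```python
# Toy end-to-end pipeline: retrieve -> inject -> call model -> extract.
user_memories = {}  # user_id -> list of remembered facts

def retrieve_context(user_id: str) -> str:
    facts = user_memories.get(user_id, [])
    return "\n".join(f"- {f}" for f in facts)

def extract_facts(message: str) -> list:
    # Naive extractor: treat "I'm ..." / "I prefer ..." statements as facts.
    return [message] if message.lower().startswith(("i'm", "i prefer")) else []

def fake_llm(system: str, user: str) -> str:
    # Stand-in for a real completion call; reports how much context it saw.
    return f"(answer informed by {system.count('- ')} stored facts)"

def chat(user_id: str, message: str) -> str:
    context = retrieve_context(user_id)
    system = "You are a helpful assistant.\nKnown about user:\n" + context
    response = fake_llm(system, message)
    for fact in extract_facts(message):
        user_memories.setdefault(user_id, []).append(fact)
    return response

chat("u1", "I'm a Python developer working on a Django project")
reply = chat("u1", "What's the best way to structure my models?")
```

Everything in the rest of this guide is a production-grade version of one of these four steps.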

The Three Layers of Context Persistence

Production systems that solve persistent context typically implement three distinct layers, each handling different temporal scopes:

Layer 1: Short-Term Memory (Session Context)

Short-term memory handles context within a single conversation session. This is the simplest layer—you're essentially passing the conversation history as part of each API call.

class SessionMemory:
    def __init__(self):
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    def get_context(self) -> list:
        return self.messages

    def chat(self, user_message: str) -> str:
        self.add_message("user", user_message)

        response = llm.chat(
            messages=self.messages,
            system="You are a helpful assistant."
        )

        self.add_message("assistant", response)
        return response

Raw LLM APIs expect you to pass the full message history on every call, while higher-level frameworks often manage this list for you. Either way, the challenge emerges when conversations grow long enough to exceed context window limits, requiring truncation or summarization strategies:

def manage_context_window(messages: list, max_tokens: int = 4000) -> list:
    """Keep conversation within token budget."""
    total_tokens = sum(count_tokens(m["content"]) for m in messages)

    if total_tokens <= max_tokens:
        return messages

    # Strategy 1: Sliding window - keep recent messages
    while total_tokens > max_tokens and len(messages) > 2:
        removed = messages.pop(1)  # Keep system message, remove oldest
        total_tokens -= count_tokens(removed["content"])

    return messages
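The `count_tokens` helper above is assumed rather than defined. In production you would use your model's actual tokenizer (e.g. tiktoken for OpenAI models); for budgeting purposes, a dependency-free characters-per-token heuristic is often close enough:

```python
def count_tokens(text: str) -> int:
    """Approximate token count. English prose averages roughly 4 characters
    per token; swap in your model's real tokenizer when accuracy matters."""
    return max(1, len(text) // 4) if text else 0
```

Because the heuristic is approximate, leave headroom in `max_tokens` rather than budgeting right up to the model's limit.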

A more sophisticated approach uses summarization to compress older context:

async def summarize_and_compress(messages: list, threshold: int = 3000) -> list:
    """Summarize older messages when approaching context limit."""
    total_tokens = sum(count_tokens(m["content"]) for m in messages)

    if total_tokens < threshold:
        return messages

    # Find messages to summarize (older half of conversation)
    midpoint = len(messages) // 2
    to_summarize = messages[1:midpoint]  # Skip system message

    summary = await llm.summarize(
        content="\n".join(m["content"] for m in to_summarize),
        instruction="Summarize the key points and decisions from this conversation"
    )

    # Replace old messages with summary
    return [
        messages[0],  # System message
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *messages[midpoint:]  # Recent messages
    ]

Layer 2: Long-Term Memory (Persistent Facts)

Long-term memory stores information that persists across sessions—user preferences, learned facts, historical decisions. This requires external storage and a retrieval mechanism.

The architecture typically involves:

  1. Fact extraction: Analyzing conversations to identify information worth storing
  2. Structured storage: Persisting facts with metadata (user ID, timestamp, category, importance)
  3. Retrieval: Finding relevant facts when building context for new conversations

from dataclasses import dataclass
from datetime import datetime
from typing import Optional
import uuid

@dataclass
class Memory:
    id: str
    user_id: str
    content: str
    category: str  # preference, fact, decision, etc.
    importance: float  # 0.0 to 1.0
    created_at: datetime
    supersedes: Optional[str] = None  # ID of memory this replaces

class LongTermMemory:
    def __init__(self, storage, embedding_model):
        self.storage = storage
        self.embedding_model = embedding_model

    async def extract_and_store(self, user_id: str, conversation: list):
        """Extract facts from conversation and store as memories."""
        # Use the LLM to extract facts. Note the doubled braces around the
        # JSON example: literal braces would otherwise break str.format().
        extraction_prompt = """
        Analyze this conversation and extract any facts about the user that should be remembered.
        Return JSON with format: [{{"fact": "...", "category": "preference|fact|decision", "importance": 0.0-1.0}}]

        Conversation:
        {conversation}
        """

        facts = await llm.extract(
            prompt=extraction_prompt.format(
                conversation=format_conversation(conversation)
            )
        )

        for fact in facts:
            memory = Memory(
                id=str(uuid.uuid4()),
                user_id=user_id,
                content=fact["fact"],
                category=fact["category"],
                importance=fact["importance"],
                created_at=datetime.utcnow()
            )

            # Check for conflicts with existing memories
            conflicts = await self.find_conflicts(user_id, memory)
            if conflicts:
                # Mark old memory as superseded
                memory.supersedes = conflicts[0].id
                await self.storage.mark_superseded(conflicts[0].id)

            embedding = self.embedding_model.encode(memory.content)
            await self.storage.insert(memory, embedding)

    async def retrieve(self, user_id: str, query: str, limit: int = 10) -> list:
        """Retrieve relevant memories for a query."""
        query_embedding = self.embedding_model.encode(query)

        # Retrieve candidates by semantic similarity
        candidates = await self.storage.search(
            user_id=user_id,
            embedding=query_embedding,
            limit=limit * 2  # Over-fetch for re-ranking
        )

        # Re-rank with recency and importance
        scored = []
        now = datetime.utcnow()
        for memory in candidates:
            age_days = (now - memory.created_at).days
            recency_score = 1.0 / (1.0 + age_days / 30)  # Decay over ~30 days

            final_score = (
                0.4 * memory.similarity +
                0.35 * recency_score +
                0.25 * memory.importance
            )
            scored.append((memory, final_score))

        scored.sort(key=lambda x: x[1], reverse=True)
        return [m for m, _ in scored[:limit]]
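The recency term in the re-ranker above halves at 30 days and keeps decaying from there. It helps to see the numbers concretely; this snippet uses the same `1 / (1 + age / 30)` decay and the same 0.4/0.35/0.25 weights:

```python
def recency_score(age_days: float) -> float:
    # Same decay as the re-ranker: 1.0 today, 0.5 at 30 days, 0.25 at 90 days.
    return 1.0 / (1.0 + age_days / 30)

def final_score(similarity: float, age_days: float, importance: float) -> float:
    # Same weighted blend used in LongTermMemory.retrieve above.
    return 0.4 * similarity + 0.35 * recency_score(age_days) + 0.25 * importance

scores = {d: recency_score(d) for d in (0, 30, 90)}
```

The practical effect: a moderately similar memory from yesterday can outrank a highly similar one from six months ago, which is usually what users expect.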

Layer 3: Working Memory (Task State)

Working memory tracks intermediate state during complex, multi-step tasks. Unlike session context (which is conversation history) or long-term memory (which is persistent facts), working memory is task-specific scratch space.

from datetime import datetime, timedelta

@dataclass
class WorkingMemory:
    task_id: str
    user_id: str
    state: dict
    created_at: datetime
    expires_at: datetime

class TaskStateManager:
    def __init__(self, storage):
        self.storage = storage

    async def create_task(self, user_id: str, task_type: str, initial_state: dict) -> str:
        """Initialize working memory for a new task."""
        task_id = str(uuid.uuid4())

        working_memory = WorkingMemory(
            task_id=task_id,
            user_id=user_id,
            state={
                "type": task_type,
                "status": "in_progress",
                **initial_state
            },
            created_at=datetime.utcnow(),
            expires_at=datetime.utcnow() + timedelta(hours=24)
        )

        await self.storage.save(working_memory)
        return task_id

    async def update_state(self, task_id: str, updates: dict):
        """Update working memory state."""
        memory = await self.storage.get(task_id)
        memory.state.update(updates)
        await self.storage.save(memory)

    async def get_active_tasks(self, user_id: str) -> list:
        """Get all active tasks for a user."""
        return await self.storage.query(
            user_id=user_id,
            status="in_progress",
            not_expired=True
        )
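The manager above leaves `storage` abstract. To illustrate the contract it expects, here is a dict-backed stand-in (`DictStorage` is hypothetical) walked through the create → update → query lifecycle:

```python
import asyncio
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class WorkingMemory:
    task_id: str
    user_id: str
    state: dict
    created_at: datetime
    expires_at: datetime

class DictStorage:
    """In-memory stand-in for a real working-memory store."""
    def __init__(self):
        self.tasks = {}

    async def save(self, memory):
        self.tasks[memory.task_id] = memory

    async def get(self, task_id):
        return self.tasks[task_id]

    async def query(self, user_id, status, not_expired):
        now = datetime.utcnow()
        return [
            t for t in self.tasks.values()
            if t.user_id == user_id
            and t.state.get("status") == status
            and (not not_expired or t.expires_at > now)
        ]

async def demo():
    storage = DictStorage()
    # Mirrors create_task -> update_state -> get_active_tasks
    task = WorkingMemory(
        task_id=str(uuid.uuid4()),
        user_id="user_123",
        state={"type": "migration", "status": "in_progress", "step": 1},
        created_at=datetime.utcnow(),
        expires_at=datetime.utcnow() + timedelta(hours=24),
    )
    await storage.save(task)
    current = await storage.get(task.task_id)
    current.state.update({"step": 2})
    await storage.save(current)
    return await storage.query("user_123", "in_progress", True)

active = asyncio.run(demo())
```

A real backend would add the same interface over Redis or Postgres, plus TTL-based expiry so abandoned tasks clean themselves up.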

Storage Backends for Persistent Context

Choosing the right storage backend depends on your scale, latency requirements, and existing infrastructure.

PostgreSQL with pgvector

For teams already running PostgreSQL, pgvector provides vector similarity search without adding new infrastructure:

from sqlalchemy import create_engine, Column, String, Float, DateTime, JSON
from sqlalchemy.orm import declarative_base, Session
from pgvector.sqlalchemy import Vector

Base = declarative_base()

class MemoryRecord(Base):
    __tablename__ = "memories"

    id = Column(String, primary_key=True)
    user_id = Column(String, index=True)
    content = Column(String)
    category = Column(String)
    importance = Column(Float)
    embedding = Column(Vector(1536))  # OpenAI ada-002 dimension
    created_at = Column(DateTime)
    superseded_by = Column(String, nullable=True)
    meta = Column("metadata", JSON, nullable=True)  # "metadata" is reserved on declarative classes

class PostgresMemoryStore:
    def __init__(self, connection_string: str):
        self.engine = create_engine(connection_string)
        Base.metadata.create_all(self.engine)

    def search(self, user_id: str, embedding: list, limit: int = 10) -> list:
        with Session(self.engine) as session:
            results = session.query(MemoryRecord)\
                .filter(MemoryRecord.user_id == user_id)\
                .filter(MemoryRecord.superseded_by.is_(None))\
                .order_by(MemoryRecord.embedding.cosine_distance(embedding))\
                .limit(limit)\
                .all()
            return results

Redis for High-Throughput Workloads

Redis offers sub-millisecond latency, with vector search provided by the RediSearch module (bundled in Redis Stack):

import numpy as np
import redis
from redis.commands.search.query import Query
from redis.commands.search.field import TextField, NumericField, VectorField

class RedisMemoryStore:
    def __init__(self, redis_url: str):
        self.client = redis.from_url(redis_url)
        self._create_index()

    def _create_index(self):
        try:
            self.client.ft("memory_idx").create_index([
                TextField("user_id"),
                TextField("content"),
                TextField("category"),
                NumericField("importance"),
                NumericField("created_at"),
                VectorField(
                    "embedding",
                    "FLAT",
                    {"TYPE": "FLOAT32", "DIM": 1536, "DISTANCE_METRIC": "COSINE"}
                )
            ])
        except redis.exceptions.ResponseError:
            pass  # Index already exists

    def insert(self, memory_id: str, data: dict, embedding: list):
        key = f"memory:{memory_id}"
        self.client.hset(key, mapping={
            **data,
            "embedding": np.array(embedding, dtype=np.float32).tobytes()
        })

    def search(self, user_id: str, embedding: list, limit: int = 10) -> list:
        query_vector = np.array(embedding, dtype=np.float32).tobytes()

        q = Query(f"@user_id:{user_id}=>[KNN {limit} @embedding $vec AS score]")\
            .return_fields("content", "category", "importance", "score")\
            .sort_by("score")\
            .dialect(2)

        results = self.client.ft("memory_idx").search(
            q, query_params={"vec": query_vector}
        )
        return results.docs

LangChain Checkpointers

If you're already using LangChain or LangGraph, their checkpointer abstractions provide plug-and-play persistence:

from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import StateGraph

# Configure persistent checkpointing
checkpointer = PostgresSaver.from_conn_string(
    "postgresql://user:pass@localhost/db"
)

# Build graph with persistence enabled
graph = StateGraph(AgentState)
graph.add_node("agent", agent_node)
graph.add_node("tools", tool_node)
# ... configure edges

app = graph.compile(checkpointer=checkpointer)

# Each invocation persists state
result = app.invoke(
    {"messages": [HumanMessage(content="Hello")]},
    config={"configurable": {"thread_id": "user-123-session-456"}}
)

# Later, resume the same thread
result = app.invoke(
    {"messages": [HumanMessage(content="Continue from before")]},
    config={"configurable": {"thread_id": "user-123-session-456"}}
)

Context Injection Patterns

Once you've stored persistent context, you need to inject it effectively into your LLM prompts. Here are the primary patterns:

System Prompt Injection

The simplest approach: include user context in the system prompt.

def build_system_prompt(user_context: dict) -> str:
    base_prompt = """You are a helpful AI assistant. Use the context below to personalize your responses."""

    context_sections = []

    if user_context.get("preferences"):
        context_sections.append(
            "User Preferences:\n" +
            "\n".join(f"- {p}" for p in user_context["preferences"])
        )

    if user_context.get("facts"):
        context_sections.append(
            "Known Facts About User:\n" +
            "\n".join(f"- {f}" for f in user_context["facts"])
        )

    if user_context.get("recent_topics"):
        context_sections.append(
            "Recent Conversation Topics:\n" +
            "\n".join(f"- {t}" for t in user_context["recent_topics"])
        )

    if context_sections:
        return base_prompt + "\n\n" + "\n\n".join(context_sections)
    return base_prompt

Retrieval-Augmented Context

For systems with extensive context, retrieve only what's relevant to the current query:

async def build_contextual_prompt(
    user_id: str,
    query: str,
    memory_store: LongTermMemory,
    token_budget: int = 1500
) -> str:
    # Retrieve relevant memories
    memories = await memory_store.retrieve(user_id, query, limit=20)

    # Build context within token budget
    context_parts = []
    tokens_used = 0

    for memory in memories:
        memory_tokens = count_tokens(memory.content)
        if tokens_used + memory_tokens > token_budget:
            break
        context_parts.append(f"- {memory.content}")
        tokens_used += memory_tokens

    if not context_parts:
        return ""

    return "Relevant context about this user:\n" + "\n".join(context_parts)

User Profile Objects

Structure context as a well-defined profile that gets injected consistently:

from dataclasses import dataclass, field

@dataclass
class UserProfile:
    user_id: str
    name: Optional[str] = None
    preferences: list = field(default_factory=list)
    facts: list = field(default_factory=list)
    recent_interactions: list = field(default_factory=list)

    def to_context_string(self) -> str:
        parts = []

        if self.name:
            parts.append(f"User: {self.name}")

        if self.preferences:
            parts.append("Preferences: " + ", ".join(self.preferences))

        if self.facts:
            parts.append("Known facts: " + "; ".join(self.facts))

        return "\n".join(parts)

async def get_user_profile(user_id: str, memory_store) -> UserProfile:
    """Construct user profile from stored memories."""
    all_memories = await memory_store.get_all(user_id)

    profile = UserProfile(user_id=user_id)

    for memory in all_memories:
        if memory.category == "preference":
            profile.preferences.append(memory.content)
        elif memory.category == "fact":
            profile.facts.append(memory.content)
        elif memory.category == "identity" and "name" in memory.content.lower():
            # Extract name from identity facts
            profile.name = extract_name(memory.content)

    return profile

Handling Context Updates and Conflicts

Real users change their minds. They move cities, switch jobs, update preferences. A robust context system must handle updates gracefully.

Supersession Pattern

When new information conflicts with old, mark the old memory as superseded:

async def update_memory(
    user_id: str,
    new_fact: str,
    category: str,
    memory_store: LongTermMemory
):
    # Find potentially conflicting memories
    existing = await memory_store.search_by_category(
        user_id=user_id,
        category=category,
        query=new_fact,
        limit=5
    )

    # Use LLM to check for conflicts
    for existing_memory in existing:
        conflict_check = await llm.check_conflict(
            fact_1=existing_memory.content,
            fact_2=new_fact
        )

        if conflict_check.is_conflict:
            # Mark old memory as superseded
            new_memory = Memory(
                id=str(uuid.uuid4()),
                user_id=user_id,
                content=new_fact,
                category=category,
                importance=max(0.7, existing_memory.importance),
                created_at=datetime.utcnow(),
                supersedes=existing_memory.id
            )

            await memory_store.mark_superseded(existing_memory.id)
            await memory_store.insert(new_memory)
            return

    # No conflict - just add new memory
    await memory_store.insert(Memory(
        id=str(uuid.uuid4()),
        user_id=user_id,
        content=new_fact,
        category=category,
        importance=0.5,
        created_at=datetime.utcnow()
    ))

Temporal Versioning

For applications requiring audit trails, maintain full history with timestamps:

class VersionedMemoryStore:
    async def update(self, user_id: str, key: str, value: str):
        """Update a memory while preserving history."""
        current = await self.get_current(user_id, key)

        if current:
            # Archive current version
            await self.archive(current)

        # Insert new version
        await self.insert(Memory(
            user_id=user_id,
            key=key,
            value=value,
            version=current.version + 1 if current else 1,
            created_at=datetime.utcnow(),
            is_current=True
        ))

    async def get_history(self, user_id: str, key: str) -> list:
        """Get all versions of a memory."""
        return await self.query(
            user_id=user_id,
            key=key,
            order_by="version DESC"
        )

Building with Dytto: A Production Context Layer

Implementing persistent context from scratch requires significant engineering: storage infrastructure, embedding pipelines, conflict resolution, privacy controls. Dytto provides this as a managed service, letting developers add persistent personalization with a few API calls.

Core Integration

import dytto

client = dytto.Client(api_key="your_api_key")

# Store context from any conversation
client.observe(
    user_id="user_123",
    content="I'm a backend engineer who primarily uses Python and Go. Currently working on a microservices migration at a fintech startup."
)

# Retrieve relevant context for any query
context = client.get_context(
    user_id="user_123",
    query="What testing framework should I use?"
)
# Returns: user's language preferences (Python, Go), work context (microservices, fintech)

Automatic Fact Extraction

Dytto automatically extracts and structures facts from raw conversation content:

# You observe raw conversation
client.observe(
    user_id="user_123",
    content="Actually, I just switched from Python to Rust for the performance-critical services. Still using Python for the API layer though."
)

# Dytto extracts structured facts:
# - Uses Rust for performance-critical services
# - Uses Python for API layer
# - Previously used Python more extensively (superseded)

Conflict Resolution

When users update information, Dytto handles supersession automatically:

# First observation
client.observe(user_id="user_123", content="I live in San Francisco")

# Later observation
client.observe(user_id="user_123", content="I moved to Austin last month")

# Context retrieval returns Austin, not San Francisco
context = client.get_context(user_id="user_123", query="local recommendations")

Privacy Controls

Users can view and delete their stored context:

# Export all stored context
export = client.export_context(user_id="user_123")

# Delete specific memories
client.delete_context(user_id="user_123", category="location")

# Delete all user data
client.delete_user(user_id="user_123")

Performance Optimization

Persistent context adds latency and cost. Here's how to optimize:

Aggressive Caching

Cache context retrievals for short periods—user context rarely changes mid-conversation:

import time

class CachedContextStore:
    def __init__(self, memory_store, ttl_seconds=300):
        self.memory_store = memory_store
        self.ttl = ttl_seconds
        self.cache = {}

    async def get_context(self, user_id: str, query: str) -> list:
        cache_key = f"{user_id}:{hash(query)}"

        if cache_key in self.cache:
            cached, timestamp = self.cache[cache_key]
            if time.time() - timestamp < self.ttl:
                return cached

        result = await self.memory_store.retrieve(user_id, query)
        self.cache[cache_key] = (result, time.time())
        return result

Batch Memory Operations

Don't write to memory on every message—batch and process asynchronously:

import asyncio

class AsyncMemoryProcessor:
    def __init__(self, memory_store, batch_size=10, flush_interval=60):
        self.memory_store = memory_store
        self.pending = []
        self.batch_size = batch_size
        self.flush_interval = flush_interval
        self.last_flush = time.time()

    def queue(self, user_id: str, content: str):
        self.pending.append((user_id, content))

        if len(self.pending) >= self.batch_size:
            asyncio.create_task(self.flush())
        elif time.time() - self.last_flush > self.flush_interval:
            asyncio.create_task(self.flush())

    async def flush(self):
        if not self.pending:
            return

        to_process = self.pending
        self.pending = []
        self.last_flush = time.time()

        # Process batch
        await self.memory_store.batch_extract_and_store(to_process)

Hierarchical Retrieval

For users with extensive history, implement tiered retrieval:

async def hierarchical_retrieve(
    user_id: str,
    query: str,
    memory_store
) -> list:
    # Tier 1: Recent memories (fast, in hot storage)
    recent = await memory_store.search(
        user_id=user_id,
        query=query,
        time_range="last_30_days",
        limit=10
    )

    if len(recent) >= 5 and recent[0].score > 0.8:
        return recent  # Good enough, skip deeper search

    # Tier 2: Full history search (slower, but comprehensive)
    all_results = await memory_store.search(
        user_id=user_id,
        query=query,
        time_range="all_time",
        limit=20
    )

    # Merge, preferring recent if scores are close
    return merge_with_recency_bias(recent, all_results)
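`merge_with_recency_bias` is referenced above but left undefined. One reasonable implementation deduplicates by memory ID and nudges the score of recent hits before re-sorting — the dict-shaped results and the 0.05 bonus here are illustrative assumptions:

```python
def merge_with_recency_bias(recent: list, all_results: list, bonus: float = 0.05) -> list:
    """Merge two result sets, deduplicating by id and boosting recent hits."""
    merged = {}
    for item in all_results:
        merged[item["id"]] = dict(item)
    for item in recent:
        boosted = dict(item)
        boosted["score"] = item["score"] + bonus  # prefer recent when scores are close
        merged[item["id"]] = boosted  # recent copy wins the dedup
    return sorted(merged.values(), key=lambda m: m["score"], reverse=True)

recent = [{"id": "a", "score": 0.78}]
all_results = [{"id": "a", "score": 0.78}, {"id": "b", "score": 0.80}]
ranked = merge_with_recency_bias(recent, all_results)
```

With the bonus applied, the recent memory `a` (0.83) outranks the slightly more similar but older `b` (0.80) — exactly the tie-breaking behavior the tiered search wants.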

Testing Persistent Context Systems

Memory systems require specific testing strategies:

Extraction Accuracy Tests

Verify that fact extraction captures what matters:

def test_extracts_preferences():
    conversation = [
        {"role": "user", "content": "I prefer dark mode in all my apps"},
        {"role": "assistant", "content": "I'll note that preference."}
    ]

    facts = extract_facts(conversation)

    assert any(
        "dark mode" in f.content.lower() and
        f.category == "preference"
        for f in facts
    )

def test_extracts_technical_context():
    conversation = [
        {"role": "user", "content": "I'm building a FastAPI backend with PostgreSQL"}
    ]

    facts = extract_facts(conversation)

    assert any("FastAPI" in f.content for f in facts)
    assert any("PostgreSQL" in f.content for f in facts)

Retrieval Relevance Tests

Test that the right context surfaces for queries:

async def test_retrieves_relevant_context():
    # Setup
    await memory_store.insert(user_id="test", content="allergic to shellfish")
    await memory_store.insert(user_id="test", content="loves Italian food")
    await memory_store.insert(user_id="test", content="works at Google")

    # Test food query gets food-related context
    results = await memory_store.retrieve("test", "recommend a restaurant")

    contents = [r.content for r in results]
    assert any("shellfish" in c or "Italian" in c for c in contents)
    assert not any("Google" in c for c in contents[:3])  # Work context less relevant

Supersession Tests

Verify that updates replace old information:

async def test_supersession():
    await memory_store.insert(user_id="test", content="lives in NYC")
    await memory_store.insert(user_id="test", content="moved to Austin")

    results = await memory_store.retrieve("test", "local recommendations")

    # Austin should appear, NYC should not
    contents = " ".join(r.content for r in results)
    assert "Austin" in contents
    assert "NYC" not in contents or "moved from NYC" in contents

Isolation Tests

Critical for multi-tenant systems—verify user data doesn't leak:

async def test_user_isolation():
    await memory_store.insert(user_id="alice", content="SSN 123-45-6789")
    await memory_store.insert(user_id="bob", content="prefers morning meetings")

    bob_results = await memory_store.retrieve("bob", "personal information")

    # Bob should NEVER see Alice's SSN
    all_content = " ".join(r.content for r in bob_results)
    assert "123-45-6789" not in all_content
    assert "SSN" not in all_content

Conclusion: From Stateless to Stateful AI

The gap between user expectations and LLM capabilities creates a fundamental product problem. Users expect AI assistants to remember them, learn from interactions, and provide personalized experiences. Out-of-the-box LLMs do none of this.

Persistent user context bridges this gap. By implementing the patterns in this guide—session memory, long-term fact storage, intelligent retrieval, and proper conflict handling—you can build AI applications that genuinely understand their users over time.

The key architectural decisions:

  1. Layer your memory: Short-term (session), long-term (persistent facts), and working (task state) serve different purposes

  2. Store facts, not transcripts: Extract and structure information rather than persisting raw conversation logs

  3. Retrieval is the hard part: Combining semantic similarity, recency, and importance scoring determines whether the right context surfaces

  4. Handle updates explicitly: Users change—your system needs supersession logic, not just append-only storage

  5. Partition by user: Multi-tenant isolation isn't optional—it's a security requirement

For teams that want persistent context without building infrastructure from scratch, services like Dytto provide these capabilities as managed APIs. But whether you build or buy, the principles remain the same: intelligent AI applications need memory, and memory needs thoughtful engineering.

The stateless LLM is a foundation. Persistent context is what transforms it into an assistant that actually knows you.


Ready to add persistent context to your AI application? Explore Dytto's context API and see how user-aware AI can transform your product.
