
How to Build a Chatbot with Long Term Memory: The Complete Developer Guide

Dytto Team
Tags: dytto, chatbot memory, long-term memory, AI agents, vector database, RAG, LangChain, conversational AI


Most chatbots suffer from digital amnesia. Ask them your name in one session, tell them about your project, share your preferences—then start a new conversation and watch them forget everything. This isn't just frustrating for users; it fundamentally limits what conversational AI can accomplish.

The difference between a chatbot and a true AI assistant often comes down to memory. A chatbot with long term memory can maintain context across days, weeks, or months of interaction. It remembers that you prefer concise answers, that you're working on a React project, that you hate when AI says "Great question!" before answering.

This guide covers everything you need to build chatbots with persistent memory: the underlying architectures, implementation patterns, and production considerations that separate toy demos from systems users actually want to use.

The Memory Problem in Conversational AI

Before diving into solutions, let's understand why this is hard.

Large language models (LLMs) have no inherent memory. Each API call is stateless—the model receives a prompt, generates a response, and immediately forgets everything. What feels like "memory" in tools like ChatGPT is actually the conversation history being passed with each request.

This creates two fundamental constraints:

Context window limits: Every model has a maximum number of tokens it can process. GPT-4 Turbo handles 128K tokens, Claude models manage 200K, and Gemini 1.5 Pro pushes 1M. Even these large windows eventually fill up: once the conversation history exceeds the limit, older messages get truncated or the request fails entirely.

Session boundaries: When a user closes their browser, the conversation history disappears. Starting a new chat means starting from zero, even if the user has been interacting with your bot for months.

These aren't just technical inconveniences. They prevent chatbots from:

  • Learning user preferences over time
  • Maintaining context for ongoing projects or tasks
  • Building genuine rapport through accumulated interaction history
  • Handling complex, multi-session workflows

Memory Types: What Your Chatbot Can Remember

Not all memory serves the same purpose. Understanding the different types helps you architect systems that remember the right things at the right times.

Short-Term Memory (Session Context)

This is the conversation history within a single session. It's what most chatbots implement by default—the rolling window of recent messages passed to the LLM with each request.

Implementation: Store messages in an array, truncate when approaching context limits, pass with each API call.

Limitations: Lost when session ends, grows linearly with conversation length, no semantic prioritization.
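The three steps above fit in a short class. This is a minimal sketch; the four-characters-per-token estimate is a rough stand-in for a real tokenizer such as tiktoken:

```python
class SessionMemory:
    """Rolling window of recent messages, truncated to a token budget."""

    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens
        self.messages = []  # [{"role": ..., "content": ...}]

    def _estimate_tokens(self, text: str) -> int:
        # Rough heuristic: ~4 characters per token; use a real
        # tokenizer (e.g. tiktoken) in production
        return len(text) // 4

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Truncate: drop the oldest messages until the window fits again
        while (sum(self._estimate_tokens(m["content"]) for m in self.messages)
               > self.max_tokens and len(self.messages) > 1):
            self.messages.pop(0)

    def as_prompt(self) -> list:
        # Passed with each API call; lost entirely when the session ends
        return list(self.messages)
```

Everything here disappears with the session object, which is exactly the limitation long-term memory addresses.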

Working Memory (Active Context)

Working memory holds information the chatbot is actively using, even if it wasn't mentioned recently. Think of it like a human's mental workspace—the things you're thinking about right now, regardless of when you last heard them.

Example: User mentions they're debugging a NullPointerException in message 5. In message 47, they ask "why isn't it working?" Working memory connects "it" to the debugging context, even though 42 messages have passed.

Implementation: Summarization of conversation threads, topic tracking, entity extraction that persists key facts.
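One lightweight way to approximate this is to persist extracted entities outside the rolling window so later vague references can be grounded. The regex-based extraction below is a toy stand-in for proper LLM-based entity extraction:

```python
import re

class WorkingMemory:
    """Tracks active entities so vague references can be grounded later."""

    # Toy pattern: CamelCase names like NullPointerException
    ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:[A-Z][a-z]+)+\b")

    def __init__(self):
        self.active_entities = []

    def observe(self, message: str):
        # Record any entities mentioned, keeping most recent last
        for entity in self.ENTITY_PATTERN.findall(message):
            if entity not in self.active_entities:
                self.active_entities.append(entity)

    def ground(self, message: str) -> str:
        # If the message uses a bare pronoun, attach the active context
        if re.search(r"\b(it|this|that)\b", message, re.IGNORECASE) and self.active_entities:
            return f"{message} (likely referring to: {self.active_entities[-1]})"
        return message
```

A production system would replace both the extraction and the pronoun heuristic with LLM calls, but the shape is the same: key facts persist past the recent-message window.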

Long-Term Memory (Persistent Storage)

Long-term memory survives across sessions. This is where you store:

  • User preferences ("prefers concise answers")
  • Biographical facts ("works at Stripe, on the payments team")
  • Interaction patterns ("usually asks follow-up questions about error handling")
  • Project context ("building a mobile app in Flutter")

Implementation: External database (vector store, key-value, relational), retrieval system, injection into prompts.

Episodic Memory (Conversation History)

Episodic memory stores specific interactions as distinct events. Unlike semantic long-term memory (which distills facts), episodic memory preserves the narrative—when something happened, what the conversation flow was, the full context of a particular exchange.

Example: "Last Tuesday, you asked me to explain Kubernetes networking, and I walked you through Service types, then you asked about LoadBalancer vs NodePort."

Implementation: Structured logging of conversations with timestamps, searchable by date/topic/content.
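A sketch of that structured log, searchable by topic and date (the `Episode` schema here is an illustrative assumption, not a standard):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class Episode:
    timestamp: datetime
    topic: str
    transcript: str

class EpisodicLog:
    """Append-only conversation log, searchable by date and topic."""

    def __init__(self):
        self.episodes: List[Episode] = []

    def record(self, topic: str, transcript: str,
               when: Optional[datetime] = None):
        self.episodes.append(Episode(when or datetime.now(), topic, transcript))

    def search(self, topic: Optional[str] = None,
               since: Optional[datetime] = None) -> List[Episode]:
        results = self.episodes
        if topic:
            results = [e for e in results if topic.lower() in e.topic.lower()]
        if since:
            results = [e for e in results if e.timestamp >= since]
        return results
```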

Architecture Patterns for Long-Term Memory

Now let's get concrete. There are several established patterns for implementing persistent memory in chatbots.

Pattern 1: Vector Store + Retrieval Augmented Generation (RAG)

This is the most common approach for adding long-term memory to chatbots. The idea:

  1. Convert conversation snippets into vector embeddings
  2. Store embeddings in a vector database
  3. When the user sends a message, retrieve relevant past context
  4. Inject retrieved context into the LLM prompt

How it works in practice:

from langchain_openai import OpenAIEmbeddings
from langchain_milvus import Milvus
from langchain.memory import VectorStoreRetrieverMemory

# Initialize embedding model and vector store
embeddings = OpenAIEmbeddings()
vectordb = Milvus(
    embeddings,
    connection_args={"uri": "./memory.db"},
)

# Create retriever that finds top-k relevant memories
retriever = vectordb.as_retriever(search_kwargs={"k": 5})
memory = VectorStoreRetrieverMemory(retriever=retriever)

# Save important context
memory.save_context(
    {"input": "I'm a frontend developer working on React Native"},
    {"output": "Got it, I'll keep that in mind!"}
)

# Later, when user asks a question, relevant context is retrieved
memory.load_memory_variables({"input": "How should I structure my mobile app?"})
# Returns the React Native context because it's semantically relevant

Strengths:

  • Scales to arbitrary conversation lengths
  • Retrieval is semantic (finds conceptually relevant memories, not just keyword matches)
  • Works across sessions naturally

Weaknesses:

  • Embedding quality matters a lot—poor embeddings mean irrelevant retrieval
  • No guarantee the most important context is retrieved
  • Requires managing another infrastructure component (the vector database)

Vector databases to consider:

  • Milvus: High-performance, open source, handles billion-scale vectors
  • Pinecone: Managed service, easy to start, pay-per-query pricing
  • Chroma: Lightweight, embeddable, good for prototypes
  • pgvector: PostgreSQL extension, uses existing infrastructure

Pattern 2: Structured Memory with Knowledge Graphs

Instead of storing raw conversation chunks, extract structured facts and store them in a knowledge graph or relational database.

How it works:

  1. After each conversation turn, extract entities and relationships
  2. Store in a structured format: (User, prefers, concise_answers), (User, works_on, Project_X)
  3. Query the knowledge base for relevant facts when generating responses

Example implementation:

from typing import List, Dict
import json

class StructuredMemory:
    def __init__(self):
        self.facts: Dict[str, List[str]] = {}  # category -> facts
    
    def extract_and_store(self, conversation: str, llm):
        """Use LLM to extract facts from conversation."""
        prompt = f"""Extract key facts from this conversation.
        Return as JSON: {{"preferences": [...], "biographical": [...], "projects": [...]}}
        
        Conversation:
        {conversation}"""
        
        extracted = llm.generate(prompt)
        facts = json.loads(extracted)
        
        for category, items in facts.items():
            if category not in self.facts:
                self.facts[category] = []
            self.facts[category].extend(items)
    
    def get_context(self, categories: List[str]) -> str:
        """Retrieve facts for specific categories."""
        context = []
        for cat in categories:
            if cat in self.facts:
                context.append(f"{cat}: {', '.join(self.facts[cat])}")
        return "\n".join(context)

Strengths:

  • Clean, organized memory with explicit semantics
  • Easy to update or correct specific facts
  • Can handle contradictions (update old facts with new ones)

Weaknesses:

  • Extraction quality depends on LLM capability
  • Loses nuance and context present in raw conversations
  • More complex to implement than vector storage

Pattern 3: Hierarchical Memory with Compression

Recent research (including Google's Titans architecture and the MIRAS framework) points to hierarchical memory as a powerful approach. The idea: maintain memory at multiple levels of abstraction.

How it works:

  1. Recent history: Full conversation for the last N turns (verbatim)
  2. Session summaries: Compressed summaries of older parts of current session
  3. Long-term summaries: High-level summaries of past sessions
  4. Core facts: Extracted, persistent user information

class HierarchicalMemory:
    def __init__(self, llm):
        self.llm = llm
        self.recent_turns = []        # Last 10 messages
        self.session_summary = ""     # Summary of current session
        self.past_summaries = []      # Summaries of past sessions
        self.core_facts = []          # Persistent user facts
    
    def add_turn(self, user_msg: str, assistant_msg: str):
        self.recent_turns.append({"user": user_msg, "assistant": assistant_msg})
        
        # Compress older turns into summary
        if len(self.recent_turns) > 10:
            to_compress = self.recent_turns[:5]
            self.recent_turns = self.recent_turns[5:]
            
            summary_prompt = f"Summarize this conversation segment: {to_compress}"
            segment_summary = self.llm.generate(summary_prompt)
            self.session_summary += f"\n{segment_summary}"
    
    def end_session(self):
        """Called when session ends—compress everything to long-term."""
        full_session = f"{self.session_summary}\n{self.recent_turns}"
        
        # Create session summary
        summary = self.llm.generate(f"Summarize this session: {full_session}")
        self.past_summaries.append(summary)
        
        # Extract any new core facts
        facts = self.llm.generate(
            f"Extract new biographical/preference facts: {full_session}"
        )
        self.core_facts.append(facts)  # facts is raw LLM text; extend() would split a string into characters
        
        # Reset session state
        self.recent_turns = []
        self.session_summary = ""
    
    def build_context(self) -> str:
        """Build context string for LLM prompt (keeps the last 3 session summaries)."""
        context = f"""## Core User Facts
{self.core_facts}

## Previous Session Summaries
{self.past_summaries[-3:]}

## Current Session Summary
{self.session_summary}

## Recent Conversation
{self.recent_turns}"""
        return context

This approach mirrors how human memory works: vivid recall of recent events, increasingly compressed memories of older events, and distilled core knowledge that persists indefinitely.

Pattern 4: The "Surprise" Metric for Selective Memory

Not everything needs to be remembered. Google's Titans research introduces an elegant concept: the "surprise metric." The model remembers things that are unexpected or information-dense, and forgets routine exchanges.

How to implement this:

def calculate_surprise(message: str, expected_embedding, actual_embedding):
    """High surprise = message contains unexpected information.

    Surprise is one minus the cosine similarity between the expected and
    actual embeddings; `message` is passed through only for logging/debugging.
    """
    from numpy import dot
    from numpy.linalg import norm

    similarity = dot(expected_embedding, actual_embedding) / (
        norm(expected_embedding) * norm(actual_embedding)
    )
    return 1 - similarity

def should_memorize(message: str, conversation_context: str, llm, threshold=0.7):
    """Decide if a message is worth storing in long-term memory."""
    
    # Get embedding of what we'd expect given context
    expected = llm.get_embedding(
        f"Given this context: {conversation_context}\nPredict next message:"
    )
    
    # Get embedding of actual message
    actual = llm.get_embedding(message)
    
    surprise = calculate_surprise(message, expected, actual)
    
    # Also check for explicit memory-worthy content
    explicit_check = llm.generate(
        f"Does this contain user preferences, biographical info, or project details? "
        f"Message: {message}. Answer yes/no."
    )
    
    return surprise > threshold or "yes" in explicit_check.lower()

This prevents your memory store from filling up with "yes," "ok," and "thanks!" while ensuring you capture genuinely important information.

Real-World Implementation: Step by Step

Let's build a production-ready chatbot with long-term memory.

Step 1: Choose Your Stack

Recommended stack for most use cases:

  • LLM: OpenAI GPT-4 or Claude (for main conversation)
  • Embeddings: OpenAI text-embedding-3-small or Cohere embed-v3
  • Vector DB: Milvus (self-hosted) or Pinecone (managed)
  • Framework: LangChain (for orchestration)
  • Persistence: PostgreSQL (for user data, session metadata)

Step 2: Design Your Memory Schema

Before writing code, define what you're storing:

# Memory types we'll persist
MEMORY_SCHEMA = {
    "user_facts": {
        # Biographical information, preferences, work context
        "examples": ["prefers concise answers", "works at Stripe", "building React app"]
    },
    "conversation_summaries": {
        # Compressed past conversations
        "examples": ["2024-03-15: Discussed Kubernetes deployment strategies"]
    },
    "episodic_memories": {
        # Specific notable interactions
        "examples": ["User asked about rate limiting, I recommended token bucket"]
    },
    "project_context": {
        # Ongoing projects and their states
        "examples": ["Project: Mobile App, Stack: Flutter, Stage: MVP"]
    }
}

Step 3: Implement the Memory Layer

import os
from datetime import datetime
from typing import List, Dict, Optional
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_milvus import Milvus
from dataclasses import dataclass
import json

@dataclass
class MemoryEntry:
    content: str
    memory_type: str  # user_fact, summary, episodic, project
    timestamp: datetime
    session_id: str
    user_id: str

class ChatbotMemory:
    def __init__(self, user_id: str):
        self.user_id = user_id
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.llm = ChatOpenAI(model="gpt-4o")
        
        # Initialize vector store with user-specific namespace
        self.vectordb = Milvus(
            self.embeddings,
            collection_name=f"user_{user_id}_memory",
            connection_args={"uri": os.getenv("MILVUS_URI")}
        )
    
    def store_memory(self, entry: MemoryEntry):
        """Store a memory entry with metadata."""
        self.vectordb.add_texts(
            texts=[entry.content],
            metadatas=[{
                "memory_type": entry.memory_type,
                "timestamp": entry.timestamp.isoformat(),
                "session_id": entry.session_id,
                "user_id": entry.user_id
            }]
        )
    
    def retrieve_relevant(self, query: str, k: int = 5, 
                          memory_types: Optional[List[str]] = None) -> List[Dict]:
        """Retrieve memories relevant to a query."""
        # Build filter if specific types requested
        filter_expr = None
        if memory_types:
            types_str = ", ".join([f'"{t}"' for t in memory_types])
            filter_expr = f"memory_type in [{types_str}]"
        
        results = self.vectordb.similarity_search_with_score(
            query, k=k, expr=filter_expr  # langchain_milvus takes Milvus boolean expressions via `expr`
        )
        
        return [
            {
                "content": doc.page_content,
                "metadata": doc.metadata,
                "score": score
            }
            for doc, score in results
        ]
    
    def extract_facts_from_conversation(self, conversation: str) -> List[str]:
        """Use LLM to extract memorable facts from conversation."""
        prompt = f"""Analyze this conversation and extract facts worth remembering.
        
        Focus on:
        - User preferences (communication style, likes/dislikes)
        - Biographical info (job, location, technical background)
        - Project details (what they're building, tech stack)
        - Stated goals or challenges
        
        Return as JSON list of strings. Only include substantial facts, 
        not greetings or filler.
        
        Conversation:
        {conversation}
        
        Facts (JSON list):"""
        
        response = self.llm.invoke(prompt)
        try:
            return json.loads(response.content)
        except (json.JSONDecodeError, TypeError):  # LLM output wasn't valid JSON
            return []
    
    def summarize_session(self, conversation: str) -> str:
        """Create a compressed summary of a conversation session."""
        prompt = f"""Summarize this conversation in 2-3 sentences.
        Focus on: main topics discussed, decisions made, action items.
        
        Conversation:
        {conversation}
        
        Summary:"""
        
        response = self.llm.invoke(prompt)
        return response.content

    def build_context_for_prompt(self, current_message: str) -> str:
        """Build the memory context to inject into LLM prompt."""
        # Get relevant memories of each type
        user_facts = self.retrieve_relevant(
            current_message, k=3, memory_types=["user_fact"]
        )
        past_context = self.retrieve_relevant(
            current_message, k=2, memory_types=["summary", "episodic"]
        )
        project_context = self.retrieve_relevant(
            current_message, k=2, memory_types=["project"]
        )
        
        context = "## What I Know About You\n"
        if user_facts:
            context += "\n".join([m["content"] for m in user_facts])
        
        context += "\n\n## Relevant Past Conversations\n"
        if past_context:
            context += "\n".join([m["content"] for m in past_context])
        
        context += "\n\n## Your Current Projects\n"
        if project_context:
            context += "\n".join([m["content"] for m in project_context])
        
        return context

Step 4: Build the Conversation Handler

class MemoryChatbot:
    def __init__(self, user_id: str, session_id: str):
        self.user_id = user_id
        self.session_id = session_id
        self.memory = ChatbotMemory(user_id)
        self.llm = ChatOpenAI(model="gpt-4o")
        self.current_conversation = []  # Current session history
    
    def chat(self, user_message: str) -> str:
        """Process a user message and generate response."""
        # Add user message to current conversation
        self.current_conversation.append({"role": "user", "content": user_message})
        
        # Retrieve relevant long-term memory
        memory_context = self.memory.build_context_for_prompt(user_message)
        
        # Build the full prompt
        system_prompt = f"""You are a helpful AI assistant with memory.
        
{memory_context}

Use the above context to personalize your responses. Reference past 
conversations naturally when relevant. Remember the user's preferences."""
        
        messages = [
            {"role": "system", "content": system_prompt}
        ] + self.current_conversation
        
        # Generate response
        response = self.llm.invoke(messages)
        assistant_message = response.content
        
        # Add to current conversation
        self.current_conversation.append(
            {"role": "assistant", "content": assistant_message}
        )
        
        # Check if this exchange contains memorable information
        self._process_for_memory(user_message, assistant_message)
        
        return assistant_message
    
    def _process_for_memory(self, user_message: str, assistant_message: str):
        """Extract and store any memorable content from this exchange."""
        exchange = f"User: {user_message}\nAssistant: {assistant_message}"
        
        # Extract facts (run periodically, not every turn for efficiency)
        if len(self.current_conversation) % 6 == 0:  # Every 3 exchanges
            facts = self.memory.extract_facts_from_conversation(exchange)
            for fact in facts:
                self.memory.store_memory(MemoryEntry(
                    content=fact,
                    memory_type="user_fact",
                    timestamp=datetime.now(),
                    session_id=self.session_id,
                    user_id=self.user_id
                ))
    
    def end_session(self):
        """Called when session ends—persist summary to long-term memory."""
        if not self.current_conversation:
            return
        
        # Create and store session summary
        full_conversation = "\n".join([
            f"{m['role']}: {m['content']}" 
            for m in self.current_conversation
        ])
        
        summary = self.memory.summarize_session(full_conversation)
        self.memory.store_memory(MemoryEntry(
            content=f"{datetime.now().strftime('%Y-%m-%d')}: {summary}",
            memory_type="summary",
            timestamp=datetime.now(),
            session_id=self.session_id,
            user_id=self.user_id
        ))
        
        # Extract any final facts we might have missed
        facts = self.memory.extract_facts_from_conversation(full_conversation)
        for fact in facts:
            self.memory.store_memory(MemoryEntry(
                content=fact,
                memory_type="user_fact",
                timestamp=datetime.now(),
                session_id=self.session_id,
                user_id=self.user_id
            ))

Step 5: Add Memory Management

Memories shouldn't grow unbounded. Implement cleanup and deduplication:

def deduplicate_memories(memory: ChatbotMemory, memory_type: str):
    """Remove duplicate or conflicting memories."""
    # Get all memories of this type
    all_memories = memory.vectordb.get(where={"memory_type": memory_type})  # Chroma-style API shown; Milvus needs a query with an expr filter
    
    # Use LLM to identify duplicates/conflicts
    dedup_prompt = f"""Analyze these memory entries and identify:
    1. Duplicates (same information, different wording)
    2. Conflicts (contradictory information)
    
    For duplicates, keep the most recent.
    For conflicts, keep the most recent as the current truth.
    
    Entries:
    {all_memories}
    
    Return JSON: {{"to_remove": [list of IDs to delete]}}"""
    
    # ... execute and remove duplicates

def handle_memory_updates(memory: ChatbotMemory, new_fact: str):
    """Check if new fact contradicts existing memory; update if so."""
    # Find similar existing memories
    similar = memory.retrieve_relevant(new_fact, k=3, memory_types=["user_fact"])
    
    if similar:
        check_prompt = f"""Does this new fact contradict or update any existing facts?
        
        New: {new_fact}
        Existing: {[m['content'] for m in similar]}
        
        If it's an update, specify which fact to replace."""
        
        # ... handle updates appropriately

Benchmarking Your Memory System

The LongMemEval benchmark (from ICLR 2025) provides a rigorous way to evaluate chatbot memory. It tests five capabilities:

  1. Information extraction: Can the system extract and store important facts?
  2. Multi-session reasoning: Can it connect information across different conversations?
  3. Temporal reasoning: Does it understand when things happened?
  4. Knowledge updates: Can it handle changed information correctly?
  5. Abstention: Does it know when it doesn't have information (vs. hallucinating)?

The benchmark found that commercial chat assistants show a 30% accuracy drop when memorizing information across sustained interactions. This is the gap your implementation needs to close.

Key metrics to track:

  • Memory precision: What percentage of retrieved memories are actually relevant?
  • Memory recall: What percentage of relevant memories are retrieved?
  • Factual accuracy: Are stored facts correct?
  • Update handling: When facts change, are old versions properly superseded?
  • Retrieval latency: How long does memory lookup add to response time?
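The first two metrics can be computed directly from labeled retrieval runs. A minimal sketch:

```python
def memory_precision(retrieved: set, relevant: set) -> float:
    """Fraction of retrieved memories that are actually relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def memory_recall(retrieved: set, relevant: set) -> float:
    """Fraction of relevant memories that were retrieved."""
    if not relevant:
        return 1.0  # nothing to find counts as perfect recall
    return len(retrieved & relevant) / len(relevant)
```

Track both: tuning k upward trades precision for recall, and the right balance depends on how much context-window budget you can spend on memories.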

Production Considerations

Privacy and Data Handling

Long-term memory means storing user data. You need:

  • Clear consent: Users must understand what's being remembered
  • Data access: Let users view their stored memories
  • Deletion capability: "Forget me" must actually work
  • Encryption: Memories should be encrypted at rest
  • Retention policies: Define how long memories persist
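A "forget me" path is worth prototyping early. This in-memory sketch shows the shape of it; a real system would also have to delete vector-store entries, caches, and backups:

```python
from typing import Dict, List

class MemoryStore:
    """User-isolated memory with a hard-delete path for 'forget me'."""

    def __init__(self):
        self._by_user: Dict[str, List[str]] = {}

    def store(self, user_id: str, memory: str):
        self._by_user.setdefault(user_id, []).append(memory)

    def view(self, user_id: str) -> List[str]:
        # Data access: users can see everything stored about them
        return list(self._by_user.get(user_id, []))

    def forget_user(self, user_id: str) -> int:
        """Delete every memory for this user; returns how many were removed."""
        return len(self._by_user.pop(user_id, []))
```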

Scaling Memory

As your user base grows:

  • User isolation: Each user's memories must be completely separate
  • Index management: Vector indices need periodic optimization
  • Storage costs: Plan for ~$0.10-0.50 per user per month for moderate usage
  • Retrieval latency: Keep under 100ms for good UX

Failure Modes

Anticipate and handle:

  • Memory pollution: Bad extractions contaminating the memory store
  • Stale context: Old information presented as current
  • Retrieval misses: Relevant memories not being found
  • Prompt injection: Malicious content in stored memories
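For the last failure mode, a cheap first line of defense is to screen memories for instruction-like content before storing or injecting them. The phrase list below is illustrative, not exhaustive, and should complement (not replace) structural defenses like clearly delimiting memory content in the prompt:

```python
import re

# Illustrative patterns; real deployments need a broader, maintained list
SUSPICIOUS_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"system prompt",
    r"disregard .{0,30}(rules|guidelines)",
]

def looks_like_injection(memory: str) -> bool:
    """Flag stored memories that read like instructions to the model."""
    lowered = memory.lower()
    return any(re.search(p, lowered) for p in SUSPICIOUS_PATTERNS)

def sanitize_for_prompt(memories: list) -> list:
    # Drop flagged entries; alternatively, quarantine them for review
    return [m for m in memories if not looks_like_injection(m)]
```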

The Future: Context as Infrastructure

Memory is becoming a first-class concern in AI systems. The research direction is clear:

  • Memory as transformation: Not just storage, but intelligent compression and synthesis
  • Structured extraction: Moving from raw text to semantic knowledge graphs
  • Instruction evolution: Memories that change how the agent behaves, not just what it knows

Services like Dytto provide memory and context as a managed layer, so you don't have to build and maintain this infrastructure yourself. Instead of implementing vector stores, extraction pipelines, and retrieval logic, you call an API:

# Store context about a user
response = dytto.context.store(
    user_id="user_123",
    fact="Prefers detailed technical explanations",
    category="preference"
)

# Retrieve relevant context for a query
context = dytto.context.get(
    user_id="user_123",
    query="How should I explain Kubernetes to them?"
)
# Returns: [{"fact": "Prefers detailed technical explanations", ...}]

This lets you focus on building great conversational experiences while the context layer handles the complexity of memory management.

Conclusion

Building a chatbot with long-term memory transforms it from a stateless Q&A tool into something approaching a genuine assistant. The technical foundations are now well-understood: vector stores for semantic retrieval, hierarchical memory for managing different timescales, and intelligent extraction to capture what matters.

The key insights:

  1. Memory is multi-layered: Recent history, session summaries, long-term facts, and episodic memories serve different purposes
  2. Not everything needs remembering: Use surprise metrics or explicit extraction to store only valuable information
  3. Retrieval quality matters: Your system is only as good as its ability to find relevant memories
  4. Handle updates gracefully: User information changes; your memory system must handle contradictions
  5. Plan for scale and privacy: Memory systems touch sensitive data and grow over time

The difference between a chatbot users tolerate and one they actually enjoy often comes down to whether it remembers them. Build systems that remember, and you build systems worth returning to.


FAQ

How much does it cost to implement long-term memory?

Costs break down into: embedding generation (on the order of $0.02 per million tokens for OpenAI's text-embedding-3-small), vector storage ($10-100/month depending on scale and provider), and additional LLM calls for extraction and summarization. For a moderate-usage app (1,000 users, 10 sessions/month each), expect $50-200/month total.

Can I use local embedding models to reduce costs?

Yes. Models like Sentence-BERT, E5, or BGE can run locally with good quality. This eliminates embedding API costs but adds compute requirements. For most production systems, managed embedding services are simpler and sufficiently cheap.

How do I handle memory for different conversation topics?

Use metadata tagging to categorize memories by topic, project, or context. When retrieving, filter by relevant categories. Some systems use multiple vector collections—one per major topic—for cleaner separation.
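In code, the tagging approach amounts to filtering on metadata at retrieval time. A minimal sketch, assuming each memory carries a `topic` field:

```python
def filter_by_topic(memories: list, topics: list) -> list:
    """Keep only memories tagged with one of the requested topics."""
    wanted = {t.lower() for t in topics}
    return [m for m in memories if m.get("topic", "").lower() in wanted]
```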

What if the LLM extracts incorrect facts?

Build in verification loops. For high-stakes facts, confirm with the user ("Just to make sure I have this right—you said you're using PostgreSQL?"). Implement confidence scoring and only persist high-confidence extractions. Allow users to view and correct stored memories.

How do I prevent context window overflow with lots of memories?

Limit the number of retrieved memories per type (e.g., 3 user facts, 2 past summaries). Compress memories aggressively. Use a secondary LLM call to synthesize multiple memories into a single contextual paragraph if needed.

Should I store full conversations or just extracted facts?

Both. Store compressed summaries for episodic memory ("what we discussed") and extracted facts for semantic memory ("what I know about you"). The combination provides both narrative context and quick fact lookup.

How does Dytto help with chatbot memory?

Dytto provides user context and memory as a managed API. Instead of building your own vector store, extraction pipeline, and retrieval logic, you integrate Dytto's context API into your chatbot. It handles storage, retrieval, deduplication, and privacy—letting you focus on the conversation experience rather than infrastructure.
