Persistent Memory for LLMs: The Complete Developer's Guide to Long-Term AI Recall
Your chatbot just asked the same user for their name—for the fifth time this month. They've been a customer for two years. This is why persistent memory isn't optional anymore.
If you're building AI applications that interact with users across multiple sessions, you've hit this wall. Your LLM works perfectly within a single conversation, but the moment the session ends, every preference, fact, and context you learned vanishes. Users repeat themselves. Personalization becomes impossible. Your AI feels less intelligent with each restart.
This guide covers everything developers need to know about implementing persistent memory for LLMs: architectural patterns, storage options, retrieval strategies, and practical code examples for building AI that actually remembers.
What Is Persistent Memory for LLMs?
Persistent memory is external storage that allows an LLM to retain and recall information across sessions, users, and extended time periods. Unlike the context window (which functions as working memory), persistent memory survives restarts and can store far more information than any context window allows.
Think of it as the difference between human short-term and long-term memory:
- Context window (short-term): What you're actively thinking about right now. Limited capacity, immediately accessible, volatile.
- Persistent memory (long-term): Facts, experiences, and patterns you've accumulated over time. Massive capacity, requires retrieval, durable.
The fundamental challenge is bridging these two systems. An LLM can only directly access information in its context window. Persistent memory must be retrieved and injected into context to influence the model's behavior.
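In code, that bridge is a retrieve-then-inject step that runs before every model call. A minimal sketch (the function name and the OpenAI-style message shape here are illustrative, not a fixed API):

```python
def build_prompt(user_message: str, retrieved_memories: list[str]) -> list[dict]:
    """Inject retrieved memories into the context window via the system prompt."""
    memory_block = "\n".join(f"- {m}" for m in retrieved_memories)
    system = (
        "You are an assistant with access to stored user memories.\n"
        "Relevant memories:\n" + memory_block
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_message},
    ]

# Toy retrieval result; in production this comes from a vector or database lookup.
memories = ["User is allergic to shellfish", "User prefers metric units"]
prompt = build_prompt("Suggest a dinner recipe", memories)
```

Everything the model "remembers" arrives this way: whatever the retrieval layer surfaces is all the persistent memory the model can actually use on that turn.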
Why This Matters More Than You Think
Without persistent memory, every LLM interaction is isolated. This creates cascading problems:
For users:
- Re-explaining context every session ("I told you last week I'm allergic to shellfish")
- No personalization despite repeated use
- Frustrating repetition that makes AI feel dumb
For developers:
- Inability to build learning systems that improve over time
- Wasted tokens re-discovering user preferences
- No differentiation from stateless competitors
For businesses:
- Poor retention as users abandon unpersonalized experiences
- Support costs from context-free agents asking obvious questions
- Missed opportunities for proactive assistance
The industry has recognized this gap. ChatGPT's memory feature, launched in 2024, was one of the most requested additions. Claude's Projects allow persistent context. But these consumer features don't solve the engineering challenge: how do you build production-grade persistent memory for your own LLM applications?
The Memory Hierarchy: Understanding What Goes Where
Before implementing anything, you need a mental model for different memory types and their appropriate storage locations.
Episodic Memory: What Happened
Episodic memory stores records of past events and interactions. It answers questions like: "What did we discuss last Tuesday?" or "What was the user's reaction to the previous recommendation?"
Characteristics:
- Time-stamped and sequential
- Can be summarized without losing essence
- Volume grows continuously
- Often needs similarity-based retrieval
Storage approach: Vector databases, conversation logs with embeddings, summarization pipelines.
Semantic Memory: What We Know
Semantic memory stores facts, relationships, and learned knowledge. It answers: "What is the user's preferred communication style?" or "What products has this customer purchased?"
Characteristics:
- Structured or semi-structured
- Can be updated (not just appended)
- Query patterns are predictable
- Often needs exact-match retrieval
Storage approach: Structured databases, knowledge graphs, user profiles, JSON documents.
Procedural Memory: How We Behave
Procedural memory influences the model's behavioral patterns. It encodes learned rules like: "This user prefers concise responses" or "Always verify before placing orders for this account."
Characteristics:
- Affects system prompt or behavior instructions
- Changes infrequently but has high impact
- Often extracted from repeated patterns
- Applies consistently across sessions
Storage approach: Dynamic system prompts, instruction templates, rule databases.
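In practice, procedural memory usually materializes as a dynamically assembled system prompt. A minimal sketch, assuming rules are stored as plain dicts with an `active` flag (an illustrative schema, not a fixed API):

```python
def build_system_prompt(base_prompt: str, rules: list[dict]) -> str:
    """Append active behavioral rules (procedural memory) to the base system prompt."""
    active = [r["text"] for r in rules if r.get("active", True)]
    if not active:
        return base_prompt
    rule_block = "\n".join(f"- {text}" for text in active)
    return f"{base_prompt}\n\nBehavioral rules for this user:\n{rule_block}"

rules = [
    {"text": "Keep responses concise", "active": True},
    {"text": "Verify before placing orders", "active": True},
    {"text": "Use formal tone", "active": False},  # retired rule stays stored but inactive
]
prompt = build_system_prompt("You are a helpful assistant.", rules)
```

Because these rules change rarely but apply on every turn, they are cheap to load unconditionally, unlike episodic memories, which must be retrieved per query.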
A Working Memory Architecture
Most production systems need all three types working together:
┌───────────────────────────────────────────────────────────┐
│                    LLM Context Window                     │
│ ┌─────────────┐ ┌────────────────┐ ┌────────────────────┐ │
│ │   System    │ │   Retrieved    │ │      Current       │ │
│ │   Prompt    │ │    Memories    │ │    Conversation    │ │
│ │(procedural) │ │  (episodic +   │ │     (working)      │ │
│ │             │ │   semantic)    │ │                    │ │
│ └─────────────┘ └────────────────┘ └────────────────────┘ │
└───────────────────────────┬───────────────────────────────┘
                            │
         ┌──────────────────┼─────────────────────┐
         │                  │                     │
         ▼                  ▼                     ▼
┌─────────────────┐ ┌───────────────┐ ┌───────────────────────┐
│  Vector Store   │ │  Relational   │ │    Document Store     │
│   (episodes)    │ │  DB (facts)   │ │   (profiles/rules)    │
└─────────────────┘ └───────────────┘ └───────────────────────┘
Storage Options for Persistent Memory
The storage layer you choose determines what kinds of retrieval are possible, how memory scales, and what maintenance you'll need.
Option 1: Vector Databases for Semantic Search
Vector databases store information as numerical embeddings, enabling similarity-based retrieval. When the user asks about "that Italian restaurant we discussed," a vector search finds semantically related memories even if exact keywords don't match.
Popular choices:
- Chroma: Easy setup, good for prototyping, runs locally
- Pinecone: Managed service, excellent scale
- Qdrant: Open-source, strong filtering capabilities
- Weaviate: Hybrid search (vectors + keywords)
- pgvector: PostgreSQL extension, familiar SQL interface
Implementation pattern:
from datetime import datetime

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Initialize embedding model and vector store
embeddings = OpenAIEmbeddings()
vector_store = Chroma(
    collection_name="user_memories",
    embedding_function=embeddings,
    persist_directory="./memory_db"
)

# Store a memory
def store_memory(user_id: str, content: str, metadata: dict = None):
    """Persist a memory to the vector store."""
    metadata = metadata or {}
    metadata["user_id"] = user_id
    metadata["timestamp"] = datetime.utcnow().isoformat()
    vector_store.add_texts(
        texts=[content],
        metadatas=[metadata]
    )
    vector_store.persist()

# Retrieve relevant memories
def retrieve_memories(user_id: str, query: str, k: int = 5):
    """Find memories relevant to the current query."""
    results = vector_store.similarity_search(
        query,
        k=k,
        filter={"user_id": user_id}
    )
    return [doc.page_content for doc in results]
Pros:
- Natural language queries work well
- Discovers non-obvious connections
- Handles unstructured data gracefully
Cons:
- Embeddings aren't perfect—retrieval can miss or hallucinate relevance
- Requires embedding model (cost + latency)
- Exact-match queries are awkward
Option 2: Relational Databases for Structured Facts
For well-defined facts with predictable query patterns, traditional databases shine. User preferences, transaction history, relationship data—anything with clear structure belongs here.
Implementation pattern:
import sqlite3
from datetime import datetime

class StructuredMemory:
    def __init__(self, db_path: str = "memory.db"):
        self.conn = sqlite3.connect(db_path)
        self._init_tables()

    def _init_tables(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS user_facts (
                id INTEGER PRIMARY KEY,
                user_id TEXT NOT NULL,
                fact_type TEXT NOT NULL,
                fact_key TEXT NOT NULL,
                fact_value TEXT NOT NULL,
                confidence REAL DEFAULT 1.0,
                created_at TEXT,
                updated_at TEXT,
                UNIQUE(user_id, fact_type, fact_key)
            )
        """)
        self.conn.commit()

    def store_fact(self, user_id: str, fact_type: str,
                   key: str, value: str, confidence: float = 1.0):
        """Store or update a structured fact."""
        now = datetime.utcnow().isoformat()
        self.conn.execute("""
            INSERT INTO user_facts (user_id, fact_type, fact_key,
                fact_value, confidence, created_at, updated_at)
            VALUES (?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT(user_id, fact_type, fact_key) DO UPDATE SET
                fact_value = excluded.fact_value,
                confidence = excluded.confidence,
                updated_at = excluded.updated_at
        """, (user_id, fact_type, key, value, confidence, now, now))
        self.conn.commit()

    def get_facts(self, user_id: str, fact_type: str = None):
        """Retrieve facts for a user, optionally filtered by type."""
        if fact_type:
            cursor = self.conn.execute(
                "SELECT fact_key, fact_value, confidence FROM user_facts "
                "WHERE user_id = ? AND fact_type = ?",
                (user_id, fact_type)
            )
        else:
            cursor = self.conn.execute(
                "SELECT fact_type, fact_key, fact_value FROM user_facts "
                "WHERE user_id = ?",
                (user_id,)
            )
        return cursor.fetchall()
Pros:
- Exact queries, predictable results
- Easy updates and corrections
- Efficient for known access patterns
Cons:
- Rigid schema requires upfront design
- Can't handle truly unstructured data
- Semantic search requires external tool
Option 3: Document Stores for Flexible Profiles
JSON-based document stores offer a middle ground: structure without rigid schemas. User profiles, conversation summaries, and evolving preferences fit naturally.
Implementation pattern:
import json
from datetime import datetime
from pathlib import Path

class ProfileMemory:
    def __init__(self, storage_dir: str = "./profiles"):
        self.storage_dir = Path(storage_dir)
        self.storage_dir.mkdir(exist_ok=True)

    def _get_profile_path(self, user_id: str) -> Path:
        return self.storage_dir / f"{user_id}.json"

    def load_profile(self, user_id: str) -> dict:
        """Load user profile, creating default if needed."""
        path = self._get_profile_path(user_id)
        if path.exists():
            return json.loads(path.read_text())
        return {
            "user_id": user_id,
            "preferences": {},
            "facts": {},
            "interaction_style": {},
            "created_at": datetime.utcnow().isoformat()
        }

    def update_profile(self, user_id: str, updates: dict):
        """Merge updates into existing profile."""
        profile = self.load_profile(user_id)

        def deep_merge(base, updates):
            for key, value in updates.items():
                if key in base and isinstance(base[key], dict) and isinstance(value, dict):
                    deep_merge(base[key], value)
                else:
                    base[key] = value

        deep_merge(profile, updates)
        profile["updated_at"] = datetime.utcnow().isoformat()
        path = self._get_profile_path(user_id)
        path.write_text(json.dumps(profile, indent=2))
        return profile
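The deep-merge step is what lets profile updates extend nested structures instead of clobbering them. Pulled out as a standalone function, the semantics look like this:

```python
def deep_merge(base: dict, updates: dict) -> dict:
    """Recursively merge updates into base; nested dicts merge, scalars overwrite."""
    for key, value in updates.items():
        if key in base and isinstance(base[key], dict) and isinstance(value, dict):
            deep_merge(base[key], value)
        else:
            base[key] = value
    return base

profile = {"preferences": {"theme": "dark", "units": "metric"}}
deep_merge(profile, {"preferences": {"units": "imperial"}, "facts": {"city": "Oslo"}})
# "theme" survives, "units" is overwritten, and "facts" is added.
```

A naive `profile.update(updates)` would have replaced the entire `preferences` dict, silently dropping the theme setting.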
Option 4: Hybrid Systems (The Production Answer)
Real applications rarely use a single storage type. The most robust architectures combine approaches:
import json

class HybridMemory:
    """
    Combines structured facts, semantic search, and profiles
    for comprehensive persistent memory.
    """
    def __init__(self, user_id: str):
        self.user_id = user_id
        self.profiles = ProfileMemory()
        self.facts = StructuredMemory()
        self.episodes = VectorMemory()  # Vector store wrapper

    def remember(self, content: str, memory_type: str = "auto"):
        """
        Intelligently store information based on content type.
        """
        if memory_type == "auto":
            memory_type = self._classify_memory(content)
        if memory_type == "fact":
            # Extract structured fact using LLM
            fact = self._extract_fact(content)
            self.facts.store_fact(
                self.user_id,
                fact["type"],
                fact["key"],
                fact["value"]
            )
        elif memory_type == "preference":
            # Update user profile
            pref = self._extract_preference(content)
            self.profiles.update_profile(
                self.user_id,
                {"preferences": pref}
            )
        else:
            # Default to episodic storage
            self.episodes.store(self.user_id, content)

    def recall(self, query: str, max_results: int = 10) -> str:
        """
        Retrieve relevant memories from all sources.
        Returns formatted string for context injection.
        """
        memories = []
        # Always include profile
        profile = self.profiles.load_profile(self.user_id)
        if profile.get("preferences"):
            memories.append(f"User preferences: {json.dumps(profile['preferences'])}")
        # Get relevant facts
        facts = self.facts.get_facts(self.user_id)
        if facts:
            memories.append(f"Known facts: {facts}")
        # Semantic search for relevant episodes
        episodes = self.episodes.search(self.user_id, query, k=max_results)
        if episodes:
            memories.append(f"Relevant past interactions: {episodes}")
        return "\n\n".join(memories)
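`_classify_memory` is left abstract above. In production you would likely route with an LLM classifier, but a keyword heuristic is enough to illustrate the contract (the marker lists below are illustrative, not exhaustive):

```python
def classify_memory(content: str) -> str:
    """Rough heuristic routing; an LLM classifier would replace this in production."""
    lowered = content.lower()
    preference_markers = ("prefer", "don't like", "do not like", "favorite")
    fact_markers = ("my name is", "i live in", "i work", "i am a")
    if any(marker in lowered for marker in preference_markers):
        return "preference"
    if any(marker in lowered for marker in fact_markers):
        return "fact"
    # Anything unmatched defaults to episodic storage
    return "episode"
```

The important property is the fallback: anything the router cannot confidently classify lands in episodic storage, where similarity search can still find it later.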
Memory Extraction: Teaching LLMs to Remember
Storage is only half the problem. You also need to extract memorable information from conversations. Most LLMs don't naturally identify what's worth remembering.
Pattern 1: Explicit Extraction Prompts
After each conversation turn (or periodically), prompt the LLM to identify memorable content:
import json

MEMORY_EXTRACTION_PROMPT = """
Analyze the following conversation and extract information worth remembering about the user.

Focus on:
1. Explicit preferences stated ("I prefer...", "I don't like...")
2. Personal facts (name, location, job, relationships)
3. Goals and objectives they're working toward
4. Past experiences they reference
5. Communication style preferences

Conversation:
{conversation}

Return a JSON object with extracted memories:
{{
    "preferences": [{{"key": "...", "value": "..."}}],
    "facts": [{{"type": "...", "content": "..."}}],
    "goals": ["..."],
    "episodes": ["..."]
}}

Only include information explicitly stated or strongly implied. Do not infer or assume.
"""

def extract_memories(conversation: str, llm) -> dict:
    """Use LLM to identify memorable content."""
    response = llm.invoke(
        MEMORY_EXTRACTION_PROMPT.format(conversation=conversation)
    )
    return json.loads(response.content)
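One practical wrinkle: models frequently wrap JSON replies in Markdown code fences, so `json.loads` on the raw response can fail. A tolerant parser helps (a small defensive sketch, not a complete solution for malformed output):

```python
import json
import re

def parse_llm_json(raw: str):
    """Tolerantly parse JSON from an LLM reply, stripping code fences if present."""
    text = raw.strip()
    # Strip ```json ... ``` fences that models often add around JSON output.
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    return json.loads(text)
```

Pair this with a retry (re-prompting the model on `json.JSONDecodeError`) and extraction becomes far more reliable in practice.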
Pattern 2: Continuous Memory Enrichment
Rather than extracting once, continuously refine memories as new information arrives:
ENRICHMENT_PROMPT = """
You are updating a user's memory profile based on new conversation data.
Current profile:
{current_profile}
New conversation:
{new_conversation}
Instructions:
1. Identify any new facts that should be added
2. Identify any existing facts that should be updated (newer info supersedes older)
3. Identify any facts that may now be contradicted and should be flagged
4. Identify patterns or preferences that emerge from repeated behavior
Return the updated profile in the same JSON format, with a "changes" field
documenting what was modified and why.
"""
import json

class ContinuousMemoryEnricher:
    def __init__(self, llm, memory_store: HybridMemory):
        self.llm = llm
        self.memory = memory_store

    def process_conversation(self, user_id: str, conversation: str):
        """Enrich memory with information from new conversation."""
        current_profile = self.memory.profiles.load_profile(user_id)
        response = self.llm.invoke(
            ENRICHMENT_PROMPT.format(
                current_profile=json.dumps(current_profile),
                new_conversation=conversation
            )
        )
        updates = json.loads(response.content)
        self.memory.profiles.update_profile(user_id, updates)
        # Log changes for debugging/auditing
        if "changes" in updates:
            self._log_memory_changes(user_id, updates["changes"])
Pattern 3: User-Controlled Memory
Sometimes the best approach is letting users control what's remembered:
import re

class UserControlledMemory:
    """
    Memory system where users explicitly manage what's stored.
    """
    MEMORY_COMMANDS = {
        "remember": r"remember(?:\s+that)?\s+(.+)",
        "forget": r"forget(?:\s+about)?\s+(.+)",
        "what_do_you_know": r"what do you (?:know|remember) about (me|.+)",
    }

    def __init__(self, memory):
        self.memory = memory  # Underlying store (e.g., a HybridMemory instance)

    def process_message(self, user_id: str, message: str) -> tuple[str | None, bool]:
        """
        Check for memory commands and execute them.
        Returns (response, was_command).
        """
        for command, pattern in self.MEMORY_COMMANDS.items():
            match = re.match(pattern, message, re.IGNORECASE)
            if match:
                return self._execute_command(user_id, command, match.group(1)), True
        return None, False

    def _execute_command(self, user_id: str, command: str, content: str) -> str:
        if command == "remember":
            self.memory.remember(user_id, content, explicit=True)
            return f"I'll remember that: {content}"
        elif command == "forget":
            deleted = self.memory.forget(user_id, content)
            if deleted:
                return f"I've forgotten about {content}"
            return f"I don't have any memories matching '{content}'"
        elif command == "what_do_you_know":
            memories = self.memory.get_all(user_id)
            if memories:
                return f"Here's what I remember about you:\n{self._format_memories(memories)}"
            return "I don't have any stored memories about you yet."
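The command patterns above can be exercised on their own. A trimmed dispatcher shows how messages route to commands (using just the `remember` and `forget` patterns):

```python
import re

MEMORY_COMMANDS = {
    "remember": r"remember(?:\s+that)?\s+(.+)",
    "forget": r"forget(?:\s+about)?\s+(.+)",
}

def match_command(message: str):
    """Return (command, payload) if the message is a memory command, else None."""
    for command, pattern in MEMORY_COMMANDS.items():
        match = re.match(pattern, message, re.IGNORECASE)
        if match:
            return command, match.group(1)
    return None
```

Because `re.match` anchors at the start of the string, "remember" mid-sentence ("I can't remember...") won't trigger the command, which is usually the behavior you want.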
Retrieval Strategies: Getting the Right Memories at the Right Time
Having memories stored is useless if you can't retrieve the right ones when needed. This is where most memory systems fall apart.
Strategy 1: Query-Based Retrieval
The simplest approach: embed the user's current message and find similar past content.
def simple_retrieval(query: str, user_id: str, k: int = 5) -> list:
    """Retrieve memories most similar to current query."""
    return vector_store.similarity_search(
        query,
        k=k,
        filter={"user_id": user_id}
    )
Problems:
- Current query may not capture what memories are actually needed
- Misses memories relevant to context but not query
- No weighting for importance or recency
Strategy 2: Multi-Query Retrieval
Generate multiple search queries to capture different aspects:
QUERY_EXPANSION_PROMPT = """
Given this user message, generate 3-5 search queries that would find
relevant memories. Consider:
1. Direct topic matches
2. Related preferences
3. Past similar interactions
4. Relevant context that might not be explicitly mentioned
User message: {message}
Return as JSON array of query strings.
"""
import json

def expanded_retrieval(message: str, user_id: str, llm) -> list:
    """Use multiple queries for better recall."""
    # Generate expanded queries
    expansion = llm.invoke(QUERY_EXPANSION_PROMPT.format(message=message))
    queries = json.loads(expansion.content)
    # Retrieve for each query, deduplicating across result sets
    all_results = []
    seen_ids = set()
    for query in queries:
        results = vector_store.similarity_search(
            query, k=3, filter={"user_id": user_id}
        )
        for doc in results:
            if doc.id not in seen_ids:
                all_results.append(doc)
                seen_ids.add(doc.id)
    return all_results
Strategy 3: Time-Weighted Retrieval
Recent memories often matter more than ancient ones:
from datetime import datetime

def time_weighted_retrieval(query: str, user_id: str, k: int = 5) -> list:
    """
    Retrieve memories with recency weighting.
    """
    # Get more candidates than needed
    candidates = vector_store.similarity_search(
        query, k=k*3, filter={"user_id": user_id}
    )
    # Calculate combined score
    now = datetime.utcnow()
    scored_results = []
    for doc in candidates:
        similarity = doc.metadata.get("similarity", 0.5)
        created = datetime.fromisoformat(doc.metadata["timestamp"])
        age_days = (now - created).days
        # Exponential decay: half-life of 30 days
        recency_score = 0.5 ** (age_days / 30)
        # Combined score (tune weights as needed)
        combined = (0.7 * similarity) + (0.3 * recency_score)
        scored_results.append((combined, doc))
    # Sort and return top k
    scored_results.sort(reverse=True, key=lambda x: x[0])
    return [doc for _, doc in scored_results[:k]]
Strategy 4: Importance-Weighted Retrieval
Not all memories are equally important. Weight retrieval by significance:
def importance_weighted_retrieval(query: str, user_id: str, k: int = 5) -> list:
    """
    Combine semantic similarity with importance scores.
    """
    candidates = vector_store.similarity_search(
        query, k=k*3, filter={"user_id": user_id}
    )
    scored = []
    for doc in candidates:
        similarity = doc.metadata.get("similarity", 0.5)
        importance = doc.metadata.get("importance", 0.5)
        access_count = doc.metadata.get("access_count", 0)
        # Memories accessed more often are likely more valuable
        usage_boost = min(0.2, access_count * 0.02)
        combined = (0.5 * similarity) + (0.4 * importance) + (0.1 * usage_boost)
        scored.append((combined, doc))
    scored.sort(reverse=True, key=lambda x: x[0])
    # Update access counts for retrieved memories
    for _, doc in scored[:k]:
        increment_access_count(doc.id)  # Helper that bumps the stored access_count
    return [doc for _, doc in scored[:k]]
Context Injection: Formatting Memories for LLM Consumption
Retrieved memories must be formatted for effective context injection. The format significantly impacts how well the LLM uses the information.
Pattern 1: Structured Section
def format_memories_structured(memories: dict) -> str:
    """Format memories as labeled sections."""
    sections = []
    if memories.get("user_profile"):
        profile = memories["user_profile"]
        sections.append(f"""## About This User
- Name: {profile.get('name', 'Unknown')}
- Preferences: {', '.join(profile.get('preferences', []))}
- Communication style: {profile.get('style', 'Not specified')}""")
    if memories.get("recent_context"):
        sections.append(f"""## Recent Interactions
{chr(10).join('- ' + m for m in memories['recent_context'])}""")
    if memories.get("relevant_history"):
        sections.append(f"""## Relevant Past Discussions
{chr(10).join('- ' + m for m in memories['relevant_history'])}""")
    return "\n\n".join(sections)
Pattern 2: Natural Language Summary
MEMORY_SUMMARY_PROMPT = """
Summarize these memories into a natural paragraph that provides context
for the upcoming conversation. Be concise but include all relevant details.
Memories:
{memories}
Write as a brief background note, not a list.
"""
def format_memories_natural(memories: list[str], llm) -> str:
    """Convert memory list to natural language summary."""
    response = llm.invoke(
        MEMORY_SUMMARY_PROMPT.format(memories="\n".join(memories))
    )
    return response.content
Pattern 3: Just-In-Time Injection
Rather than loading all memories upfront, inject them when relevant:
class JITMemoryInjector:
    """
    Inject memories dynamically as conversation progresses.
    """
    def __init__(self, memory_store, llm):
        self.memory = memory_store
        self.llm = llm
        self.injected_ids = set()

    def get_context_for_turn(self, user_id: str, current_message: str,
                             conversation_so_far: list) -> str:
        """
        Determine what memories to inject for this specific turn.
        """
        # Check if any memories should be triggered
        relevant = self.memory.search(user_id, current_message, k=3)
        new_memories = []
        for mem in relevant:
            if mem.id not in self.injected_ids:
                # Check if memory is actually relevant to current context
                if self._is_relevant(mem, current_message, conversation_so_far):
                    new_memories.append(mem)
                    self.injected_ids.add(mem.id)
        if new_memories:
            return f"\n[Context from memory: {self._format(new_memories)}]\n"
        return ""

    def _is_relevant(self, memory, message, conversation) -> bool:
        """Use LLM to verify memory is actually relevant."""
        check = self.llm.invoke(f"""
Is this memory relevant to the current conversation context?

Memory: {memory.content}
Current message: {message}
Recent conversation: {conversation[-3:]}

Reply only YES or NO.
""")
        return "YES" in check.content.upper()
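Whichever injection pattern you use, retrieved memories compete with the live conversation for context space. A simple budget guard keeps them bounded, using a rough four-characters-per-token heuristic (an approximation; a real system would use the model's tokenizer):

```python
def fit_to_budget(memories: list[str], max_tokens: int, chars_per_token: int = 4) -> list[str]:
    """Keep highest-priority memories (list order = priority) within a token budget."""
    budget_chars = max_tokens * chars_per_token
    kept, used = [], 0
    for memory in memories:
        cost = len(memory)
        if used + cost > budget_chars:
            break  # stop at the first memory that would blow the budget
        kept.append(memory)
        used += cost
    return kept
```

Because the input list is already ranked by your retrieval scoring, truncating from the tail drops the least valuable memories first.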
Building with Dytto: Persistent Memory as a Service
While you can build all of this yourself, Dytto provides a complete context layer for AI applications that handles the complexity of persistent memory.
Why Developers Choose Dytto
1. Unified Memory API Instead of managing multiple storage backends, Dytto provides a single API for all memory types:
from dytto import DyttoClient

client = DyttoClient(api_key="your-key")

# Store any type of context
client.context.store(
    user_id="user-123",
    content="Prefers morning meetings, vegetarian, working on Q2 roadmap",
    category="preferences"
)

# Retrieve with semantic understanding
relevant = client.context.search(
    user_id="user-123",
    query="scheduling a lunch meeting",
    limit=5
)
2. Automatic Memory Extraction Dytto analyzes conversations and extracts memorable content without explicit prompting:
# After a conversation, just send it
client.observe(
    user_id="user-123",
    messages=conversation_history
)
# Dytto automatically extracts and stores relevant memories
3. Smart Retrieval Built-in weighting for recency, importance, and relevance—no custom scoring logic needed:
# Get context optimized for the current conversation
context = client.context.get(
    user_id="user-123",
    current_message="Let's schedule that follow-up",
    max_tokens=2000  # Respects your context budget
)
4. Cross-Platform Persistence Memories sync across all your AI touchpoints—chatbots, voice assistants, email agents—creating unified user understanding.
Implementation Example
Here's a complete chatbot with Dytto-powered persistent memory:
from openai import OpenAI
from dytto import DyttoClient

openai_client = OpenAI()
dytto = DyttoClient(api_key="your-dytto-key")

def chat_with_memory(user_id: str, message: str, conversation: list) -> str:
    # Retrieve relevant memories
    memories = dytto.context.get(
        user_id=user_id,
        current_message=message,
        max_tokens=1500
    )

    # Build context-aware prompt
    system_prompt = f"""You are a helpful assistant with memory of past interactions.

What you know about this user:
{memories}

Use this context naturally. Don't explicitly mention "your memory" unless asked."""

    messages = [
        {"role": "system", "content": system_prompt},
        *conversation,
        {"role": "user", "content": message}
    ]

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages
    )
    assistant_message = response.choices[0].message.content

    # Update conversation and observe for new memories
    conversation.append({"role": "user", "content": message})
    conversation.append({"role": "assistant", "content": assistant_message})

    # Dytto extracts memorable content in the background
    dytto.observe(user_id=user_id, messages=conversation[-4:])

    return assistant_message
Best Practices for Production Memory Systems
1. Privacy First
User memories are sensitive. Implement proper controls:
class PrivacyAwareMemory:
    def store(self, user_id: str, content: str, **kwargs):
        # Never store obvious PII in raw form
        sanitized = self._sanitize_pii(content)
        # Log what's being stored (for user transparency)
        self._audit_log(user_id, "store", sanitized)
        # Respect user preferences
        if not self._user_allows_memory(user_id):
            return
        self._storage.store(user_id, sanitized, **kwargs)

    def delete_all(self, user_id: str):
        """Complete memory deletion for GDPR/privacy compliance."""
        self._storage.delete_by_user(user_id)
        self._audit_log(user_id, "delete_all", "User requested full deletion")
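`_sanitize_pii` is referenced but not shown. A minimal regex-based sketch gives the idea (the patterns are illustrative, not exhaustive; production systems need broader, locale-aware detection):

```python
import re

# Illustrative patterns only; real PII detection needs much wider coverage.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"),
}

def sanitize_pii(text: str) -> str:
    """Replace obvious PII with typed placeholders before storage."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```

Pattern order matters: the SSN pattern runs before the looser phone pattern so digit groups get the more specific label.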
2. Memory Decay and Cleanup
Don't hoard forever. Implement intelligent cleanup:
from datetime import datetime, timedelta

def cleanup_stale_memories(user_id: str, max_age_days: int = 365):
    """Remove old, unused memories."""
    cutoff = datetime.utcnow() - timedelta(days=max_age_days)
    stale = memory_store.query(
        user_id=user_id,
        last_accessed_before=cutoff,
        access_count_less_than=2
    )
    for memory in stale:
        memory_store.archive(memory.id)  # Archive, don't delete
3. Conflict Resolution
When new information contradicts old memories:
def handle_contradiction(user_id: str, old_memory: Memory, new_info: str):
    """
    Handle conflicting information gracefully.
    Three alternative strategies are shown; pick one.
    """
    # Option 1: Newest wins (simple)
    old_memory.update(content=new_info, updated_at=datetime.utcnow())

    # Option 2: Keep both with temporal markers
    old_memory.metadata["superseded_at"] = datetime.utcnow().isoformat()
    create_memory(user_id, new_info, supersedes=old_memory.id)

    # Option 3: Ask for clarification
    return f"I have conflicting information. Previously: '{old_memory.content}'. Now: '{new_info}'. Which is correct?"
4. Testing Memory Systems
Memory bugs are subtle. Test thoroughly:
def test_memory_persistence():
    """Verify memories survive session boundaries."""
    user_id = "test-user"
    memory = MemorySystem()
    # Store memory
    memory.store(user_id, "User prefers dark mode")
    # Simulate session restart
    memory = MemorySystem()  # Fresh instance
    # Verify retrieval
    results = memory.search(user_id, "display preferences")
    assert any("dark mode" in r.content for r in results)

def test_memory_contradiction():
    """Verify newer info supersedes older."""
    user_id = "test-user"
    memory = MemorySystem()
    memory.store(user_id, "User's favorite color is blue")
    memory.store(user_id, "User's favorite color is green")  # Changed mind
    results = memory.search(user_id, "favorite color")
    # Should return green, not blue
    assert "green" in results[0].content
Conclusion: Memory Is the Moat
The difference between a demo and a product is often memory. Users tolerate stateless interactions exactly once. After that, they expect the AI to know them—their preferences, their history, their context.
Building persistent memory isn't trivial. You need to handle multiple storage types, implement smart retrieval, extract memories without explicit instruction, and maintain privacy throughout. But the investment pays off in user retention, satisfaction, and the ability to build genuinely personalized experiences.
Whether you build your own memory layer or use a service like Dytto, the architectural patterns remain the same:
- Separate storage by memory type (episodic, semantic, procedural)
- Extract memories automatically from conversations
- Retrieve intelligently with multi-signal ranking
- Inject contextually without overwhelming the context window
- Respect privacy and give users control
Your users shouldn't have to repeat themselves. Build memory that lasts.
Building AI that needs to remember? Dytto provides a complete context layer for LLM applications—persistent memory, automatic extraction, and smart retrieval in a single API. Start free at dytto.app.