Fast Memory Retrieval (Python)
Optimize memory retrieval speed for AI agents using smart indexing strategies, caching patterns, and query optimization in Python.
Why Retrieval Speed Matters
When your AI agent queries memory during a conversation, every millisecond counts. A user waiting 2 seconds for a response feels the lag. At 200ms, it feels instant. The difference? Smart indexing and retrieval optimization.
In this guide, we'll explore practical Python patterns to achieve sub-100ms memory retrieval, even with thousands of stored memories. You'll learn indexing strategies, caching techniques, and query optimization that work at scale.
The Retrieval Bottlenecks
Before optimizing, understand where time goes. Typical memory retrieval involves three expensive operations:
- Embedding generation: Converting query text to vectors (50-200ms)
- Similarity search: Finding nearest neighbors (10-500ms depending on index)
- Metadata filtering: Applying tag/project filters (5-50ms)
Let's attack each one systematically.
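Before diving in, it helps to confirm where your own pipeline spends its time. A rough timing harness like this sketch makes the breakdown visible (generate_embedding, vector_index, and apply_filters are hypothetical stand-ins for your own components):
import time

async def profile_retrieval(query: str):
    t0 = time.perf_counter()
    embedding = await generate_embedding(query)  # hypothetical embedding call
    t1 = time.perf_counter()
    candidates = vector_index.search(embedding, k=30)  # hypothetical index handle
    t2 = time.perf_counter()
    results = apply_filters(candidates)  # hypothetical metadata filter
    t3 = time.perf_counter()
    print(
        f"embed: {(t1 - t0) * 1000:.1f}ms | "
        f"search: {(t2 - t1) * 1000:.1f}ms | "
        f"filter: {(t3 - t2) * 1000:.1f}ms"
    )
    return results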
Strategy 1: Smart Indexing
The right index structure makes or breaks retrieval performance. For memory stores, you need both vector indexes and traditional B-tree indexes working together.
Vector Index Selection
Not all vector indexes are equal. Here's a quick comparison for memory retrieval use cases:
# HNSW (Hierarchical Navigable Small World)
# Best for: < 1M memories, high recall requirements
# Trade-off: More memory, faster search
# IVF (Inverted File Index)
# Best for: > 1M memories, can tolerate lower recall
# Trade-off: Less memory, requires training
from hnswlib import Index
def create_hnsw_index(dim: int = 1536, max_elements: int = 100000):
index = Index(space='cosine', dim=dim)
index.init_index(
max_elements=max_elements,
ef_construction=200, # Higher = better recall, slower build
M=16 # Higher = more memory, better recall
)
index.set_ef(50) # Search-time parameter
    return index
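For the IVF side of that trade-off, here is a minimal sketch using faiss; the nlist and nprobe values are illustrative starting points, not tuned recommendations:
import faiss

def create_ivf_index(dim: int = 1536, nlist: int = 1024):
    # Inner product approximates cosine similarity on normalized vectors
    quantizer = faiss.IndexFlatIP(dim)
    index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
    # Unlike HNSW, IVF must be trained on a representative sample
    # before vectors can be added: index.train(sample_vectors)
    index.nprobe = 16  # clusters probed per query: higher = better recall, slower
    return index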
Composite Indexes for Filtering
When you filter by project or memory type before vector search, composite indexes prevent full scans:
-- SQL index strategy for memory tables
CREATE INDEX idx_memory_project_type ON memories(project_id, memory_type);
CREATE INDEX idx_memory_tags ON memories USING GIN(tags);
CREATE INDEX idx_memory_created ON memories(created_at DESC);
# In Python with SQLAlchemy
from sqlalchemy import Index
Index('idx_project_type', Memory.project_id, Memory.memory_type)
Index('idx_tags', Memory.tags, postgresql_using='gin')
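With those indexes in place, a filtered query like the sketch below (project_id is a placeholder variable) can be served from idx_memory_project_type instead of a full table scan:
from sqlalchemy import select

stmt = (
    select(Memory)
    .where(Memory.project_id == project_id, Memory.memory_type == "decision")
    .order_by(Memory.created_at.desc())
    .limit(50)
)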
Strategy 2: Query-Time Caching
Embedding generation is often the slowest step. Cache aggressively but intelligently.
import json
from hashlib import sha256
import redis.asyncio as redis  # async client so cache lookups don't block the event loop
class EmbeddingCache:
def __init__(self, redis_client: redis.Redis, ttl: int = 3600):
self.redis = redis_client
self.ttl = ttl
def _cache_key(self, text: str) -> str:
return f"emb:{sha256(text.encode()).hexdigest()[:16]}"
async def get_embedding(self, text: str) -> list[float] | None:
key = self._cache_key(text)
        cached = await self.redis.get(key)
if cached:
return json.loads(cached)
return None
async def set_embedding(self, text: str, embedding: list[float]):
key = self._cache_key(text)
        await self.redis.setex(key, self.ttl, json.dumps(embedding))
For local development, use an in-memory LRU cache. In production, Redis gives you shared caching across instances.
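A minimal sketch of that local path, assuming a synchronous embed_text() helper (hypothetical) that calls your embedding model:
from functools import lru_cache

@lru_cache(maxsize=4096)
def cached_embedding(text: str) -> tuple[float, ...]:
    # lru_cache requires hashable return values, so store a tuple, not a list
    return tuple(embed_text(text))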
Strategy 3: Pre-filtering Before Vector Search
Don't search all memories when you only need project-specific ones. Pre-filtering reduces the search space dramatically.
class FastMemoryRetriever:
def __init__(self, store: MemoryStore, index: VectorIndex):
self.store = store
self.index = index
# Maintain per-project sub-indexes
self.project_indexes: dict[str, VectorIndex] = {}
async def retrieve(
self,
query: str,
project_id: str | None = None,
memory_types: list[str] | None = None,
limit: int = 10
) -> list[Memory]:
# Use project-specific index if available
if project_id and project_id in self.project_indexes:
search_index = self.project_indexes[project_id]
else:
search_index = self.index
embedding = await self.get_or_create_embedding(query)
# Search with 3x limit, then filter
candidates = search_index.search(embedding, k=limit * 3)
# Apply metadata filters in Python (fast on small sets)
results = []
        for memory_id, score in candidates:
            memory = await self.store.get(memory_id)
if memory_types and memory.type not in memory_types:
continue
results.append(memory)
if len(results) >= limit:
break
        return results
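The class above reserves project_indexes but never fills it. One way a per-project sub-index might be built, sketched here with hypothetical store.count() and store.list() methods and an integer memory.int_id for hnswlib labels:
    async def ensure_project_index(self, project_id: str, threshold: int = 1000):
        # Only projects with enough memories earn a dedicated sub-index
        count = await self.store.count(project_id=project_id)
        if count < threshold or project_id in self.project_indexes:
            return
        index = create_hnsw_index(max_elements=count * 2)
        for memory in await self.store.list(project_id=project_id):
            # hnswlib labels must be integers, hence the assumed int_id field
            index.add_items([memory.embedding], [memory.int_id])
        self.project_indexes[project_id] = index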
Strategy 4: Async Parallel Retrieval
When you need memories from multiple sources, parallelize the requests:
import asyncio
async def parallel_retrieve(
query: str,
retrievers: list[MemoryRetriever]
) -> list[Memory]:
tasks = [r.retrieve(query) for r in retrievers]
results = await asyncio.gather(*tasks)
# Merge and deduplicate by ID
seen = set()
merged = []
for batch in results:
for memory in batch:
if memory.id not in seen:
seen.add(memory.id)
merged.append(memory)
# Re-rank merged results by relevance
    return sorted(merged, key=lambda m: m.score, reverse=True)[:10]
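Usage is one call across however many sources you have; asyncio.gather runs the retrievers concurrently, so total latency tracks the slowest source rather than the sum. This example assumes each Memory carries a score set by its retriever, and the retriever names are placeholders:
memories = await parallel_retrieve(
    "What did we decide about the auth flow?",
    [project_retriever, session_retriever, global_retriever],
)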
Strategy 5: Tiered Retrieval
Not every query needs the full memory scan. Implement a fast path for common patterns:
from datetime import datetime, timedelta

class TieredRetriever:
async def retrieve(self, query: str, context: QueryContext) -> list[Memory]:
# Tier 1: Hot cache (< 5ms)
hot_results = self.hot_cache.get(context.session_id)
if hot_results and self._matches_query(hot_results, query):
return hot_results
# Tier 2: Recent memories (< 20ms)
recent = await self.store.get_recent(
project_id=context.project_id,
limit=50,
since=datetime.utcnow() - timedelta(hours=1)
)
if self._has_relevant(recent, query):
return self._filter_relevant(recent, query)[:10]
# Tier 3: Full vector search (< 100ms)
        return await self.vector_search(query, context)
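The _matches_query and _has_relevant checks are left abstract above. A cheap lexical-overlap stand-in, assuming each Memory has a content string, keeps tiers 1 and 2 fast; a production system might swap in a lightweight reranker:
    def _has_relevant(self, memories: list[Memory], query: str) -> bool:
        # Crude term-overlap test: cheap enough to run on ~50 recent memories,
        # good enough to decide whether the tier 3 vector search is needed
        terms = set(query.lower().split())
        return any(terms & set(m.content.lower().split()) for m in memories)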
Benchmarking Your Retrieval
Always measure. Here's a simple benchmarking pattern:
import random
import time
async def benchmark_retrieval(retriever, queries: list[str], runs: int = 100):
latencies = []
for _ in range(runs):
query = random.choice(queries)
start = time.perf_counter()
await retriever.retrieve(query)
latencies.append((time.perf_counter() - start) * 1000)
print(f"p50: {sorted(latencies)[len(latencies)//2]:.1f}ms")
print(f"p95: {sorted(latencies)[int(len(latencies)*0.95)]:.1f}ms")
print(f"p99: {sorted(latencies)[int(len(latencies)*0.99)]:.1f}ms") Target p95 under 100ms for a responsive agent experience.
Ready for Fast Memory?
Building fast, reliable memory retrieval is complex—you need to balance indexing overhead, cache invalidation, and query optimization. CodeMem handles all of this out of the box, with sub-50ms p95 latency on our managed infrastructure.