Fast Memory Retrieval (Python)

Optimize memory retrieval speed for AI agents using smart indexing strategies, caching patterns, and query optimization in Python.

CodeMem Team

Why Retrieval Speed Matters

When your AI agent queries memory during a conversation, every millisecond counts. A user waiting 2 seconds for a response feels the lag. At 200ms, it feels instant. The difference? Smart indexing and retrieval optimization.

In this guide, we'll explore practical Python patterns to achieve sub-100ms memory retrieval, even with thousands of stored memories. You'll learn indexing strategies, caching techniques, and query optimization that work at scale.

The Retrieval Bottlenecks

Before optimizing, understand where time goes. Typical memory retrieval involves three expensive operations:

  • Embedding generation: Converting query text to vectors (50-200ms)
  • Similarity search: Finding nearest neighbors (10-500ms depending on index)
  • Metadata filtering: Applying tag/project filters (5-50ms)

Let's attack each one systematically.
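Before touching any of these, instrument them. A tiny context manager is enough to see where a retrieval call actually spends its time; the `time.sleep` calls below are stand-ins for your real embedding and search stages:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str, timings: dict[str, float]):
    # Record the wall-clock duration of the wrapped block, in milliseconds
    start = time.perf_counter()
    yield
    timings[label] = (time.perf_counter() - start) * 1000

timings: dict[str, float] = {}
with timed("embedding", timings):
    time.sleep(0.01)   # stand-in for embedding generation
with timed("search", timings):
    time.sleep(0.005)  # stand-in for similarity search

print({k: round(v, 1) for k, v in timings.items()})
```

Run this around each stage of your pipeline once, and you'll know which of the three strategies below pays off first.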

Strategy 1: Smart Indexing

The right index structure makes or breaks retrieval performance. For memory stores, you need both vector indexes and traditional B-tree indexes working together.

Vector Index Selection

Not all vector indexes are equal. Here's a quick comparison for memory retrieval use cases:

# HNSW (Hierarchical Navigable Small World)
# Best for: < 1M memories, high recall requirements
# Trade-off: More memory, faster search

# IVF (Inverted File Index)
# Best for: > 1M memories, can tolerate lower recall
# Trade-off: Less memory, requires training

from hnswlib import Index

def create_hnsw_index(dim: int = 1536, max_elements: int = 100000):
    index = Index(space='cosine', dim=dim)
    index.init_index(
        max_elements=max_elements,
        ef_construction=200,  # Higher = better recall, slower build
        M=16                   # Higher = more memory, better recall
    )
    index.set_ef(50)  # Search-time parameter
    return index

Composite Indexes for Filtering

When you filter by project or memory type before vector search, composite indexes prevent full scans:

-- SQL index strategy for memory tables
CREATE INDEX idx_memory_project_type ON memories(project_id, memory_type);
CREATE INDEX idx_memory_tags ON memories USING GIN(tags);
CREATE INDEX idx_memory_created ON memories(created_at DESC);

# The same indexes declared in Python with SQLAlchemy
from sqlalchemy import Index

Index('idx_memory_project_type', Memory.project_id, Memory.memory_type)
Index('idx_memory_tags', Memory.tags, postgresql_using='gin')
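GIN indexes are Postgres-specific, but the core effect of a composite index is easy to verify anywhere. As a sketch, stdlib sqlite3 can show the planner switching from a full table scan to an index search once the composite index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE memories (
        id INTEGER PRIMARY KEY,
        project_id TEXT,
        memory_type TEXT,
        content TEXT
    )
""")
conn.execute(
    "CREATE INDEX idx_memory_project_type ON memories(project_id, memory_type)"
)

# EXPLAIN QUERY PLAN reveals whether the filter hits the index or scans the table
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM memories WHERE project_id = ? AND memory_type = ?",
    ("proj-1", "decision"),
).fetchall()
detail = plan[0][-1]
print(detail)  # e.g. "SEARCH memories USING INDEX idx_memory_project_type (...)"
```

The same `EXPLAIN` habit applies in Postgres: check that your pre-filter query plans say index scan, not sequential scan, before blaming the vector index for slow retrieval.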

Strategy 2: Query-Time Caching

Embedding generation is often the slowest step. Cache aggressively but intelligently.

import json
from hashlib import sha256

import redis.asyncio as redis  # async client, so cache lookups don't block the loop

class EmbeddingCache:
    def __init__(self, redis_client: redis.Redis, ttl: int = 3600):
        self.redis = redis_client
        self.ttl = ttl
    
    def _cache_key(self, text: str) -> str:
        # Hash the text so arbitrary-length queries map to short, safe keys
        return f"emb:{sha256(text.encode()).hexdigest()[:16]}"
    
    async def get_embedding(self, text: str) -> list[float] | None:
        cached = await self.redis.get(self._cache_key(text))
        if cached:
            return json.loads(cached)
        return None
    
    async def set_embedding(self, text: str, embedding: list[float]) -> None:
        # TTL keeps stale embeddings from piling up if the model changes
        await self.redis.setex(self._cache_key(text), self.ttl, json.dumps(embedding))

For local development, use an in-memory LRU cache. In production, Redis gives you shared caching across instances.

Strategy 3: Pre-filtering Before Vector Search

Don't search all memories when you only need project-specific ones. Pre-filter reduces the search space dramatically.

class FastMemoryRetriever:
    def __init__(self, store: MemoryStore, index: VectorIndex):
        self.store = store
        self.index = index
        # Maintain per-project sub-indexes
        self.project_indexes: dict[str, VectorIndex] = {}
    
    async def retrieve(
        self,
        query: str,
        project_id: str | None = None,
        memory_types: list[str] | None = None,
        limit: int = 10
    ) -> list[Memory]:
        # Use project-specific index if available
        if project_id and project_id in self.project_indexes:
            search_index = self.project_indexes[project_id]
        else:
            search_index = self.index
        
        # Helper that consults the embedding cache before calling the model
        embedding = await self.get_or_create_embedding(query)
        
        # Over-fetch 3x the limit, then filter down
        candidates = search_index.search(embedding, k=limit * 3)
        
        # Apply metadata filters in Python (fast on small candidate sets)
        results = []
        for memory_id, _score in candidates:
            memory = await self.store.get(memory_id)
            if memory_types and memory.type not in memory_types:
                continue
            results.append(memory)
            if len(results) >= limit:
                break
        
        return results
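How the per-project sub-indexes get populated is up to your store; the essential idea is just partitioning memory IDs by project so a filtered query only ever touches its own slice. A minimal sketch (the dict of ID lists stands in for real per-project vector indexes):

```python
from collections import defaultdict

def build_project_partitions(memories: list[dict]) -> dict[str, list[str]]:
    # Group memory IDs by project so a filtered query searches one partition,
    # not the whole store
    partitions: dict[str, list[str]] = defaultdict(list)
    for memory in memories:
        partitions[memory["project_id"]].append(memory["id"])
    return dict(partitions)

memories = [
    {"id": "m1", "project_id": "alpha"},
    {"id": "m2", "project_id": "beta"},
    {"id": "m3", "project_id": "alpha"},
]
partitions = build_project_partitions(memories)
print(partitions["alpha"])  # only these IDs need a vector comparison
```

With 50 projects of roughly equal size, this alone cuts the search space for a project-scoped query to about 2% of the store.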

Strategy 4: Async Parallel Retrieval

When you need memories from multiple sources, parallelize the requests:

import asyncio

async def parallel_retrieve(
    query: str,
    retrievers: list[MemoryRetriever]
) -> list[Memory]:
    tasks = [r.retrieve(query) for r in retrievers]
    results = await asyncio.gather(*tasks)
    
    # Merge and deduplicate by ID (first occurrence wins)
    seen = set()
    merged = []
    for batch in results:
        for memory in batch:
            if memory.id not in seen:
                seen.add(memory.id)
                merged.append(memory)
    
    # Re-rank merged results by relevance
    # (assumes each Memory carries the similarity score its retriever assigned)
    return sorted(merged, key=lambda m: m.score, reverse=True)[:10]

Strategy 5: Tiered Retrieval

Not every query needs the full memory scan. Implement a fast path for common patterns:

from datetime import datetime, timedelta, timezone

class TieredRetriever:
    async def retrieve(self, query: str, context: QueryContext) -> list[Memory]:
        # Tier 1: Hot cache (< 5ms)
        hot_results = self.hot_cache.get(context.session_id)
        if hot_results and self._matches_query(hot_results, query):
            return hot_results
        
        # Tier 2: Recent memories (< 20ms)
        recent = await self.store.get_recent(
            project_id=context.project_id,
            limit=50,
            since=datetime.now(timezone.utc) - timedelta(hours=1)
        )
        if self._has_relevant(recent, query):
            return self._filter_relevant(recent, query)[:10]
        
        # Tier 3: Full vector search (< 100ms)
        return await self.vector_search(query, context)
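The relevance checks (`_matches_query`, `_has_relevant`) are left abstract above. One cheap option for the fast path, sketched here as a hypothetical standalone helper, is plain token overlap; it's crude, but it costs microseconds and only has to decide whether the cheap tier is good enough before falling through to vector search:

```python
def matches_query(memories: list[dict], query: str, threshold: float = 0.3) -> bool:
    # Cheap lexical check: does any cached memory share enough tokens
    # with the query to skip the vector-search tier?
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return False
    for memory in memories:
        memory_tokens = set(memory["content"].lower().split())
        overlap = len(query_tokens & memory_tokens) / len(query_tokens)
        if overlap >= threshold:
            return True
    return False

cached = [{"content": "User prefers pytest for unit testing"}]
print(matches_query(cached, "does the user prefer pytest"))   # True
print(matches_query(cached, "database migration schedule"))   # False
```

A false negative here is harmless: the query simply falls through to the next tier. Tune the threshold so false positives (stale hot-cache hits) stay rare.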

Benchmarking Your Retrieval

Always measure. Here's a simple benchmarking pattern:

import random
import time
from statistics import mean, stdev

async def benchmark_retrieval(retriever, queries: list[str], runs: int = 100):
    latencies = []
    for _ in range(runs):
        query = random.choice(queries)
        start = time.perf_counter()
        await retriever.retrieve(query)
        latencies.append((time.perf_counter() - start) * 1000)
    
    latencies.sort()  # sort once, then read percentiles by index
    print(f"mean: {mean(latencies):.1f}ms (stdev {stdev(latencies):.1f}ms)")
    print(f"p50:  {latencies[len(latencies) // 2]:.1f}ms")
    print(f"p95:  {latencies[int(len(latencies) * 0.95)]:.1f}ms")
    print(f"p99:  {latencies[int(len(latencies) * 0.99)]:.1f}ms")

Target p95 under 100ms for a responsive agent experience.

Ready for Fast Memory?

Building fast, reliable memory retrieval is complex—you need to balance indexing overhead, cache invalidation, and query optimization. CodeMem handles all of this out of the box, with sub-50ms p95 latency on our managed infrastructure.

Start building with fast memory retrieval →