Measuring Memory Quality (Python)
Learn how to measure and validate AI memory quality using precision, recall, and relevance scoring metrics in Python.
Why Memory Quality Matters
Your AI agent is only as good as its memory. If the retrieval system returns irrelevant or outdated information, your agent will hallucinate, contradict itself, or frustrate users. But how do you know if your memory system is working well?
The answer lies in systematic measurement. Just as you write tests for your code, you should measure the quality of your memory retrieval. In this tutorial, we'll explore the key metrics for evaluating memory quality and build a practical testing framework in Python.
Core Metrics for Memory Quality
Before diving into code, let's understand the three fundamental metrics that matter most for memory systems.
1. Precision: Are Retrieved Memories Relevant?
Precision measures how many of the retrieved memories are actually relevant to the query. High precision means less noise in your context window.
```python
# Precision = Relevant Retrieved / Total Retrieved
# If you retrieve 10 memories and 7 are relevant, precision = 0.7
def calculate_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Calculate precision for a retrieval result."""
    if not retrieved:
        return 0.0
    relevant_retrieved = sum(1 for m in retrieved if m in relevant)
    return relevant_retrieved / len(retrieved)
```

2. Recall: Did We Find All Relevant Memories?
Recall measures how many of the relevant memories in your store were actually retrieved. High recall means you're not missing important context.
```python
# Recall = Relevant Retrieved / Total Relevant
# If 10 memories are relevant and you retrieved 7 of them, recall = 0.7
def calculate_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Calculate recall for a retrieval result."""
    if not relevant:
        return 0.0
    relevant_retrieved = sum(1 for m in retrieved if m in relevant)
    return relevant_retrieved / len(relevant)
```

3. Relevance Score: How Good Is the Ranking?
Even if you retrieve the right memories, order matters. Relevance scoring evaluates whether the most important memories appear first, maximizing LLM context utilization.
```python
import math

def calculate_dcg(relevance_scores: list[float], k: int = 10) -> float:
    """Discounted Cumulative Gain - rewards relevant items appearing early."""
    dcg = 0.0
    for i, score in enumerate(relevance_scores[:k]):
        dcg += score / math.log2(i + 2)  # i + 2 so the first position divides by log2(2) = 1, not log2(1) = 0
    return dcg

def calculate_ndcg(retrieved_scores: list[float], ideal_scores: list[float], k: int = 10) -> float:
    """Normalized DCG - compares actual ranking to ideal ranking."""
    dcg = calculate_dcg(retrieved_scores, k)
    idcg = calculate_dcg(sorted(ideal_scores, reverse=True), k)
    return dcg / idcg if idcg > 0 else 0.0
```

Building a Memory Quality Test Framework
Now let's build a practical framework for testing memory quality. The key is creating labeled test cases with known ground truth.
```python
from dataclasses import dataclass
from typing import Callable
import math

@dataclass
class MemoryTestCase:
    """A test case for memory retrieval quality."""
    query: str                                        # The retrieval query
    relevant_ids: set[str]                            # IDs of memories that SHOULD be retrieved
    relevance_scores: dict[str, float] | None = None  # Optional: graded relevance (0-1)
```
```python
@dataclass
class QualityReport:
    """Results from running quality tests."""
    precision: float
    recall: float
    f1_score: float
    ndcg: float
    mean_reciprocal_rank: float

    def __str__(self) -> str:
        return (
            f"Precision: {self.precision:.3f} | "
            f"Recall: {self.recall:.3f} | "
            f"F1: {self.f1_score:.3f} | "
            f"NDCG: {self.ndcg:.3f} | "
            f"MRR: {self.mean_reciprocal_rank:.3f}"
        )
```

The Quality Evaluator
Here's the core evaluator class that runs test cases against your memory retrieval function:
```python
class MemoryQualityEvaluator:
    def __init__(self, retrieve_fn: Callable[[str, int], list[str]]):
        """
        Initialize with your retrieval function.
        retrieve_fn(query, k) -> list of memory IDs
        """
        self.retrieve_fn = retrieve_fn

    def evaluate(self, test_cases: list[MemoryTestCase], k: int = 10) -> QualityReport:
        """Run all test cases and aggregate metrics."""
        precisions, recalls, ndcgs, mrrs = [], [], [], []
        for case in test_cases:
            retrieved = self.retrieve_fn(case.query, k)

            # Calculate precision and recall
            precision = calculate_precision(retrieved, case.relevant_ids)
            recall = calculate_recall(retrieved, case.relevant_ids)
            precisions.append(precision)
            recalls.append(recall)

            # Calculate MRR (position of first relevant result)
            mrr = 0.0
            for i, mem_id in enumerate(retrieved):
                if mem_id in case.relevant_ids:
                    mrr = 1.0 / (i + 1)
                    break
            mrrs.append(mrr)

            # Calculate NDCG if we have graded relevance
            if case.relevance_scores:
                retrieved_scores = [case.relevance_scores.get(m, 0.0) for m in retrieved]
                ideal_scores = list(case.relevance_scores.values())
                ndcgs.append(calculate_ndcg(retrieved_scores, ideal_scores, k))

        avg_precision = sum(precisions) / len(precisions)
        avg_recall = sum(recalls) / len(recalls)
        f1 = (
            2 * avg_precision * avg_recall / (avg_precision + avg_recall)
            if (avg_precision + avg_recall) > 0
            else 0.0
        )
        return QualityReport(
            precision=avg_precision,
            recall=avg_recall,
            f1_score=f1,
            ndcg=sum(ndcgs) / len(ndcgs) if ndcgs else 0.0,
            mean_reciprocal_rank=sum(mrrs) / len(mrrs),
        )
```

Creating Effective Test Cases
The quality of your evaluation depends on good test cases. Here's how to create them systematically:
```python
# Example test cases for a coding assistant memory system
test_cases = [
    MemoryTestCase(
        query="user's preferred indentation style",
        relevant_ids={"mem_001", "mem_042"},  # Known preference memories
        relevance_scores={"mem_001": 1.0, "mem_042": 0.8, "mem_103": 0.2},
    ),
    MemoryTestCase(
        query="why we chose PostgreSQL",
        relevant_ids={"mem_015", "mem_016", "mem_017"},  # Decision memories
        relevance_scores={"mem_015": 1.0, "mem_016": 0.9, "mem_017": 0.7},
    ),
    MemoryTestCase(
        query="current refactoring task",
        relevant_ids={"mem_200"},  # Context memory
    ),
]

# Run evaluation
evaluator = MemoryQualityEvaluator(my_memory_store.search)
report = evaluator.evaluate(test_cases, k=5)
print(report)
# Precision: 0.867 | Recall: 0.778 | F1: 0.820 | NDCG: 0.912 | MRR: 0.944
```

Setting Quality Thresholds
What constitutes "good" memory quality? Here are recommended thresholds based on production systems:
- Precision ≥ 0.80: At least 80% of retrieved memories should be relevant
- Recall ≥ 0.70: Find at least 70% of relevant memories
- NDCG ≥ 0.85: Most relevant memories should rank in top positions
- MRR ≥ 0.90: First relevant result should typically be in top 2 positions
```python
def assert_quality_thresholds(report: QualityReport):
    """Use in CI/CD to catch quality regressions."""
    assert report.precision >= 0.80, f"Precision {report.precision:.3f} below 0.80"
    assert report.recall >= 0.70, f"Recall {report.recall:.3f} below 0.70"
    assert report.ndcg >= 0.85, f"NDCG {report.ndcg:.3f} below 0.85"
    assert report.mean_reciprocal_rank >= 0.90, (
        f"MRR {report.mean_reciprocal_rank:.3f} below 0.90"
    )
    print("✓ All quality thresholds passed")
```

Continuous Quality Monitoring
Integrate quality checks into your development workflow:
- Pre-commit: Run quality tests before merging retrieval changes
- Nightly: Evaluate against larger test sets to catch drift
- Production: Sample real queries and manually label relevance periodically
- A/B testing: Compare retrieval strategies with statistical significance
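The pre-commit step above can be sketched as a small pytest-style quality gate. Everything here is illustrative: `stub_retrieve` stands in for your real search function, and the memory IDs are hypothetical placeholders; in an actual suite you would import your store and the evaluator instead of redefining the metric helpers inline.

```python
# A minimal CI quality gate, assuming a stubbed retriever. In a real suite,
# replace stub_retrieve with your memory store's search function.

def calculate_precision(retrieved: list[str], relevant: set[str]) -> float:
    if not retrieved:
        return 0.0
    return sum(1 for m in retrieved if m in relevant) / len(retrieved)

def calculate_recall(retrieved: list[str], relevant: set[str]) -> float:
    if not relevant:
        return 0.0
    return sum(1 for m in retrieved if m in relevant) / len(relevant)

def stub_retrieve(query: str, k: int) -> list[str]:
    # Hypothetical stand-in for my_memory_store.search
    return ["mem_001", "mem_042", "mem_103"][:k]

def test_retrieval_quality():
    relevant = {"mem_001", "mem_042", "mem_103", "mem_200"}
    retrieved = stub_retrieve("user's preferred indentation style", k=5)
    precision = calculate_precision(retrieved, relevant)  # 3/3 = 1.0
    recall = calculate_recall(retrieved, relevant)        # 3/4 = 0.75
    assert precision >= 0.80, f"Precision regressed: {precision:.3f}"
    assert recall >= 0.70, f"Recall regressed: {recall:.3f}"

test_retrieval_quality()  # pytest would discover and run this automatically
```

Because the test fails loudly with the offending metric in the message, a retrieval change that drops below the thresholds blocks the merge rather than silently degrading the agent.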
Start Measuring Today
Memory quality isn't a "set and forget" concern. As your agent's memory grows, retrieval quality can degrade without proper monitoring. By implementing these metrics early, you'll catch regressions before they impact users.
CodeMem provides built-in quality analytics so you can monitor precision, recall, and relevance scores across your memory stores without building this infrastructure yourself.