
Measuring Memory Quality (Python)

Learn how to measure and validate AI memory quality using precision, recall, and relevance scoring metrics in Python.

CodeMem Team

Why Memory Quality Matters

Your AI agent is only as good as its memory. If the retrieval system returns irrelevant or outdated information, your agent will hallucinate, contradict itself, or frustrate users. But how do you know if your memory system is working well?

The answer lies in systematic measurement. Just as you write tests for your code, you should measure the quality of your memory retrieval. In this tutorial, we'll explore the key metrics for evaluating memory quality and build a practical testing framework in Python.

Core Metrics for Memory Quality

Before diving into code, let's understand the three fundamental metrics that matter most for memory systems.

1. Precision: Are Retrieved Memories Relevant?

Precision measures how many of the retrieved memories are actually relevant to the query. High precision means less noise in your context window.

# Precision = Relevant Retrieved / Total Retrieved
# If you retrieve 10 memories and 7 are relevant, precision = 0.7

def calculate_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Calculate precision for a retrieval result."""
    if not retrieved:
        return 0.0
    relevant_retrieved = sum(1 for m in retrieved if m in relevant)
    return relevant_retrieved / len(retrieved)

2. Recall: Did We Find All Relevant Memories?

Recall measures how many of the relevant memories in your store were actually retrieved. High recall means you're not missing important context.

# Recall = Relevant Retrieved / Total Relevant
# If 10 memories are relevant and you retrieved 7 of them, recall = 0.7

def calculate_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Calculate recall for a retrieval result."""
    if not relevant:
        return 0.0
    relevant_retrieved = sum(1 for m in retrieved if m in relevant)
    return relevant_retrieved / len(relevant)
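Precision and recall pull in opposite directions: retrieving more memories tends to raise recall while diluting precision. A quick sanity check on a toy example (the two functions are repeated here so the snippet runs standalone):

```python
def calculate_precision(retrieved: list[str], relevant: set[str]) -> float:
    if not retrieved:
        return 0.0
    return sum(1 for m in retrieved if m in relevant) / len(retrieved)

def calculate_recall(retrieved: list[str], relevant: set[str]) -> float:
    if not relevant:
        return 0.0
    return sum(1 for m in retrieved if m in relevant) / len(relevant)

retrieved = ["m1", "m2", "m3", "m4"]  # what the store returned, in rank order
relevant = {"m1", "m3", "m5"}         # ground-truth labels for this query

print(calculate_precision(retrieved, relevant))  # 2 relevant of 4 retrieved -> 0.5
print(calculate_recall(retrieved, relevant))     # 2 of 3 relevant found -> ~0.667
```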

3. Relevance Score: How Good Is the Ranking?

Even if you retrieve the right memories, order matters. Relevance scoring evaluates whether the most important memories appear first, maximizing LLM context utilization.

import math

def calculate_dcg(relevance_scores: list[float], k: int = 10) -> float:
    """Discounted Cumulative Gain - rewards relevant items appearing early."""
    dcg = 0.0
    for i, score in enumerate(relevance_scores[:k]):
        dcg += score / math.log2(i + 2)  # +2 because log2(1) = 0 would zero out the first item
    return dcg

def calculate_ndcg(retrieved_scores: list[float], ideal_scores: list[float], k: int = 10) -> float:
    """Normalized DCG - compares actual ranking to ideal ranking."""
    dcg = calculate_dcg(retrieved_scores, k)
    idcg = calculate_dcg(sorted(ideal_scores, reverse=True), k)
    return dcg / idcg if idcg > 0 else 0.0
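To see why ranking matters, feed the same set of graded scores to NDCG in ideal order and in reversed order; only the ordering changes, yet the score drops:

```python
import math

def calculate_dcg(relevance_scores: list[float], k: int = 10) -> float:
    return sum(s / math.log2(i + 2) for i, s in enumerate(relevance_scores[:k]))

def calculate_ndcg(retrieved_scores: list[float], ideal_scores: list[float], k: int = 10) -> float:
    idcg = calculate_dcg(sorted(ideal_scores, reverse=True), k)
    return calculate_dcg(retrieved_scores, k) / idcg if idcg > 0 else 0.0

scores = [1.0, 0.8, 0.2]
print(calculate_ndcg([1.0, 0.8, 0.2], scores))  # best item first -> 1.0
print(calculate_ndcg([0.2, 0.8, 1.0], scores))  # best item last -> lower score
```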

Building a Memory Quality Test Framework

Now let's build a practical framework for testing memory quality. The key is creating labeled test cases with known ground truth.

from dataclasses import dataclass
from typing import Callable
import math

@dataclass
class MemoryTestCase:
    """A test case for memory retrieval quality."""
    query: str                        # The retrieval query
    relevant_ids: set[str]            # IDs of memories that SHOULD be retrieved
    relevance_scores: dict[str, float] | None = None  # Optional: graded relevance (0-1)
    
@dataclass  
class QualityReport:
    """Results from running quality tests."""
    precision: float
    recall: float
    f1_score: float
    ndcg: float
    mean_reciprocal_rank: float
    
    def __str__(self) -> str:
        return (
            f"Precision: {self.precision:.3f} | "
            f"Recall: {self.recall:.3f} | "
            f"F1: {self.f1_score:.3f} | "
            f"NDCG: {self.ndcg:.3f} | "
            f"MRR: {self.mean_reciprocal_rank:.3f}"
        )

The Quality Evaluator

Here's the core evaluator class that runs test cases against your memory retrieval function:

class MemoryQualityEvaluator:
    def __init__(self, retrieve_fn: Callable[[str, int], list[str]]):
        """
        Initialize with your retrieval function.
        retrieve_fn(query, k) -> list of memory IDs
        """
        self.retrieve_fn = retrieve_fn
    
    def evaluate(self, test_cases: list[MemoryTestCase], k: int = 10) -> QualityReport:
        """Run all test cases and aggregate metrics."""
        precisions, recalls, ndcgs, mrrs = [], [], [], []
        
        for case in test_cases:
            retrieved = self.retrieve_fn(case.query, k)
            
            # Calculate precision and recall
            precision = calculate_precision(retrieved, case.relevant_ids)
            recall = calculate_recall(retrieved, case.relevant_ids)
            precisions.append(precision)
            recalls.append(recall)
            
            # Calculate MRR (position of first relevant result)
            mrr = 0.0
            for i, mem_id in enumerate(retrieved):
                if mem_id in case.relevant_ids:
                    mrr = 1.0 / (i + 1)
                    break
            mrrs.append(mrr)
            
            # Calculate NDCG if we have graded relevance
            if case.relevance_scores:
                retrieved_scores = [case.relevance_scores.get(m, 0) for m in retrieved]
                ideal_scores = list(case.relevance_scores.values())
                ndcgs.append(calculate_ndcg(retrieved_scores, ideal_scores, k))
        
        avg_precision = sum(precisions) / len(precisions)
        avg_recall = sum(recalls) / len(recalls)
        f1 = 2 * avg_precision * avg_recall / (avg_precision + avg_recall) if (avg_precision + avg_recall) > 0 else 0
        
        return QualityReport(
            precision=avg_precision,
            recall=avg_recall,
            f1_score=f1,
            ndcg=sum(ndcgs) / len(ndcgs) if ndcgs else 0.0,
            mean_reciprocal_rank=sum(mrrs) / len(mrrs)
        )

Creating Effective Test Cases

The quality of your evaluation depends on good test cases. Here's how to create them systematically:

# Example test cases for a coding assistant memory system
test_cases = [
    MemoryTestCase(
        query="user's preferred indentation style",
        relevant_ids={"mem_001", "mem_042"},  # Known preference memories
        relevance_scores={"mem_001": 1.0, "mem_042": 0.8, "mem_103": 0.2}
    ),
    MemoryTestCase(
        query="why we chose PostgreSQL",
        relevant_ids={"mem_015", "mem_016", "mem_017"},  # Decision memories
        relevance_scores={"mem_015": 1.0, "mem_016": 0.9, "mem_017": 0.7}
    ),
    MemoryTestCase(
        query="current refactoring task",
        relevant_ids={"mem_200"},  # Context memory
    ),
]

# Run evaluation
evaluator = MemoryQualityEvaluator(my_memory_store.search)
report = evaluator.evaluate(test_cases, k=5)
print(report)
# Precision: 0.867 | Recall: 0.778 | F1: 0.820 | NDCG: 0.912 | MRR: 0.944
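The `my_memory_store.search` above stands in for your real retrieval backend; any callable with the signature `retrieve_fn(query, k) -> list[str]` works. For local experiments, a hypothetical in-memory stand-in based on naive keyword overlap is enough to exercise the evaluator:

```python
# Hypothetical in-memory store; IDs and texts are illustrative only.
MEMORIES = {
    "mem_001": "user prefers 4-space indentation in Python",
    "mem_042": "user asked to keep tabs out of the codebase",
    "mem_103": "user likes dark editor themes",
}

def toy_search(query: str, k: int) -> list[str]:
    """Rank memory IDs by naive keyword overlap between query and memory text."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(text.lower().split())), mem_id)
        for mem_id, text in MEMORIES.items()
    ]
    scored.sort(reverse=True)  # highest overlap first
    return [mem_id for _, mem_id in scored[:k]]

print(toy_search("preferred indentation style", 2))  # mem_001 ranks first
```

Swapping `toy_search` in for `my_memory_store.search` lets you run the whole pipeline before wiring up a real vector store.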

Setting Quality Thresholds

What constitutes "good" memory quality? Here are recommended thresholds based on production systems:

  • Precision ≥ 0.80: At least 80% of retrieved memories should be relevant
  • Recall ≥ 0.70: Find at least 70% of relevant memories
  • NDCG ≥ 0.85: Most relevant memories should rank in top positions
  • MRR ≥ 0.90: First relevant result should typically be in top 2 positions
These thresholds translate directly into assertions you can run in CI:

def assert_quality_thresholds(report: QualityReport):
    """Use in CI/CD to catch quality regressions."""
    assert report.precision >= 0.80, f"Precision {report.precision} below 0.80"
    assert report.recall >= 0.70, f"Recall {report.recall} below 0.70"
    assert report.ndcg >= 0.85, f"NDCG {report.ndcg} below 0.85"
    assert report.mean_reciprocal_rank >= 0.90, f"MRR {report.mean_reciprocal_rank} below 0.90"
    print("✓ All quality thresholds passed")
    print("✓ All quality thresholds passed")

Continuous Quality Monitoring

Integrate quality checks into your development workflow:

  • Pre-commit: Run quality tests before merging retrieval changes
  • Nightly: Evaluate against larger test sets to catch drift
  • Production: Sample real queries and manually label relevance periodically
  • A/B testing: Compare retrieval strategies with statistical significance
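For the A/B comparison in particular, a paired bootstrap over per-query scores is a simple way to gauge whether one strategy is reliably better. A minimal sketch, assuming you have per-query precision lists from the evaluator for each strategy (the numbers below are illustrative):

```python
import random

def bootstrap_win_rate(scores_a: list[float], scores_b: list[float],
                       n_resamples: int = 10_000, seed: int = 0) -> float:
    """Fraction of paired bootstrap resamples in which strategy B beats A."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample queries with replacement
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return wins / n_resamples

# Per-query precision for two retrieval strategies (illustrative numbers)
a = [0.6, 0.7, 0.5, 0.8, 0.6, 0.7]
b = [0.8, 0.9, 0.7, 0.8, 0.9, 0.8]
print(bootstrap_win_rate(a, b))  # close to 1.0 -> B wins in nearly every resample
```

A win rate near 0.5 means the two strategies are statistically indistinguishable on your test set; values near 1.0 (or 0.0) suggest a real difference.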

Start Measuring Today

Memory quality isn't a "set and forget" concern. As your agent's memory grows, retrieval quality can degrade without proper monitoring. By implementing these metrics early, you'll catch regressions before they impact users.

CodeMem provides built-in quality analytics so you can monitor precision, recall, and relevance scores across your memory stores without building this infrastructure yourself.

Start measuring your memory quality with CodeMem →