
The "Memory Budget" (JavaScript)

Learn how to optimize context and costs by managing your AI memory budget. Practical JavaScript strategies for token-efficient memory retrieval.

CodeMem Team

The Hidden Cost of Context

Every token you send to an LLM costs money. A typical Claude API call with 10,000 tokens of context runs about $0.03. That sounds cheap, until you're in production: at just 100 calls per hour, around the clock, that's $72/day on context tokens alone.

The problem? Most of that context is redundant. You're sending the same project conventions, the same user preferences, the same architecture decisions—over and over. This is where memory budgets become essential.

What is a Memory Budget?

A memory budget is the maximum number of tokens you allocate to retrieved memories per request. Instead of dumping everything into context, you retrieve only the most relevant memories within a strict token limit.

// Without budget: retrieve everything
const memories = await searchMemory({ query: "user preferences" });
// → 15,000 tokens of memories

// With budget: retrieve only what fits
const memories = await searchMemory({ 
  query: "user preferences",
  maxTokens: 2000 
});
// → 2,000 tokens of the MOST relevant memories

The key insight: semantic search already ranks memories by relevance. A budget simply cuts off at a token threshold, keeping the highest-value context while discarding the noise.

Implementing Token Budgets in JavaScript

Here's a practical implementation using the CodeMem SDK:

import { CodeMem } from '@codemem/sdk';
import { get_encoding } from 'tiktoken';

const mem = new CodeMem({ apiKey: process.env.CODEMEM_KEY });
// tiktoken ships OpenAI encodings only; cl100k_base is a close-enough
// approximation for budgeting, since Claude's tokenizer is not public.
const enc = get_encoding('cl100k_base');

async function getMemoriesWithBudget(query, budgetTokens = 2000) {
  // Fetch more than we need, then trim
  const results = await mem.search({ query, limit: 20 });
  
  const selected = [];
  let usedTokens = 0;

  for (const memory of results) {
    const memoryTokens = enc.encode(memory.content).length;
    
    if (usedTokens + memoryTokens <= budgetTokens) {
      selected.push(memory);
      usedTokens += memoryTokens;
    } else {
      break; // Budget exhausted
    }
  }

  return { memories: selected, tokensUsed: usedTokens };
}
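One refinement worth considering: the loop above stops at the first memory that overflows the budget, even if a smaller, still-relevant memory further down would fit. A first-fit variant keeps scanning. This is a sketch, not part of the SDK; `countTokens` stands in for whatever tokenizer you use (e.g. `(s) => enc.encode(s).length`):

```javascript
// First-fit selection: skip oversized memories instead of stopping at them.
// countTokens is a stand-in for any tokenizer function.
function selectWithinBudget(memories, budgetTokens, countTokens) {
  const selected = [];
  let usedTokens = 0;

  for (const memory of memories) {
    const memoryTokens = countTokens(memory.content);
    if (usedTokens + memoryTokens <= budgetTokens) {
      selected.push(memory);
      usedTokens += memoryTokens;
    }
    // No break: a later, smaller memory may still fit.
  }

  return { memories: selected, tokensUsed: usedTokens };
}
```

The tradeoff: first-fit packs the budget more tightly, but it can favor short memories over a single long, highly relevant one, so it is worth A/B testing against the simple cutoff.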

Dynamic Budget Allocation

Not all requests are equal. A simple question needs minimal context. A complex refactoring task needs more. Smart budgeting adapts:

function calculateBudget(taskComplexity, baseContext) {
  const MAX_CONTEXT = 200000; // Claude 3.5 Sonnet's context window
  const BASE_BUDGET = 2000;
  
  // Reserve space for: system prompt + user input + response
  const reserved = baseContext + 4000; // ~4k for response
  const available = MAX_CONTEXT - reserved;
  
  // Clamp complexity to the 1-5 scale, then scale the budget
  const multiplier = Math.min(Math.max(taskComplexity, 1), 5);
  const budget = Math.min(BASE_BUDGET * multiplier, available * 0.3);
  
  return Math.floor(budget);
}

// Usage
const budget = calculateBudget(3, userMessageTokens);
const { memories } = await getMemoriesWithBudget(query, budget);

This ensures you never exceed context limits while maximizing memory utility for complex tasks.

Memory Prioritization Strategies

Beyond semantic relevance, consider these prioritization factors:

1. Recency Weighting

function scoreWithRecency(memory, semanticScore) {
  const daysSinceUpdate = (Date.now() - memory.updatedAt) / 86400000;
  const recencyBoost = Math.exp(-daysSinceUpdate / 30); // Decay over 30 days
  return semanticScore * (0.7 + 0.3 * recencyBoost);
}

2. Access Frequency

// Memories accessed often are likely important
function scoreWithFrequency(memory, semanticScore) {
  const frequencyBoost = Math.log10(memory.accessCount + 1) / 3;
  return semanticScore * (1 + frequencyBoost);
}

3. Tag-Based Boosting

// Boost memories with critical tags
const PRIORITY_TAGS = ['critical', 'architecture', 'security'];

function scoreWithTags(memory, semanticScore) {
  const hasPriorityTag = memory.tags.some(t => PRIORITY_TAGS.includes(t));
  return hasPriorityTag ? semanticScore * 1.5 : semanticScore;
}
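These three factors compose naturally. Here is one way to combine them (a sketch, assuming each memory carries the `semanticScore`, `updatedAt`, `accessCount`, and `tags` fields used in the snippets above) that multiplies the adjustments together and re-ranks before the budget cutoff:

```javascript
const PRIORITY_TAGS = ['critical', 'architecture', 'security'];

// Combine semantic score with recency, frequency, and tag boosts,
// then sort descending so the budget cutoff keeps the best memories.
function rankMemories(memories, now = Date.now()) {
  return memories
    .map((memory) => {
      const daysSinceUpdate = (now - memory.updatedAt) / 86400000;
      const recency = 0.7 + 0.3 * Math.exp(-daysSinceUpdate / 30);
      const frequency = 1 + Math.log10(memory.accessCount + 1) / 3;
      const tagBoost = memory.tags.some((t) => PRIORITY_TAGS.includes(t)) ? 1.5 : 1;
      return { ...memory, finalScore: memory.semanticScore * recency * frequency * tagBoost };
    })
    .sort((a, b) => b.finalScore - a.finalScore);
}
```

Multiplicative composition means a memory must score reasonably on every axis to rank highly; if you prefer that one strong factor can carry a memory, sum weighted boosts instead.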

Real Cost Savings

Let's do the math. Assume you're running a coding assistant with:

  • 500 API calls per day
  • Without budgets: 8,000 tokens of memory per call
  • With 2,000 token budget: 2,000 tokens per call

At $3 per million input tokens (Claude Sonnet):

Without budget: 500 × 8,000 × $0.000003 = $12/day
With budget: 500 × 2,000 × $0.000003 = $3/day
Monthly savings: $270

That's 75% reduction in memory-related costs—with minimal impact on quality since you're keeping the most relevant memories.
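The arithmetic above generalizes to a function you can wire into your own monitoring (a sketch; the $3-per-million default is Claude Sonnet's published input price at the time of writing):

```javascript
// Daily memory-token cost in dollars.
function dailyCost(callsPerDay, tokensPerCall, dollarsPerMillionTokens = 3) {
  return (callsPerDay * tokensPerCall * dollarsPerMillionTokens) / 1e6;
}

const without = dailyCost(500, 8000);            // 12
const withBudget = dailyCost(500, 2000);         // 3
const monthlySavings = (without - withBudget) * 30; // 270
```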

Best Practices

  • Start small: Begin with a 2,000 token budget and increase only if needed
  • Monitor quality: Track response quality vs. budget size to find your sweet spot
  • Compress memories: Store concise, information-dense memories
  • Use tags wisely: Good tagging improves retrieval, reducing needed budget
  • Cache frequent queries: Same query patterns can reuse memory sets
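The last practice, caching, can be as simple as a `Map` keyed on the query string. A minimal sketch with a time-to-live; `fetchFn` stands in for whatever retrieval call you use (e.g. `getMemoriesWithBudget`):

```javascript
// Minimal TTL cache for memory-retrieval results.
function createMemoryCache(ttlMs = 5 * 60 * 1000) {
  const cache = new Map();

  return async function cachedSearch(query, fetchFn) {
    const hit = cache.get(query);
    if (hit && Date.now() - hit.at < ttlMs) return hit.value;

    const value = await fetchFn(query);
    cache.set(query, { at: Date.now(), value });
    return value;
  };
}
```

A short TTL matters here: memories change, and a stale cache quietly undoes the relevance ranking you paid for.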

The Budget-Quality Tradeoff

There's always a tradeoff. More context generally means better responses. The art is finding where returns diminish. In our testing:

  • 0-1,000 tokens: Minimal context, good for simple queries
  • 1,000-3,000 tokens: Sweet spot for most coding tasks
  • 3,000-5,000 tokens: Complex architecture or multi-file changes
  • 5,000+ tokens: Rarely needed, often wasteful

Pro tip: Log your budget usage alongside response quality scores. After a week of data, you'll know exactly where your optimal budget sits. Most teams find they were over-retrieving by 60-70%.
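A minimal version of that logging might look like this (a sketch; `qualityScore` is whatever response-quality metric you already track):

```javascript
// Append one record per request; analyze after a week.
const budgetLog = [];

function logBudgetUsage({ budget, tokensUsed, qualityScore }) {
  budgetLog.push({ at: new Date().toISOString(), budget, tokensUsed, qualityScore });
}

// Average quality per 1,000-token budget bucket.
function qualityByBudget(log) {
  const buckets = {};
  for (const { budget, qualityScore } of log) {
    const key = Math.floor(budget / 1000) * 1000;
    (buckets[key] ??= []).push(qualityScore);
  }
  return Object.fromEntries(
    Object.entries(buckets).map(([k, v]) => [k, v.reduce((a, b) => a + b, 0) / v.length])
  );
}
```

Plotting average quality per bucket makes the diminishing-returns point visible at a glance.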

Start Optimizing Today

Memory budgets are one of the highest-impact optimizations you can make. They reduce costs, improve latency (fewer tokens = faster processing), and force you to write better, more focused memories.

CodeMem's search_memory tool already returns results ranked by relevance. Adding a token budget on top is straightforward—and the ROI is immediate.

Ready to optimize your AI costs?

Start with CodeMem's free tier—1,000 memories, semantic search, and the tools to implement smart budgeting from day one.

Get Started Free →