Benchmarks

Four established memory benchmarks and a stress test suite.

LoCoMo (Long Conversation Memory)

Tests memory accuracy over extended multi-turn conversations. Questions categorized by type: single-hop, multi-hop, open-ended, temporal, and adversarial.

Results: 95% overall accuracy vs. 33% for full-history baselines. The largest gains are on temporal and multi-hop questions, where raw history dumps bury the relevant facts in noise.
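Since per-type accuracy drives the headline numbers, a minimal sketch of how graded answers could be bucketed by question type (the `results` dict shape is an assumption for illustration, not the runner's actual output format):

```python
from collections import defaultdict

def accuracy_by_type(results):
    """Aggregate per-category accuracy from graded QA results.

    Each result is assumed to look like
    {"category": "multi-hop", "correct": True} -- an illustrative
    shape, not the real runner schema.
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for r in results:
        bucket = totals[r["category"]]
        bucket[0] += int(r["correct"])
        bucket[1] += 1
    return {cat: c / n for cat, (c, n) in totals.items()}

sample = [
    {"category": "single-hop", "correct": True},
    {"category": "multi-hop", "correct": True},
    {"category": "multi-hop", "correct": False},
    {"category": "temporal", "correct": True},
]
print(accuracy_by_type(sample))
```

Reporting per category rather than overall is what exposes the temporal and multi-hop gains.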

LongMemEval

Evaluates long-term memory fidelity after compaction. Runs a conversation through hundreds of turns, triggering multiple compaction events, then queries for facts stated early in the conversation. Tests the full compaction → storage → retrieval → assembly pipeline.
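The probing step at the end of that pipeline can be sketched as a simple recall score: after the long run, ask about each early fact and check whether the answer still contains it. The `ask` callable and fact shape below are hypothetical stand-ins, not the real harness API:

```python
def probe_early_facts(ask, facts):
    """Score recall of facts stated early in the run: 1.0 means
    every probe answer still contains the expected string."""
    hits = sum(
        fact["expected"].lower() in ask(fact["question"]).lower()
        for fact in facts
    )
    return hits / len(facts)

# Toy stand-in for the real pipeline entry point.
memory = {"Where does Ada work?": "Ada works at Acme Corp."}
facts = [{"question": "Where does Ada work?", "expected": "Acme"}]
score = probe_early_facts(lambda q: memory.get(q, ""), facts)
print(score)  # 1.0
```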

MRCR (Multi-Round Conversational Retrieval)

Tests retrieval precision across topic switches. Measures whether retrieval surfaces the right segments without cross-contamination. This is where the context bleed gate and active tag skipping are tested.
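One way to quantify cross-contamination is the fraction of retrieved segments tagged with a topic other than the query's. This is a sketch of such a metric; the segment shape (`id`/`topic` dicts) is an assumption for illustration, not the real schema:

```python
def bleed_rate(segments, query_topic):
    """Fraction of retrieved segments tagged with a topic other
    than the query's, a rough proxy for context bleed."""
    if not segments:
        return 0.0
    off_topic = sum(seg["topic"] != query_topic for seg in segments)
    return off_topic / len(segments)

retrieved = [
    {"id": "s1", "topic": "billing"},
    {"id": "s2", "topic": "billing"},
    {"id": "s3", "topic": "travel"},  # bled in from another topic
]
print(bleed_rate(retrieved, "billing"))
```

A working context bleed gate should hold this rate near zero even under rapid topic switches.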

AMB (Agent Memory Benchmark)

Tests memory in agentic contexts with tool_use/tool_result pairs, chain collapses, and interleaved planning discussions. Verifies that chain collapse preserves recoverable information, that fact extraction captures decisions made during tool use, and that retrieval handles mixed content types.
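To make "chain collapse preserves recoverable information" concrete, here is a minimal sketch of a collapse that keeps the final tool result and interleaved decisions. The turn shape and the `assistant_note` type are illustrative assumptions, not the real message schema:

```python
def collapse_chain(turns):
    """Collapse a tool_use/tool_result chain into a compact record,
    keeping only the final result and interleaved decisions."""
    final_result = next(
        (t["content"] for t in reversed(turns) if t["type"] == "tool_result"),
        None,
    )
    decisions = [t["content"] for t in turns if t["type"] == "assistant_note"]
    return {"final_result": final_result, "decisions": decisions}

chain = [
    {"type": "tool_use", "content": "grep('timeout', 'config/')"},
    {"type": "tool_result", "content": "config/http.yaml:12 timeout: 30"},
    {"type": "assistant_note", "content": "Decision: raise timeout to 60"},
    {"type": "tool_use", "content": "edit('config/http.yaml')"},
    {"type": "tool_result", "content": "timeout set to 60"},
]
print(collapse_chain(chain))
```

The benchmark then asks questions answerable only from the collapsed record (the final value, the decision taken) to confirm nothing essential was dropped.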

Stress Tests

  • Topic cycling: rapid switches between 10+ topics, verifying retrieval stability
  • Compaction cascade: 200+ turns forcing multiple compaction events, checking for content loss
  • Tag explosion: conversations generating 100+ unique tags, testing index performance
  • Concurrent access: multiple simultaneous requests against the same session
  • Large payloads: messages with images, code blocks, and tool results exceeding 50K tokens
  • Contradiction storms: sequences of contradictory facts, testing supersession
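The concurrent-access category has the simplest shape of the group: many threads hammering one session and collecting any exceptions. A minimal sketch, where `session_call` is a hypothetical stand-in for a real client bound to a single session:

```python
import threading

def hammer(session_call, n_threads=8, n_requests=32):
    """Fire concurrent requests at one session and return any
    exceptions raised by the workers."""
    errors = []
    lock = threading.Lock()

    def worker():
        for _ in range(n_requests):
            try:
                session_call()
            except Exception as exc:  # record, do not crash the test
                with lock:
                    errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors

# A well-behaved session should produce zero errors under load.
print(len(hammer(lambda: None)))  # 0
```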

Running Benchmarks

# Run a specific benchmark
python -m benchmarks.locomo.runner --config virtual-context.yaml

# Run with a specific provider
python -m benchmarks.locomo.runner --provider anthropic --model claude-sonnet-4-20250514

# Stress tests via proxy dashboard Replay panel
virtual-context proxy --upstream https://api.anthropic.com
# Open http://localhost:8100/dashboard

Interpreting Results

  • Accuracy by question type is the primary metric. Overall accuracy can mask weaknesses.
  • Tokens freed measures compaction efficiency. Higher is better, but not at the cost of accuracy.
  • Retrieval precision measures what fraction of retrieved segments were relevant.
  • Compression ratio is summary tokens / original tokens. Typical: 0.15-0.25 (4x-7x compression).
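The compression ratio arithmetic is worth spelling out, since the ratio and the "Nx compression" figure are reciprocals. A short worked example (the token counts are made up):

```python
def compression_ratio(summary_tokens, original_tokens):
    """Summary tokens divided by original tokens; lower means
    more aggressive compaction."""
    return summary_tokens / original_tokens

ratio = compression_ratio(2_000, 10_000)
print(ratio, 1 / ratio)  # 0.2 5.0  (a 0.2 ratio is 5x compression)
```

So the typical 0.15 to 0.25 range corresponds to roughly 6.7x down to 4x compression.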