Benchmarks
Four established memory benchmarks and a stress test suite.
LoCoMo (Long Conversation Memory)
Tests memory accuracy over extended multi-turn conversations. Questions are categorized by type: single-hop, multi-hop, open-ended, temporal, and adversarial.
Results: 95% overall accuracy vs. 33% for full-history baselines. The largest gains are on temporal and multi-hop questions, where raw history dumps bury the relevant facts in noise.
LongMemEval
Evaluates long-term memory fidelity after compaction. Runs a conversation through hundreds of turns triggering multiple compaction events, then queries for facts stated early. Tests the full compaction → storage → retrieval → assembly pipeline.
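As a toy illustration of that pipeline (all names here are hypothetical stand-ins, not the actual harness or memory implementation), the core check is: state a fact early, push hundreds of turns through enough volume to force compaction, then verify the early fact is still retrievable.

```python
# Toy sketch of a LongMemEval-style check (hypothetical names, not the real
# harness): run many turns, compact periodically, then query an early fact.

class ToyMemory:
    """Minimal stand-in for the compaction -> storage -> retrieval pipeline."""
    def __init__(self, window=20):
        self.window = window
        self.live = []      # uncompacted recent turns
        self.store = []     # "compacted" turns, kept verbatim in this toy

    def add_turn(self, text):
        self.live.append(text)
        if len(self.live) > self.window:           # compaction event fires
            self.store.extend(self.live[: self.window // 2])
            self.live = self.live[self.window // 2 :]

    def retrieve(self, query):
        # Search both the store and the live window.
        return [t for t in self.store + self.live if query in t]

mem = ToyMemory()
mem.add_turn("user: my project codename is Bluebird")
for i in range(300):                               # hundreds of filler turns
    mem.add_turn(f"turn {i}: unrelated chatter")

print(mem.retrieve("Bluebird"))                    # early fact must survive
```

The real pipeline summarizes instead of storing turns verbatim, which is exactly why the benchmark queries for early facts: lossy compaction can silently drop them.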
MRCR (Multi-Round Conversational Retrieval)
Tests retrieval precision across topic switches. Measures whether retrieval surfaces the right segments without cross-contamination. This is where the context bleed gate and active tag skipping are tested.
AMB (Agent Memory Benchmark)
Tests memory in agentic contexts with tool_use/tool_result pairs, chain collapses, and interleaved planning discussions. Verifies that chain collapse preserves recoverable information, that fact extraction captures decisions made during tool use, and that retrieval handles mixed content types.
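A minimal sketch of what chain collapse means here (message shapes and field names are hypothetical, not AMB's actual format): a tool_use/tool_result pair is replaced by one compact record that keeps the recoverable essentials.

```python
# Toy chain collapse: replace each tool_use/tool_result pair with a single
# compact record preserving tool name, input, and a truncated result.
# (Hypothetical message shapes -- the real benchmark format will differ.)

def collapse_chain(messages, max_result_len=80):
    collapsed, i = [], 0
    while i < len(messages):
        m = messages[i]
        nxt = messages[i + 1] if i + 1 < len(messages) else None
        if m["type"] == "tool_use" and nxt and nxt["type"] == "tool_result":
            collapsed.append({
                "type": "collapsed_chain",
                "tool": m["name"],
                "input": m["input"],
                "result": nxt["content"][:max_result_len],  # truncate payload
            })
            i += 2          # consume the pair
        else:
            collapsed.append(m)
            i += 1
    return collapsed

msgs = [
    {"type": "text", "content": "Plan: look up the config."},
    {"type": "tool_use", "name": "read_file", "input": "app.yaml"},
    {"type": "tool_result", "content": "port: 8100\n" + "x" * 500},
]
out = collapse_chain(msgs)
print(len(out), out[1]["tool"])
```

The benchmark then asks questions whose answers live only in the collapsed records, testing whether "recoverable" actually holds.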
Stress Tests
| Category | What It Tests |
|---|---|
| Topic cycling | Rapid switches between 10+ topics, verifying retrieval stability |
| Compaction cascade | 200+ turns forcing multiple compaction events, checking for content loss |
| Tag explosion | Conversations generating 100+ unique tags, testing index performance |
| Concurrent access | Multiple simultaneous requests against the same session |
| Large payloads | Messages with images, code blocks, and tool results exceeding 50K tokens |
| Contradiction storms | Sequences of contradictory facts, testing supersession |
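To make the last row concrete, here is a toy version of the supersession property a contradiction storm exercises (illustrative logic only; the real fact store is more involved): later statements about the same subject must replace earlier ones rather than coexist.

```python
# Toy supersession check for a "contradiction storm": the newest fact about a
# subject wins. (Illustrative only -- not the actual fact-store logic.)

def apply_facts(facts):
    store = {}                      # subject -> latest asserted value
    for subject, value in facts:
        store[subject] = value      # newest statement supersedes older ones
    return store

storm = [
    ("favorite_color", "blue"),
    ("favorite_color", "green"),
    ("home_city", "Austin"),
    ("favorite_color", "red"),      # final value after the storm
]
print(apply_facts(storm))
```

The stress test feeds long sequences like `storm` and then queries each subject, failing if any stale value is retrieved.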
Running Benchmarks
```bash
# Run a specific benchmark
python -m benchmarks.locomo.runner --config virtual-context.yaml

# Run with a specific provider
python -m benchmarks.locomo.runner --provider anthropic --model claude-sonnet-4-20250514

# Stress tests via proxy dashboard Replay panel
virtual-context proxy --upstream https://api.anthropic.com
# Open http://localhost:8100/dashboard
```

Interpreting Results
- Accuracy by question type is the primary metric. Overall accuracy can mask weaknesses.
- Tokens freed measures compaction efficiency. Higher is better, but not at the cost of accuracy.
- Retrieval precision measures what fraction of retrieved segments were relevant.
- Compression ratio is summary tokens / original tokens. Typical: 0.15-0.25 (4x-7x compression).
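The two ratio metrics follow directly from their definitions above; as a quick reference (illustrative helpers, not part of the benchmark code):

```python
# Illustrative metric helpers matching the definitions above.

def retrieval_precision(retrieved, relevant):
    """Fraction of retrieved segment ids that were actually relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def compression_ratio(summary_tokens, original_tokens):
    """Summary tokens / original tokens; ~0.15-0.25 is typical here."""
    return summary_tokens / original_tokens

# 2 of the 4 retrieved segments were relevant -> precision 0.5
print(retrieval_precision(["s1", "s2", "s3", "s4"], ["s1", "s2", "s5"]))
# 180-token summary of a 900-token span -> ratio 0.2 (5x compression)
print(compression_ratio(180, 900))
```

Note that precision is computed over what was retrieved, so it says nothing about relevant segments that were missed; read it alongside per-type accuracy.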