Virtual Context was evaluated against established long-context benchmarks using ICLR 2025 datasets. This page presents the methodology, results, and analysis of the system’s performance on LongMemEval, along with its token efficiency relative to full-context baselines.
The headline results are 95% overall accuracy on LongMemEval, 100% accuracy on knowledge-update questions that require tracking changing facts across long conversations, and 2.2x fewer tokens per request compared to full-context baselines using Claude Sonnet. These results demonstrate that structured context management can maintain or exceed the recall quality of sending the full raw transcript while significantly reducing cost and latency.
For details on how the engine achieves these results through hierarchical compression and retrieval, see the engine internals. For the complete methodology and analysis, read the research paper.
Benchmarks
Four established memory benchmarks and a stress test suite.
LoCoMo (Long Conversation Memory)
Tests memory accuracy over extended multi-turn conversations. Questions categorized by type: single-hop, multi-hop, open-ended, temporal, and adversarial.
Results: 95% overall accuracy vs. 33% for full-history baselines. The largest gains are on temporal and multi-hop questions, where raw history dumps bury the relevant facts in noise.
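Because questions are grouped by type, a per-type breakdown is usually more informative than one overall number. The following is a minimal sketch of that bookkeeping, assuming graded answers are available as `(question_type, is_correct)` pairs. The function names and the sample data are illustrative assumptions, not part of the benchmark runner and not benchmark results.

```python
from collections import defaultdict


def accuracy_by_type(results):
    """Return {question_type: accuracy} for (question_type, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for qtype, ok in results:
        totals[qtype] += 1
        correct[qtype] += int(ok)
    return {qtype: correct[qtype] / totals[qtype] for qtype in totals}


def overall_accuracy(results):
    """Return the fraction of correct answers across all question types."""
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    return sum(int(ok) for _, ok in results) / len(results)


# Hypothetical graded answers, for illustration only (not benchmark data).
sample = [
    ("single-hop", True), ("single-hop", True), ("single-hop", True),
    ("temporal", True), ("temporal", False),
    ("multi-hop", False),
]

per_type = accuracy_by_type(sample)
overall = overall_accuracy(sample)

for qtype, acc in sorted(per_type.items()):
    print(f"{qtype:12s} {acc:.2f}")
print(f"{'overall':12s} {overall:.2f}")
```

In this made-up sample the overall figure looks moderate while the multi-hop category fails entirely, which is the kind of weakness a single aggregate score hides.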
LongMemEval
Evaluates long-term memory fidelity after compaction. Runs a conversation through hundreds of turns triggering multiple compaction events, then queries for facts stated early. Tests the full compaction → storage → retrieval → assembly pipeline.
MRCR (Multi-Round Conversational Retrieval)
Tests retrieval precision across topic switches. Measures whether retrieval surfaces the right segments without cross-contamination. This is where the context bleed gate and active tag skipping are tested.
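One simple way to quantify cross-contamination is the share of retrieved segments that belong to a topic other than the one being asked about. The sketch below assumes each retrieved segment carries a single topic label. The function name and the labels are hypothetical, and the benchmark's actual scoring may differ.

```python
def contamination_rate(query_topic, retrieved_topics):
    """Fraction of retrieved segments whose topic differs from the query topic.

    Returns 0.0 when nothing was retrieved, since there is nothing to contaminate.
    """
    retrieved_topics = list(retrieved_topics)
    if not retrieved_topics:
        return 0.0
    off_topic = sum(1 for topic in retrieved_topics if topic != query_topic)
    return off_topic / len(retrieved_topics)


# Hypothetical retrieval result after a switch from "travel" to "billing".
rate = contamination_rate("billing", ["billing", "billing", "travel", "billing"])
print(f"contamination rate: {rate:.2f}")
```

A rate of zero means every retrieved segment matched the active topic. Higher values indicate that segments from earlier topics leaked into the assembled context.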
AMB (Agent Memory Benchmark)
Tests memory in agentic contexts with tool_use/tool_result pairs, chain collapses, and interleaved planning discussions. Tests whether chain collapse preserves recoverable information, fact extraction captures decisions during tool use, and retrieval handles mixed content types.
Stress Tests
| Category | What It Tests |
|---|---|
| Topic cycling | Rapid switches between 10+ topics, verifying retrieval stability |
| Compaction cascade | 200+ turns forcing multiple compaction events, checking for content loss |
| Tag explosion | Conversations generating 100+ unique tags, testing index performance |
| Concurrent access | Multiple simultaneous requests against the same session |
| Large payloads | Messages with images, code blocks, and tool results exceeding 50K tokens |
| Contradiction storms | Sequences of contradictory facts, testing supersession |
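To make the contradiction-storm category concrete, the sketch below shows what supersession means in a toy setting: when the same fact is restated with a different value, the most recent statement wins and older values are kept only as history. `ToyFactStore` and its methods are hypothetical names invented for this illustration. They are not the engine's API, and the real engine's supersession logic is not described on this page.

```python
class ToyFactStore:
    """Toy fact store in which the statement from the latest turn supersedes older ones."""

    def __init__(self):
        self._facts = {}       # key -> (turn, value) currently believed
        self._superseded = {}  # key -> list of values that were overridden

    def record(self, key, value, turn):
        current = self._facts.get(key)
        if current is None or turn >= current[0]:
            # Newer statement: demote the old value (if any) and store the new one.
            if current is not None:
                self._superseded.setdefault(key, []).append(current[1])
            self._facts[key] = (turn, value)
        else:
            # Statement older than what is already known: keep it as history only.
            self._superseded.setdefault(key, []).append(value)

    def current(self, key):
        entry = self._facts.get(key)
        return entry[1] if entry is not None else None

    def history(self, key):
        return list(self._superseded.get(key, []))


# A hypothetical "storm": the same fact is restated four times.
storm = [
    ("home_city", "Lisbon", 1),
    ("home_city", "Berlin", 2),
    ("home_city", "Lisbon", 3),
    ("home_city", "Oslo", 4),
]

store = ToyFactStore()
for key, value, turn in storm:
    store.record(key, value, turn)

print(store.current("home_city"))
print(store.history("home_city"))
```

A stress test built on this idea would assert that a query after the storm returns only the final value, and that none of the overridden values is presented as current.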
Running Benchmarks
# Run a specific benchmark
python -m benchmarks.locomo.runner --config virtual-context.yaml
# Run with a specific provider
python -m benchmarks.locomo.runner --provider anthropic --model claude-sonnet-4-20250514
# Stress tests via proxy dashboard Replay panel
virtual-context proxy --upstream https://api.anthropic.com
# Open http://localhost:8100/dashboard
Interpreting Results
- Accuracy by question type is the primary metric. Overall accuracy can mask weaknesses.
- Tokens freed measures compaction efficiency. Higher is better, but not at the cost of accuracy.
- Retrieval precision measures what fraction of retrieved segments were relevant.
- Compression ratio is summary tokens / original tokens, so lower values mean stronger compression. Typical: 0.15-0.25 (roughly 4x-7x compression).
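The ratio-style metrics above reduce to simple arithmetic. The following sketch works through them under the definitions given in the list. The function names and the token counts are illustrative assumptions, not output from the benchmark runner.

```python
def compression_ratio(summary_tokens, original_tokens):
    """Summary tokens divided by original tokens (lower means more compression)."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return summary_tokens / original_tokens


def tokens_freed(summary_tokens, original_tokens):
    """Number of tokens removed from the context by compaction."""
    return original_tokens - summary_tokens


def retrieval_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved segments that are in the relevant set."""
    retrieved = list(retrieved_ids)
    if not retrieved:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for seg in retrieved if seg in relevant) / len(retrieved)


# Hypothetical numbers: a 10,000-token span compacted into a 2,000-token summary.
ratio = compression_ratio(2_000, 10_000)
factor = 1 / ratio
freed = tokens_freed(2_000, 10_000)

# Hypothetical retrieval: four segments returned, three of them relevant.
precision = retrieval_precision(["s1", "s2", "s3", "s4"], {"s1", "s3", "s4"})

print(f"compression ratio: {ratio:.2f} ({factor:.1f}x)")
print(f"tokens freed: {freed}")
print(f"retrieval precision: {precision:.2f}")
```

The compression factor is the reciprocal of the ratio, which is why a ratio of 0.25 corresponds to 4x and a ratio of 0.15 to a little under 7x.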