Virtual Context was evaluated against established long-context benchmarks using ICLR 2025 datasets. This page presents the methodology, results, and analysis of the system’s accuracy on LongMemEval and its token efficiency relative to full-context baselines.

The headline results are 95% overall accuracy on LongMemEval, 100% accuracy on knowledge-update questions that require tracking changing facts across long conversations, and 2.2x fewer tokens per request compared to full-context baselines using Claude Sonnet. These results demonstrate that structured context management can maintain or exceed the recall quality of sending the full raw transcript while significantly reducing cost and latency.

For details on how the engine achieves these results through hierarchical compression and retrieval, see the engine internals. For the complete methodology and analysis, read the research paper.

Benchmarks

Name: Virtual Context benchmark results
Creator: Virtual Context
License: https://github.com/virtual-context/virtual-context/blob/main/LICENSE

The evaluation covers four established memory benchmarks and a stress test suite.

LoCoMo (Long Conversation Memory)

Tests memory accuracy over extended multi-turn conversations. Questions are categorized by type: single-hop, multi-hop, open-ended, temporal, and adversarial.

Results: 95% overall accuracy vs. 33% for full-history baselines. The largest gains are on temporal and multi-hop questions, where raw history dumps bury the relevant facts in noise.

LongMemEval

Evaluates long-term memory fidelity after compaction. Runs a conversation through hundreds of turns, triggering multiple compaction events, then queries for facts stated early. Tests the full compaction → storage → retrieval → assembly pipeline.
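
As an illustration of the "fact stated early, queried late" setup described above, the following is a minimal sketch of how such a probe conversation could be constructed. It is not the LongMemEval dataset format or the project's benchmark runner; `build_probe_conversation` and its parameters are hypothetical names chosen for this example.

```python
def build_probe_conversation(fact, total_turns, fact_turn=3):
    """Build a synthetic conversation with one fact planted at an early turn.

    Hypothetical helper for illustration only. Turns are 1-indexed;
    odd turns are user messages and even turns are assistant messages.
    """
    if not 1 <= fact_turn <= total_turns:
        raise ValueError("fact_turn must fall within the conversation")
    turns = []
    for i in range(1, total_turns + 1):
        if i == fact_turn:
            content = fact
        else:
            # Filler pads the history so that compaction would be triggered
            # well before the probe question is asked.
            content = f"Filler message {i} about an unrelated topic."
        role = "user" if i % 2 == 1 else "assistant"
        turns.append({"role": role, "content": content})
    return turns


conversation = build_probe_conversation("My dog is named Rex.", total_turns=300)
probe_question = "What is my dog's name?"
```

The probe question would then be sent after the final turn, and the answer checked against the planted fact.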

MRCR (Multi-Round Co-reference Resolution)

Tests retrieval precision across topic switches. Measures whether retrieval surfaces the right segments without cross-contamination. This is where the context bleed gate and active tag skipping are tested.

AMB (Agent Memory Benchmark)

Tests memory in agentic contexts with tool_use/tool_result pairs, chain collapses, and interleaved planning discussions. Tests whether chain collapse preserves recoverable information, fact extraction captures decisions during tool use, and retrieval handles mixed content types.

Stress Tests

Category	What It Tests
Topic cycling	Rapid switches between 10+ topics, verifying retrieval stability
Compaction cascade	200+ turns forcing multiple compaction events, checking for content loss
Tag explosion	Conversations generating 100+ unique tags, testing index performance
Concurrent access	Multiple simultaneous requests against the same session
Large payloads	Messages with images, code blocks, and tool results exceeding 50K tokens
Contradiction storms	Sequences of contradictory facts, testing supersession
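
To make the contradiction-storm category concrete, here is a minimal sketch of a test oracle that assumes the simplest supersession rule: the most recently stated value for a fact wins. This is an assumption for illustration, not the engine's actual supersession logic, and both function names are hypothetical.

```python
def make_contradiction_storm(key, values):
    """Return (turn, key, value) facts in which each value contradicts the last."""
    return [(turn, key, value) for turn, value in enumerate(values, start=1)]


def resolve_supersession(facts):
    """Return the expected current value per key, assuming latest-turn-wins.

    Hypothetical reference oracle: a benchmark could compare the system's
    answer against this expected state after the storm.
    """
    current = {}
    for turn, key, value in sorted(facts, key=lambda fact: fact[0]):
        current[key] = value  # a later turn overwrites an earlier one
    return current


storm = make_contradiction_storm("home_city", ["Paris", "Berlin", "Lisbon"])
expected = resolve_supersession(storm)
```

Under this rule, a question about `home_city` asked after the storm should be answered with the last stated value only.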

Running Benchmarks

# Run a specific benchmark
python -m benchmarks.locomo.runner --config virtual-context.yaml

# Run with a specific provider
python -m benchmarks.locomo.runner --provider anthropic --model claude-sonnet-4-20250514

# Stress tests via proxy dashboard Replay panel
virtual-context proxy --upstream https://api.anthropic.com
# Open http://localhost:8100/dashboard

Interpreting Results

Accuracy by question type is the primary metric. Overall accuracy can mask weaknesses.
Tokens freed measures compaction efficiency. Higher is better, but not at the cost of accuracy.
Retrieval precision measures what fraction of retrieved segments were relevant.
Compression ratio is summary tokens / original tokens; lower means more aggressive compression. Typical values: 0.15-0.25 (roughly 4x-7x compression).
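
The metrics above can be computed from raw benchmark output along the following lines. This is a sketch based on the definitions on this page, not the project's reporting code; the function names and input shapes are assumptions.

```python
from collections import defaultdict


def accuracy_by_type(results):
    """results: iterable of (question_type, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for question_type, is_correct in results:
        totals[question_type] += 1
        correct[question_type] += int(is_correct)
    return {qt: correct[qt] / totals[qt] for qt in totals}


def retrieval_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved segments that were relevant."""
    retrieved = list(retrieved_ids)
    if not retrieved:
        return 0.0  # convention assumed here: no retrievals scores zero
    relevant = set(relevant_ids)
    return sum(1 for seg in retrieved if seg in relevant) / len(retrieved)


def tokens_freed(original_tokens, summary_tokens):
    """Tokens removed from the context by compaction."""
    return original_tokens - summary_tokens


def compression_ratio(summary_tokens, original_tokens):
    """Summary tokens divided by original tokens (lower is more compressed)."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return summary_tokens / original_tokens


per_type = accuracy_by_type(
    [("temporal", True), ("temporal", False), ("multi-hop", True)]
)
ratio = compression_ratio(summary_tokens=200, original_tokens=1000)
```

Reporting `per_type` alongside the overall figure keeps a weak category, such as temporal questions in this toy input, from being hidden by a strong average.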