100M Context Window. Virtualized.
virtual-context compresses, organizes, and retrieves — so your model reasons sharper, costs less, and never forgets.
virtual-context lives within the conversation
Other memory systems are external to the conversation. They store facts in a database and retrieve them at query time, hoping the right thing surfaces. virtual-context is different. It sits inside your conversation flow, sees every turn, manages the context window in real time, and gives the model tools to navigate its own memory. Nothing changes in your code except a base URL.
Swap your base URL. virtual-context handles the rest.
Every turn is tagged, compressed, and indexed. The model gets retrieval tools to navigate its own memory within a managed token budget.
expand_topic, collapse_topic, find_quote, query_facts, remember_when, recall_all. The model navigates its own memory: drilling into topics, searching for specific quotes, querying structured facts. Up to 10 tool rounds run transparently within a single user-visible response.
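As an illustration, the transparent tool-round loop can be sketched in Python. This is a stub, not the real implementation: the dispatch logic and 10-round cap come from this page, while the tool body and model behavior are hypothetical stand-ins.

```python
# Minimal sketch of the transparent tool-round loop described above.
# The retrieval tool is a stub; the real one queries virtual-context's
# index. The 10-round cap matches the page's claim.

MAX_TOOL_ROUNDS = 10

def expand_topic(topic):
    # stub: the real tool returns the stored full text for a topic
    return f"[full text of topic '{topic}']"

TOOLS = {"expand_topic": expand_topic}

def fake_model(messages):
    # Stand-in for the LLM: asks for one expansion, then answers.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "expand_topic", "args": {"topic": "auth-design"}}
    return {"content": "final answer using expanded context"}

def run_turn(messages):
    for _ in range(MAX_TOOL_ROUNDS):
        reply = fake_model(messages)
        if "tool" not in reply:
            return reply["content"]            # user-visible response
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
    return "tool budget exhausted"
```

The point is that all tool rounds happen inside `run_turn`: the caller sees one request in, one response out.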
One line. Any provider.
# Add this alias to your shell profile:
alias claude-vc='ANTHROPIC_BASE_URL="https://anthropic.virtual-context.com/?vckey=vc-YOUR_KEY" claude'
# Then launch Claude Code with virtual context:
claude-vc
One alias. Infinite memory for every Claude Code session.
// ~/.openclaw/openclaw.json
// In models.providers, change the baseUrl for your provider:
"anthropic-apikey": {
"baseUrl": "https://anthropic.virtual-context.com/?vckey=vc-YOUR_KEY",
"api": "anthropic-messages",
"apiKey": "sk-ant-...", // your normal Anthropic key
"models": [...] // keep your existing models
}
// For OpenAI models, use path-based vckey with /v1 at the end:
"openai": {
"baseUrl": "https://openai.virtual-context.com/vc-YOUR_KEY/v1",
"api": "openai-responses",
"apiKey": "sk-...",
"models": [...]
}
// OpenClaw appends /chat/completions or /responses depending on the api setting
Works with Anthropic, OpenAI, and all supported providers.
pip install "virtual-context[all]"
virtual-context onboard --wizard
virtual-context proxy \
--upstream https://api.anthropic.com
Local Ollama for tagging. SQLite storage. Zero external dependencies. AGPL-3.0.
curl "https://anthropic.virtual-context.com/v1/messages?vckey=vc-YOUR_KEY" \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
-d '{"model":"claude-sonnet-4-20250514","max_tokens":1024,
"messages":[{"role":"user","content":"Hello"}]}'
Raw HTTP. Works with any language or tool.
HTTP Bridge
Sits between any LLM client and upstream provider. Auto-detects Anthropic, OpenAI, Gemini formats. Zero client changes.
MCP Server
9 tools for Claude Desktop, Cursor, or any MCP client. recall_context, find_quote, query_facts, and more.
Python SDK
Two calls: on_message_inbound() before the LLM, on_turn_complete() after. Plus ingest_document().
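The two-call flow could look roughly like this. The hook names come from this page; everything else is illustrative, with a stub in place of the real engine ("compression" here is just a sliding window, where the real engine summarizes and indexes):

```python
# Sketch of the two-call SDK flow: on_message_inbound() before the LLM,
# on_turn_complete() after. StubVC stands in for the real engine.

class StubVC:
    def __init__(self, window=4):
        self.window = window
        self.index = []                      # stands in for the topic index

    def on_message_inbound(self, history):
        return history[-self.window:]        # curated window, not raw history

    def on_turn_complete(self, user_msg, reply):
        self.index.append((user_msg["content"], reply["content"]))

def chat_turn(vc, llm, history, user_text):
    history.append({"role": "user", "content": user_text})
    managed = vc.on_message_inbound(history)   # 1) before the LLM
    reply = llm(managed)
    history.append(reply)
    vc.on_turn_complete(history[-2], reply)    # 2) after the turn
    return reply
```

Your application loop stays the same; only the managed-prompt assembly and post-turn indexing are added around the LLM call.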
Cloud
Managed infrastructure at *.virtual-context.com. Seven provider subdomains. Same API as self-hosted.
How it compares
| | KB Retrieval | RAG | Context Reduction | Virtual Context |
|---|---|---|---|---|
| What is stored | Isolated facts (“likes pizza”) | Document chunks | Compressed history blob | Layered: summaries + original text + structured facts (nothing is discarded) |
| Context management | None (active session grows unchecked) | Append chunks, never free space | Compress to fit, can’t undo | Automatic compression keeps context trim and relevant. Model can also expand or collapse topics on demand |
| Recall precision | Re-search vector DB, hope for a match | Depends on chunk boundaries | Lost after summarization | Relevant context surfaces automatically by topic. Full-text search, structured fact lookup, and time-scoped recall available when needed |
| What the model knows about its memory | Nothing (retrieval is external) | Nothing (retrieval is external) | Knows it was summarized, can’t act on it | Sees all available topics, token costs, and depth levels. Can navigate its own memory |
| Cost at scale | Grows with corpus size | Grows with corpus size | Grows with conversation (1M+ tokens) | Configurable ceiling (50K stays flat) |
| Tool-heavy agents | No handling (tool outputs fill context unchecked) | N/A | No handling | Tool outputs automatically intercepted, truncated, and indexed. Full content searchable on demand |
| Best fit | Simple preference lookup | Doc retrieval | Long-chat cost reduction | All of the above, with coherent reasoning at turn 500 |
Answer quality doesn’t degrade
Compression concentrates attention on signal: less noise, better reasoning. The model recalls decisions from turn 12 at turn 480 because the context window is managed, not accumulated.
50K managed window vs 1M raw context
Run a 1M-token model at a 50K managed ceiling. Compression fires early and often, keeping only curated context. You pay for 50K tokens per request, not 1M.
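Back-of-envelope math, at an assumed $3 per million input tokens (substitute your model's actual rate):

```python
# Illustrative per-request input cost at an assumed $3 / 1M input tokens.
PRICE_PER_TOKEN = 3.00 / 1_000_000

raw_cost = 1_000_000 * PRICE_PER_TOKEN      # full 1M-token window resent
managed_cost = 50_000 * PRICE_PER_TOKEN     # 50K managed ceiling

print(f"raw: ${raw_cost:.2f}/request, managed: ${managed_cost:.2f}/request")
```

A 50K ceiling is a flat 5% of the 1M-window input cost, on every request, and it stays flat as the conversation grows.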
Tool results don’t blow up your context
Tool outputs fill context fast. A single code search can return thousands of tokens. VC intercepts tool outputs, truncates what’s shown, indexes the full content for on-demand search. Coding, legal doc review, data analysis: anything with interleaved tool chains.
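The interception pattern can be sketched with stdlib pieces. The preview limit and in-memory store are illustrative; per this page, the real system indexes full content (into SQLite when self-hosted):

```python
# Sketch of tool-output interception: surface a truncated preview into
# the context window, keep the full output in a searchable store.

SHOWN_LIMIT = 200          # chars surfaced into the context window

full_outputs = {}          # stands in for the on-disk index

def intercept(call_id, output):
    full_outputs[call_id] = output                    # index full content
    if len(output) <= SHOWN_LIMIT:
        return output
    return output[:SHOWN_LIMIT] + f"… [truncated; search id={call_id}]"

def search_full(call_id, needle):
    # on-demand lookup into the stored full output
    return needle in full_outputs.get(call_id, "")
```

The model sees the truncated preview inline, and can pull the rest through the retrieval tools only when it actually needs it.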
Supported providers
Structured context beats raw context at every tier.
LongMemEval (100 random questions)
vs 33% full-context baseline using the same mid-tier model. ICLR 2025 dataset.
| Category | VC | Baseline | Delta |
|---|---|---|---|
| Knowledge-update | 100% | 29.4% | +70.6pp |
| Multi-session | 88.5% | 15.4% | +73.1pp |
| Temporal-reasoning | 92.9% | 32.1% | +60.8pp |
| Single-session (user) | 100% | 46.2% | +53.8pp |
| Single-session (assistant) | 100% | 72.7% | +27.3pp |
| Single-session (preference) | 100% | 20.0% | +80.0pp |
Token reduction
52K managed window vs 118K raw context. $0.16/question vs $0.36. Same accuracy, less than half the cost.
Answered in 1-2 tool calls
8 emergent retrieval patterns. The reader model learns to navigate memory efficiently without hand-crafted retrieval strategies.
Common questions about context management and memory for LLMs
Virtual Context is built for teams that need persistent memory, lower token overhead, and better long-session recall without forcing the model to reread the whole transcript every turn.
What makes Virtual Context a context management system instead of just a larger context window?
Virtual Context does not try to push an ever-larger raw transcript into the model. It manages conversation state outside the active prompt, compacts older material by topic, and retrieves the most relevant memory when the model needs it. That makes it a context management system, not just a bigger prompt budget.
How does memory stay available without replaying the full conversation every turn?
The system segments conversation history, maintains summaries and structured facts, and uses retrieval to reassemble only the parts that matter for the next response. Older material stays recoverable, but it does not have to be resent on every request. This keeps memory available while controlling cost and prompt size.
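A toy version of that segment-summarize-retrieve cycle, with keyword overlap standing in for the real topic tagging and structured-fact retrieval:

```python
# Toy sketch of segment -> summarize -> retrieve. The real system tags
# by topic and keeps structured facts; keyword overlap stands in here.

def summarize(turns):
    # stand-in summary: first word of each turn
    return " / ".join(t.split()[0] for t in turns)

class Memory:
    def __init__(self):
        self.segments = []                 # (summary, full_turns)

    def archive(self, turns):
        self.segments.append((summarize(turns), turns))

    def assemble(self, query, budget=2):
        # reassemble only the segments relevant to the next response
        scored = [(sum(w in " ".join(full) for w in query.split()), full)
                  for _, full in self.segments]
        scored.sort(key=lambda s: -s[0])
        return [full for score, full in scored[:budget] if score > 0]
```

Archived segments never have to be resent wholesale; `assemble` pulls back only what the next turn needs, which is what keeps prompt size and cost bounded.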
Does Virtual Context work with existing LLM providers and SDKs?
Yes. The core product sits in front of provider APIs as a proxy, so the usual integration is a base URL change rather than an SDK rewrite. It is built to work with Anthropic, OpenAI, Gemini, Groq, Mistral, Together, and similar providers.
Can I self-host Virtual Context, or is it only a hosted product?
You can do either. The core engine is open source under AGPL-3.0 for self-hosted deployments, and the managed product adds hosted infrastructure, tenant provisioning, billing, and dashboard tooling around the same engine.
When should I use Virtual Context instead of relying on plain long-context prompting?
It is most useful when conversations run long, tools produce large outputs, or accuracy depends on recalling decisions made far earlier in the session. In those cases, sending the full raw history tends to get expensive and noisy. Virtual Context is built to preserve recall while keeping the active prompt curated.
Start free. Ship memory in minutes.
Cloud or self-hosted. Same engine, same API, full control of your provider keys.