How Does an LLM Context Window Actually Work?

Paste a 40-page contract into Claude and ask a question about page 38. The answer comes back clean. Now paste the same document and ask about something you mentioned on page 2 of a long back-and-forth conversation - and the model acts like it never happened. Same model, same document size, completely different result. That gap is the context window at work, and understanding it changes how you build anything on top of an LLM.

What a context window actually contains

The context window is the total amount of text - measured in tokens - that a language model can see and process at one time. Everything the model uses to generate a response must fit inside it: the system prompt, the conversation history, any documents you have injected, the user's message, and the response itself.

The word "window" is apt. Tokens that fall outside the window are invisible to the model, as if they never existed. This is not a storage limit or a memory constraint in the traditional computing sense. The model does not gradually forget things as the conversation grows - it simply cannot see anything beyond the boundary of its current context window.

Tokens are not the same as words. A 200-character JSON object can eat 50-80 tokens because every brace, bracket, colon, and indentation gets its own token. The same 200 characters of natural prose might be 45-55 tokens. This matters enormously if you're piping structured API responses or verbose tool outputs into a prompt - you're burning context faster than the character count suggests.

When the window fills up, the truncation is blunt. The oldest tokens are dropped, no intelligent ranking happens, and no automatic summarization occurs. The truncation is mechanical.

When you hit the context limit, the system quietly truncates - usually dropping older messages or context first. If those dropped items included a key user requirement or the correct retrieved document, the model will hallucinate - not because it is confused, but because the information literally disappeared from its view.

How big are these windows right now

In early 2023, most models operated with 4K-8K token windows. By the end of 2025, leading models routinely support 200K tokens or more, with some reaching 1 million tokens or beyond.

The current frontier: Claude Opus 4.6, Claude Sonnet 4.6, Google's Gemini 3.1 Pro, Gemini 3 Flash, and Meta's Llama 4 Maverick sit at 1 million tokens for complex multimodal tasks, enterprise-grade document analysis, and large-scale codebase comprehension.

OpenAI's GPT-5.4 has a 272K standard context window, expandable to 1M in the API with a 2x pricing surcharge above 272K.

To make those numbers concrete: a 128K context window holds approximately 96,000 words - equivalent to a full novel.

A 1M+ context window holds roughly 750,000 words - enough for multiple textbooks or an entire medium-sized codebase.

The cost math is real. Most LLM APIs charge per token - separately for input tokens and output tokens. Input tokens are usually cheaper, but they add up quickly when you are injecting large documents or maintaining long conversation histories. A 200,000-token call costs roughly four times as much as a 50,000-token call, all else being equal. This creates a real economic incentive to be thoughtful about what you put in the context window. Some providers compound this with surcharges: GPT-5.4 charges 2x for input tokens beyond 272K; Gemini 3.1 Pro charges 2x beyond 200K tokens.

The advertised limit is not the effective limit

Here's the finding most teams miss when they spec out a model: a large context window doesn't mean reliable reasoning across the whole thing.

Context rot is the degradation in LLM output quality that happens as input context grows longer - more tokens in, worse output out, even when the model's context window isn't close to full. Chroma's research tested 18 frontier models and found that every single one gets worse as input length increases.

The specific failure mode has a name. The "lost-in-the-middle" effect is a well-documented phenomenon where LLMs perform significantly worse when relevant information sits in the middle of their context rather than at the beginning or end. Liu et al. (2024) measured a 30%+ accuracy drop on multi-document question answering when the answer document moved from position 1 to position 10 in a 20-document context.

The cause is architectural. The U-shaped attention curve is caused by positional encoding biases in the transformer architecture. Rotary Position Embedding (RoPE), used in most modern LLMs, introduces a decay effect that makes models attend more strongly to tokens at the beginning and end of sequences. This mirrors the primacy and recency effects observed in human cognition, but the cause is architectural, not cognitive.

There is also attention dilution: transformer attention is quadratic, so 100K tokens means 10 billion pairwise relationships the model has to compute. Every token you add costs the model something in terms of how precisely it can attend to everything else.

The gap between advertised and effective context can be stark. Research from Paulsen (2025) found that a few top models failed with as little as 100 tokens in context, and many showed clear accuracy degradation by 1,000 tokens, far below their advertised limits.

A model claiming 200K tokens typically becomes unreliable around 130K, with sudden performance drops rather than gradual degradation.

What this means when you're building

Three practical rules follow from the architecture.

Put important things at the edges. System instructions and the most critical retrieved documents belong at the very start of the prompt, or pinned near the end. Testing shows early and late context information achieves 85-95% accuracy, while middle sections drop to 76-82%. That's not a rounding error - it's the difference between a useful tool and a liability.

Don't fill the window just because you can. A large context window is not a free pass to stuff in as much information as possible. Quality and placement of context matter as much as quantity. A well-structured 50,000-token prompt will often outperform a carelessly assembled 150,000-token one. In practice, this means being selective about what gets retrieved into a RAG call, trimming system prompts, and summarizing conversation history rather than carrying the full transcript.

Watch for silent drops in agentic workflows. Agentic systems make dozens or hundreds of tool calls per task - searching databases, reading files, executing code, verifying outputs. With a 1M context window, the entire trace stays intact: every tool call, observation, and intermediate reasoning step. That eliminates the compaction and context-clearing that used to cause agents to lose the plot mid-task. But even at 1M tokens, the rot compounds. One research team building a long-running coding agent noted that every additional step adds messages to the context, steadily eroding reasoning quality, and conservatively capped the effective context window at 200K for models that advertise 1M - informed by degradation studies and internal multi-needle retrieval tests.

A teammate like Beagle, living inside Slack where it handles dozens of active threads simultaneously, has to be deliberate about this: the context assembled for each request is scoped tightly to what's actually relevant - not a raw dump of every message the channel has ever seen.

The honest summary: the context window is working memory, not storage. It has a hard ceiling, a soft ceiling you'll hit before the hard one, and a positional bias that punishes everything in the middle. Context window management is not just about getting under the limit. It's about extracting and structuring the right information so that the model performs well. The goal is to include not less - so nothing critical is left out - and not more - so the model doesn't get overwhelmed or distracted. The sweet spot is providing just enough relevant context for the LLM to deliver useful, accurate results.

What a context window actually contains

How big are these windows right now

The advertised limit is not the effective limit

What this means when you're building

Keep reading

The Slack Message That Sent an API Call Nobody Typed

Trim Your MCP Server List Before It Trims Your Agent

Why Cheaper Tokens Made Your AI Bill Worse