June 8, 2026modelsclaude

Claude 4 and the long-context arms race

Anthropic just pushed context windows to a million tokens. We tested whether it's actually useful, or just a benchmark flex.

Anthropic shipped Claude 4 with a one-million-token context window. Google had already done it with Gemini. OpenAI is rumoured to be close.

The honest answer to "does it matter?" is: sometimes.

For pasting an entire codebase and asking architecture questions — yes, it's a step change. For pasting a 400-page legal document and asking targeted questions — yes, retrieval over the full doc is genuinely better than chunked RAG for many tasks.

For most chat-style use cases, you'll never hit even a 200K window, and a smaller, cheaper model is the right call.

The takeaway: context size isn't the headline. Reliability *within* the context — needle-in-haystack accuracy, instruction-following deep into the prompt — is what matters. Claude 4 is genuinely good there. The marketing-number arms race, less so.