Have open-weight AI models actually caught up to GPT-5 / Claude 4 / Gemini 2 in practice?
The benchmark gap closed faster than anyone predicted. DeepSeek R2, the Llama 4 series, and Qwen-3 are within 5-10% of the closed-model leaders on the standard suites — sometimes ahead on math and coding.
But I keep running into a delivery gap that the benchmarks don't capture:
- Tool use is still meaningfully worse on open-weight models. The closed ones have years of RLHF specifically optimizing for "calls the right tool, parses the result correctly, doesn't hallucinate a tool that doesn't exist."
- Long-context retrieval past 200k tokens is wildly inconsistent. Benchmarks at 128k look great, but real workloads at 800k break in ways that don't show up in needle-in-haystack tests.
- Multilingual quality past the top 10 languages drops off a cliff on most open releases.
Question for the room: in production usage — actually shipping product, not running evals — do you see the gap as closed, narrow but real, or still wide?
6 replies
Tool use is the real gap. The benchmark numbers on the open models are honestly comparable now, but the moment you try to wire one into a real agent loop with 8+ tools, the wheels come off. Closed models tolerate slightly malformed tool schemas in ways open models don't.
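For anyone who wants to put a number on this, here's a rough sketch of the kind of check you could run on every tool call in the agent loop: does the output parse, does the tool exist, do the arguments match. The tool names, the `{"name": ..., "arguments": ...}` call shape, and the `TOOLS` registry are all made up for illustration; swap in whatever your stack actually emits.

```python
import json

# Hypothetical tool registry: tool name -> required argument names and types.
TOOLS = {
    "search_docs": {"query": str},
    "read_file": {"path": str},
    "run_tests": {"target": str, "timeout_s": int},
}


def validate_tool_call(raw, tools=TOOLS):
    """Return (ok, reason) for one raw model output expected to be a JSON tool call."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "tool call must be a JSON object"

    # Catches the "hallucinated a tool that doesn't exist" failure.
    name = call.get("name")
    if not isinstance(name, str) or name not in tools:
        return False, f"unknown tool: {name!r}"

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"

    spec = tools[name]
    missing = [key for key in spec if key not in args]
    if missing:
        return False, f"missing arguments: {missing}"
    extra = [key for key in args if key not in spec]
    if extra:
        return False, f"unexpected arguments: {extra}"

    for key, expected in spec.items():
        value = args[key]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        if isinstance(value, bool) and expected is not bool:
            return False, f"argument {key!r} should be {expected.__name__}"
        if not isinstance(value, expected):
            return False, f"argument {key!r} should be {expected.__name__}"
    return True, "ok"


if __name__ == "__main__":
    print(validate_tool_call('{"name": "read_file", "arguments": {"path": "a.py"}}'))
    print(validate_tool_call('{"name": "delete_repo", "arguments": {}}'))
```

Logging the failure reason per model over a few hundred real agent turns gives you a per-category error rate (bad JSON vs. unknown tool vs. bad arguments), which is a lot more useful than an aggregate benchmark score.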
We swapped from GPT-5 to Llama 4 70B for our internal coding agent in March and the productivity drop was real for the first 3 weeks while we rebuilt prompts. After that, ~95% of the original output quality at maybe 15% of the inference cost. Net win for our use case.
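Back-of-envelope version of that trade, in case it helps anyone make the case internally. This assumes both numbers are relative to the GPT-5 setup as a baseline of 1.0 and that "output quality" behaves like a linear score, which is a simplification; it also ignores the one-off cost of the three weeks spent rebuilding prompts.

```python
def quality_per_cost(relative_quality, relative_cost):
    """Quality delivered per unit of inference spend, relative to a baseline of 1.0."""
    return relative_quality / relative_cost


# Rough figures from the post above: ~95% of the quality at ~15% of the cost.
ratio = quality_per_cost(0.95, 0.15)
print(f"{ratio:.1f}x quality per unit of inference cost")  # 6.3x
```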
Closed-model RLHF tax is real. The vendors have spent years tuning for 'parse this messy real-world tool call format', and that tuning is specific to how their users actually invoke tools. On the open-weight side, every fine-tuner has to relearn that empirically.
Multilingual quality is the part the SF bubble keeps underestimating. For the 80% of the world that's not in English-first markets, the gap between closed and open is enormous and getting wider, not narrower.
Long-context performance is a hardware/inference-stack issue more than a model-architecture one. The closed providers have spent enormous money on KV-cache tricks and attention optimizations that don't ship in the open releases.
Honest take: gap closed on demos, narrow on benchmarks, wide on production reliability. Maybe 18 more months before that flips for the median use case.