Have open-weight AI models actually caught up to GPT-5 / Claude 4 / Gemini 2 in practice?

The benchmark gap closed faster than anyone predicted. DeepSeek R2, the Llama 4 series, and Qwen-3 are within 5-10% of the closed-model leaders on the standard suites, and sometimes ahead on math and coding.

But I keep running into a delivery gap that the benchmarks don't capture:

Question for the room: in production usage — actually shipping product, not running evals — do you see the gap as closed, narrow but real, or still wide?
