I’ve been following the development of Large Language Models (LLMs) and I have to say, it feels like we’re stuck in an illusion of progress. Sure, we’ve made advancements, but are we really moving forward? In my opinion, not as much as we think.
First, let’s talk about the language barrier. While we’ve made real progress on bilingual models, multilingual support is still lagging behind. And don’t even get me started on the assistant paradigm. It’s a mess. Every time you want to generate a simple chunk of text, the model tries to make tool calls, and to make matters worse, there’s no standardized chat template or tool-call protocol: every model family wraps the same conversation differently.
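To make the template chaos concrete, here’s a minimal sketch of how the exact same system/user exchange has to be serialized for two different model families. The special tokens follow the ChatML and Llama-2-chat conventions; they’re illustrative, not exhaustive, and the function names are mine:

```python
# Same conversation, two incompatible wire formats. There is no shared standard,
# so every client has to special-case the model family it talks to.

def to_chatml(system: str, user: str) -> str:
    """ChatML-style prompt, as used by several OpenAI-compatible models."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

def to_llama2(system: str, user: str) -> str:
    """Llama-2-chat-style prompt."""
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

if __name__ == "__main__":
    system = "You are a helpful assistant."
    user = "Summarize this paragraph in one sentence."
    print(to_chatml(system, user))
    print(to_llama2(system, user))
```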
But that’s not all. Comparing LLMs is largely an illusion. Even with nominally deterministic settings (temperature 0), the same input can produce noticeably different outputs across runs. And don’t trust the benchmarks too much either. They’re flawed. Take Humanity’s Last Exam (HLE), for instance. It’s supposed to be a rigorous benchmark, but it has a major flaw: the answers provided by LLMs are graded by… another LLM. That introduces bias and makes the results hard to reproduce.
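If you want to see the non-determinism for yourself, here’s a minimal probe, assuming an OpenAI-compatible chat endpoint (the model name and client setup are placeholders for whatever you actually run). It fires the same prompt at temperature 0 several times and counts how many distinct completions come back:

```python
# Reproducibility probe: send one prompt N times at temperature=0 and count
# distinct outputs. More than one distinct completion means the "deterministic"
# settings are not actually deterministic on this endpoint.

from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY; point base_url at a local server if needed
MODEL = "gpt-4o-mini"  # placeholder model name
PROMPT = "Name three prime numbers below 20, comma-separated."

def sample(n: int = 10) -> Counter:
    outputs = Counter()
    for _ in range(n):
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": PROMPT}],
            temperature=0,
        )
        outputs[resp.choices[0].message.content.strip()] += 1
    return outputs

if __name__ == "__main__":
    counts = sample()
    print(f"{len(counts)} distinct completions out of {sum(counts.values())} runs")
    for text, n in counts.most_common():
        print(f"{n}x  {text!r}")
```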
And then there’s the issue of proprietary platforms. I was a heavy user of Gemini-2.5-pro, but then it was removed in favor of a more code/math-oriented model. And now, OpenAI is doing the same thing. They won’t even let users choose between models, and the nomenclature is blurrier than ever.
I’ve come to realize that the most reliable approach is to keep LLMs local and build my own benchmarks. At least that way, I control their configuration, and I can ensure my evaluations are relevant, unbiased, and meaningful for my work.
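For what it’s worth, a personal benchmark doesn’t need to be elaborate. Here’s a minimal sketch, assuming a local llama.cpp or Ollama server exposing an OpenAI-compatible endpoint (the base_url, model name, and toy cases are placeholders). It scores answers with plain substring checks, which keeps the grading deterministic and sidesteps the LLM-as-judge problem mentioned above:

```python
# Tiny local benchmark harness: fixed prompts, deterministic string-based scoring,
# everything under your control. Swap in your own cases and scoring rule.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
MODEL = "local-model"  # placeholder; use whatever name your server reports

# Each case is (prompt, substring the answer must contain).
CASES = [
    ("What is the capital of France? Answer with one word.", "Paris"),
    ("Compute 17 * 3. Answer with the number only.", "51"),
]

def run_benchmark() -> float:
    passed = 0
    for prompt, expected in CASES:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answer = resp.choices[0].message.content
        ok = expected in answer
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {prompt!r} -> {answer.strip()!r}")
    return passed / len(CASES)

if __name__ == "__main__":
    print(f"score: {run_benchmark():.0%}")
```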
So, what do you think? Are we really making progress in the LLM world, or are we just stuck in an illusion?