Every new LLM and every new tweak to an old LLM has a press release bragging about how well it tests on some benchmark you’ve never heard of. Every new model is trained heavily to the previous trend.
I still occasionally think about that bit in the o1 white paper where the OpenAI researchers innocuously pose the question: what if our benchmarks for detecting hallucinations are actually shit? Wouldn't that be something.
Yes, the open-source LLM community has known this forever. Not only are benchmarks heavily gamed, many are error-ridden or otherwise badly constructed (bad or unfair prompt templates, model-specific templates, high sampling randomness, and more).
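To make the template point concrete, here's a minimal sketch (assuming the `transformers` library, with TinyLlama as an arbitrary stand-in model and a made-up question) of how the same multiple-choice item looks as a bare completion prompt versus run through the model's own chat template. A harness that hard-codes one format can quietly penalize models tuned for the other.

```python
# Sketch: the same benchmark question under two prompt formats.
# Assumes `transformers` is installed; the model and question are placeholders.
from transformers import AutoTokenizer

question = (
    "Which planet is closest to the Sun?\n"
    "A. Venus\nB. Mercury\nC. Earth\nD. Mars\n"
    "Answer:"
)

# 1) A generic completion-style template, as many older harnesses hard-coded.
generic_prompt = f"Question: {question}"

# 2) The model's own chat template, pulled from its tokenizer config.
tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
chat_prompt = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

print(generic_prompt)
print("---")
print(chat_prompt)
# The two strings differ in system tokens, turn markers, and where "Answer:"
# lands relative to the generation prompt. Scoring the same model on each can
# shift accuracy by points, which is one reason leaderboard numbers are hard
# to compare across harnesses.
```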
It’s also comical how little money and investment goes into building them, and how many “better” benchmarks end up buried under AI Bro hype.
That being said, some are a good “ballpark figure” for certain types of tasks.