Last week saw the release of ‘Humanity’s Last Exam’, a research project with a dramatic-sounding title. However, once you understand the details of the test, it quickly becomes clear that the name is anything but an exaggeration. This test is the result of a global collaboration among academics to develop questions that today’s AI language models simply cannot answer.
Why such a test? The reasoning is simple: existing benchmarks fall short. We're running out of tests on which language models don't already achieve near-perfect scores. On top of that, it's hard to be certain that these models haven't already encountered such tests during their training. And that's crucial if we want to fairly assess how intelligent current AI systems truly are.
The initiative came from Dan Hendrycks of the Center for AI Safety, in collaboration with Scale AI. Academics were challenged to create questions that were clear, precise, and unique—questions that no state-of-the-art model could answer. This involved models from organisations such as OpenAI, Google, and Anthropic.
The topics varied widely, ranging from mathematics, physics, biology, and chemistry to engineering, medicine, psychology, classical languages, and philosophy. This broad scope was essential to uncover the different strengths and weaknesses of AI systems.
The stakes were high. Approved questions could earn significant financial rewards, with payments reaching thousands of dollars per question. However, the process proved far from easy. Over 70,000 questions were submitted, but only 3,000 met the strict requirements. This demonstrates not only how difficult it is to create truly challenging questions but also how rigorous the selection criteria were. Naturally, the answers couldn’t simply be found on the internet or in textbooks—the questions had to be at the level of a doctoral student or higher.
Many of these questions were so complex that it’s believed only a handful of people in Belgium would know the answers; for some, perhaps no one at all. That shows how deep the test went. Not all questions have been made public; some are intentionally kept under wraps to prevent them from being used as training material. This is crucial for keeping future tests relevant and fair.
Here's an example of one of the questions, kept in its original English to preserve its nuances:
Suppose the following four sentences are true:
- Cats eat mice.
- Dogs are strictly carnivorous.
- Mice are actually vegetables.
- Dogs are actually cats.
How many subsets of these four sentences are inconsistent?
What makes this experiment truly remarkable isn’t just the quality of the questions or the low scores of the models—it’s the process itself. For many academics, it was a sobering experience. Crafting questions that AI couldn’t handle turned out to be extraordinarily challenging.
This calls to mind the Lee Sedol moment. Just as the Go champion in 2016 had to accept that a machine had surpassed him in his own game, we now realise that AI is rapidly catching up with us in more and more domains, even in specific academic disciplines.
It’s a moment of awe and clarity. At a time when some sceptics still dismiss AI systems as little more than hallucinating chatbots, this experiment serves as a reality check—beyond anecdotes and gut feelings.
But it's also a moment of discomfort. This wasn't just a test for machines; it was a mirror held up to ourselves. It confronted us with a fundamental question: how do we deal with technology that surpasses us intellectually?
With our research group, we contributed to this study and submitted questions for the dataset. We, too, found ourselves looking directly into the mirror.
“It’s becoming nearly impossible to ask AI questions it cannot answer”
This experiment, nicknamed “Humanity’s Last Exam,” reveals that we stand on the brink of a new era. While it is a test for AI, it is perhaps even more so an exam for humanity.
It’s not just about determining how to manage this technology, but about understanding what it means to be human in a world where machines surpass us in ever more areas. Are we still the creators, or are we becoming co-players? Can we redefine our intellectual role—not by defeating machines, but by collaborating with them to build something greater than ourselves?
These aren’t questions for some distant future when even more powerful models exist—they were relevant yesterday, and they are pressing today.
It’s exam time—not just for AI, but for us as well.
By Vincent Ginis (Professor of Mathematics, Physics and Artificial Intelligence – Data Lab, VUB and Harvard University), Andres Algaba (FWO Postdoctoral Researcher and Member of the Young Academy – Data Lab, VUB), and Brecht Verbeken (Postdoctoral Researcher – Data Lab, VUB).