Two weeks ago, Nature, one of the world’s most prestigious journals, had this jarring headline:
The article explained this further:
“Artificial intelligence (AI) systems, such as the chatbot ChatGPT, have become so advanced that they now very nearly match or exceed human performance in tasks including reading comprehension, image classification and competition-level mathematics, according to a new report”.1
Nature included this graph to back up their claim:
It’s amazing to see the progress of AI over the last few years, but the claim that “AI now beats humans on basic tasks” has to be taken with a large grain of salt.
The reason is that this claim is based on performance of AI systems on particular benchmarks. The benchmarks are labeled as testing “Image Classification”, “Reading Comprehension”, “Visual Commonsense Reasoning”, “General Language Understanding”, and so on. And AI systems have nearly matched or exceeded humans on all these benchmarks.
But let me repeat an important mantra:
AI surpassing humans on a benchmark that is named after a general ability is not the same as AI surpassing humans on that general ability.
For example, just because a benchmark has “reading comprehension” in its name doesn’t mean that it tests general reading comprehension.
Why is this the case?
There are at least four reasons that AI performance on a benchmark can be misleading.
1. Data contamination: The questions (and answers) from a given benchmark might have been part of the training data for the AI system. OpenAI, for example, has not released any information about the training data of their most advanced models, so we can't check whether this is the case. (A toy sketch of one way to probe for contamination appears just after this list.)
2. Over-reliance on training data: The system might not have been trained on the benchmark itself, but on similar items that require similar patterns of reasoning, and it might be relying on those patterns to solve benchmark items rather than using more general abstract reasoning. Multiple studies have shown that LLMs may rely on such "approximate retrieval" methods to solve problems.
3. Shortcut learning: In some cases, the AI system might be relying on spurious correlations, or "shortcuts," in test items. In one particularly blatant case, a study found that an AI system that successfully classified malignant skin cancers from photographs was using the presence of a ruler in the images as an important cue (photos of nonmalignant tumors were less likely to include rulers). Another study showed that a system attained high performance on a "natural language inference" benchmark by learning that certain keywords were (unintentionally) correlated with correct answers. Many examples of such shortcut learning have been described in the machine learning literature, and it's likely that large language models have an unprecedented ability to discover subtle patterns of association in language that predict correct answers on benchmarks without requiring the kind of abstract reasoning humans are more likely to use. (A toy illustration of such a keyword shortcut also appears after this list.)
4. Test validity: Performance on a benchmark might not correlate with performance in the real world in the way it does for humans.
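To make point 1 a bit more concrete, here is a toy sketch of one way researchers probe for contamination: check whether long word n-grams from a benchmark item appear verbatim in training text. The benchmark item and corpus snippets below are invented for illustration, and real contamination audits operate over far larger corpora with more sophisticated matching; this shows only the basic idea.

```python
# Toy contamination check: do long word n-grams from a benchmark item
# appear verbatim in (a sample of) the training text?
# The "benchmark item" and "web dump" below are made up for illustration.

def ngrams(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_overlap(benchmark_item, training_texts, n=8):
    """Fraction of the item's n-grams that also appear in the training texts."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    train_grams = set()
    for doc in training_texts:
        train_grams |= ngrams(doc, n)
    return len(item_grams & train_grams) / len(item_grams)

benchmark_item = ("Which of the following best describes the main idea of the "
                  "passage about migratory birds and a warming climate?")
web_dump = [
    "Quiz answer key: which of the following best describes the main idea of "
    "the passage about migratory birds and a warming climate? The answer is (B).",
    "An unrelated article about local sports scores and the weekend weather.",
]

print(contamination_overlap(benchmark_item, web_dump))  # 1.0 here: every n-gram reappears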
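And for point 3, here is a toy illustration of a keyword shortcut of the kind found in some natural language inference benchmarks. The four-example mini-dataset is invented; the point is simply that a "classifier" that ignores the premise entirely can still score perfectly when a superficial cue happens to line up with the labels.

```python
# Toy shortcut-learning illustration: if a superficial cue (here, negation in
# the hypothesis) happens to correlate with the label, a "model" can score
# well without doing any actual inference. The mini-dataset is invented.

dataset = [
    # (premise, hypothesis, gold label)
    ("A man is playing a guitar.",   "A man is not playing music.",  "contradiction"),
    ("A dog runs across the field.", "The dog is not moving.",       "contradiction"),
    ("Two kids are at the beach.",   "Children are near the ocean.", "entailment"),
    ("A woman is reading a book.",   "A person is reading.",         "entailment"),
]

def keyword_shortcut(hypothesis):
    # Ignores the premise entirely -- looks only for a negation word.
    return "contradiction" if "not" in hypothesis.lower().split() else "entailment"

correct = sum(keyword_shortcut(hyp) == label for _, hyp, label in dataset)
print(f"Shortcut 'accuracy': {correct}/{len(dataset)}")  # 4/4 on this toy set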
These issues, among others, have led many researchers to share the sentiment that “evaluation for many natural language understanding tasks is broken.”
AI has made astounding progress, but the assumption that the Nature headline makes is not correct. While AI systems beat humans on many benchmarks, we cannot conclude that “AI now beats humans at basic tasks” more generally. Again and again, it’s been shown that AI performance on benchmarks is not necessarily a good predictor for AI performance in the real world.
In previous writing I’ve noted that giving specific benchmarks the names of general abilities ("reading comprehension", "commonsense reasoning", "image classification") is a form of "wishful mnemonic": this is what the dataset creators hope their dataset tests, but that hope does not always translate into reality.
Nature’s headline notes that “new benchmarks are needed.” This is true, but what’s really needed are better scientific methods for evaluation: methods that control for shortcuts, that test robustness to variations in both the form of test items and the underlying concepts being assessed, along with other ways of assessing the mechanisms by which machines are performing tasks. I have written about this, as have cognitive scientists Michael Frank, Anna Ivanova, and Raphaël Millière, among others.
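Here is one small sketch, under heavy simplification, of what "testing robustness over variations" might look like in practice: pose the same underlying question in several surface forms and check whether a system's answers stay consistent. The `ask_model` function below is just a placeholder (a canned stand-in so the example runs); in a real evaluation it would call the system being tested, and the variants would ideally be generated systematically rather than written by hand.

```python
# Sketch of a robustness check: same underlying question, different surface
# forms. A system that relies on shallow cues may answer inconsistently.

def ask_model(question: str) -> str:
    # Placeholder for a real model call (e.g., an API request).
    # Returns a canned answer here so the sketch runs end to end.
    return "Carl"

variants = [
    "Ann is taller than Bob, and Bob is taller than Carl. Who is shortest?",
    "Carl is shorter than Bob, who is shorter than Ann. Name the shortest person.",
    "Of Ann, Bob, and Carl, Ann is the tallest and Carl is shorter than Bob. Who is the shortest?",
]

expected = "Carl"
answers = [ask_model(q).strip() for q in variants]
consistent = all(a.lower() == expected.lower() for a in answers)
print(answers, "consistent and correct" if consistent else "inconsistent or wrong")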
I hope that Nature and other media that report on AI benchmark results will learn the mantra: AI surpassing humans on a benchmark that is named after a general ability is not the same as AI surpassing humans on that general ability.
The report referenced here is the recently released 2024 AI Index, an annual report from Stanford University that provides data assessing advancements in AI.
Superb article. Your book AI for Thinking Humans is the backbone textbook for a philosophy class I teach on AI and Ethics. My gullibility must perpetually be kept in check!
The "competition level mathematics" one is a bit off. I'm guessing that's referring to AlphaGeometry (https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/), which is extremely impressive and cool. But it hooks a language model up to a theorem proving and symbolic algebra engine. Many problems in Euclidean geometry, trigonometry and calculus (as in math Olympiad problems) have mechanistically determinable answers once translated into algebraic formulations. Presumably if competitors got to use Mathematica, their scores would improve on the competitions as well. Still, it is extremely encouraging that language models can be hooked up to existing symbolic math systems like this. It should dramatically expand the capabilities of those systems, making them much more powerful tools.
A better test of "human-level ability" would be the Putnam exam, where getting a nonzero score is beyond the ability of most math-major undergrads, and there is a pretty good correlation between top scorers and a brilliant career in math (e.g., several Putnam Fellows went on to win Fields Medals).