Superb article. Your book AI for Thinking Humans is the backbone textbook for a philosophy class I teach on AI and Ethics. My gullibility must perpetually be kept in check!
The "competition level mathematics" one is a bit off. I'm guessing that's referring to AlphaGeometry (https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/), which is extremely impressive and cool. But it hooks a language model up to a theorem proving and symbolic algebra engine. Many problems in Euclidean geometry, trigonometry and calculus (as in math Olympiad problems) have mechanistically determinable answers once translated into algebraic formulations. Presumably if competitors got to use Mathematica, their scores would improve on the competitions as well. Still, it is extremely encouraging that language models can be hooked up to existing symbolic math systems like this. It should dramatically expand the capabilities of those systems, making them much more powerful tools.
A better test for "human level ability" would be the Putnam exam, where getting a nonzero score is beyond the ability of most math-major undergrads, and there is a pretty good correlation between top scorers and a brilliant career in math (e.g., several Putnam Fellows went on to win Fields Medals).
Melanie, I saw that headline when it came out and rolled my eyes. Yours is the most clear-eyed unpacking of all that is misleading about the headline and the article—and more importantly, why it's misleading. Let's hold the flag of critical thinking high!
Thanks for this! The skin cancer example gives a powerful warning about the sensitivity of classifiers to artifacts in the training data, even in training sets that have been meticulously vetted for bias.
It seems that an important thing missing from evaluations of AI is a theory of intelligence. Without one, we will keep accruing epicycles upon epicycles. This seems to me a crucial point. Imagine doing physics by just gathering data, without a theory to make sense of it.
Maybe another way to say this: gather all the inputs and results from many, many physics experiments, use them to train a really large machine learning model, and publish that you have a model that understands the physical world. Maybe use this model to teach physics!
I agree!
Thanks for referencing my favorite AI paper of all time: Drew's "AI Meets NS" ("Artificial Intelligence Meets Natural Stupidity").
It's sad that Drew, and Marvin, and Roger are all gone: they all understood how amazing human intelligence is and how far we are from that in our programs. One of Roger's last statements was "There's no such thing as AI", which remains spot on: more people ought to be saying that.
Good points here, no doubt.
There is one mistake though: you wrote "And AI systems have exceeded humans on all these benchmarks." This isn't true. We can see on the graph that on Visual Commonsense Reasoning tasks, AI is clearly below the human baseline. This is because commonsense reasoning is particularly tough for machines, and that's why tests like HellaSwag or WinoGrande are easier for humans than for machines.
Thanks, I fixed it!
Really great point as the core message of this article! Overgeneralizing from what are very limited examples/tests may be a form of sensationalizing the content (e.g., a more sophisticated "click-bait" effect).
And don't forget that benchmark numbers are often reported for multi-shot, not zero-shot, settings. See https://ea.rna.nl/2023/12/08/state-of-the-art-gemini-gpt-and-friends-take-a-shot-at-learning/ for this flavor of "lies, big lies, statistics, benchmarks." And note that they're engineering the hell around the limitations; see https://ea.rna.nl/2024/02/07/the-department-of-engineering-the-hell-out-of-ai/
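For anyone unfamiliar with the jargon, here is roughly what the difference amounts to (a purely illustrative sketch, not any particular benchmark harness): in the multi-shot setting, worked examples are packed into the prompt before the actual test item, which alone can move the reported numbers.

# Illustrative only: the same question posed zero-shot vs. few-shot ("multi-shot").
def build_prompt(question, examples=()):
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in examples)
    return f"{shots}Q: {question}\nA:"

zero_shot = build_prompt("What is 17 * 6?")
few_shot = build_prompt("What is 17 * 6?",
                        examples=[("What is 12 * 3?", "36"), ("What is 9 * 7?", "63")])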
Well worth the two minutes it took me to read this!
I bet it wouldn't be hard to come up with special versions of the standardized tests, specifically designed to head-fake LLMs, that would make them fail horribly while humans would still do just fine.
It’s good to remember some caveats like the ones you mentioned when extrapolating benchmark findings into sensationalist claims, fair enough.
That said, this article strikes me as less well done than some of your prior ones. Not just the provocative title. And not just the multiple typos, like “sentence compression” (comprehension), “large language modes” (models), or “that AI that performance”. It feels like the focus here is less on the outcome (AI beats humans on performance benchmarks) and more on the fact that we don’t fully understand the underlying mechanisms (both in humans and in AI), and hence perhaps AI is only “simulating understanding”, not really understanding like humans?
Typos fixed!
Giving the impression of general ability on a particular task is not the same as genuinely performing better than humans on that task under varying conditions in an open environment. The latter requirement makes most of today's systems still useless because, as you also pointed out, they are highly sensitive to small perturbations in their test data sets relative to their training data sets. We are still far from advanced AI systems, and very far from AGI systems.
I would rewrite that Nature title as "AI beats humans in making more stupid mistakes on basic tasks". Not to mention that it can't explain any of it. Dog ate my gradient descent, perhaps?
It was not «one of the world’s most prestigious journals»: it was 'Nature News', the science news outlet of the Springer Nature group, which includes more than 200 journals with the prefix 'Nature'.
(Other than that, 👏👏👏)
I would add one more potential explanation for why models seem to do so well on benchmarks: the benchmarks have very likely been included in the data on which the models were trained.
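A crude sketch of how one could probe for that kind of contamination (purely illustrative, an assumption about methodology rather than a claim about any particular model): count how many word n-grams from a benchmark item already appear verbatim in the training corpus.

# Rough contamination probe: fraction of a benchmark item's word n-grams
# that also occur verbatim in the training corpus.
def ngrams(text, n=8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_item, training_corpus, n=8):
    item_grams = ngrams(benchmark_item, n)
    return len(item_grams & ngrams(training_corpus, n)) / max(len(item_grams), 1)

A high rate does not prove memorization, but it is a red flag that the "test" was effectively seen during training.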