Superb article. Your book "Artificial Intelligence: A Guide for Thinking Humans" is the backbone textbook for a philosophy class I teach on AI and Ethics. My gullibility must perpetually be kept in check!
The "competition level mathematics" one is a bit off. I'm guessing that's referring to AlphaGeometry (https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/), which is extremely impressive and cool. But it hooks a language model up to a theorem proving and symbolic algebra engine. Many problems in Euclidean geometry, trigonometry and calculus (as in math Olympiad problems) have mechanistically determinable answers once translated into algebraic formulations. Presumably if competitors got to use Mathematica, their scores would improve on the competitions as well. Still, it is extremely encouraging that language models can be hooked up to existing symbolic math systems like this. It should dramatically expand the capabilities of those systems, making them much more powerful tools.
A better test of "human-level ability" would be the Putnam exam, where getting a nonzero score is beyond the ability of most math-major undergrads, and there is a pretty good correlation between top scorers and a brilliant career in math (e.g., several Putnam Fellows went on to win Fields Medals).
Melanie, I saw that headline when it came out and rolled my eyes. Yours is the most clear-eyed unpacking of all that is misleading about the headline and the article—and more importantly, why it's misleading. Let's hold the flag of critical thinking high!
Thanks for this! The skin cancer example is a powerful warning about the sensitivity of classifiers to artifacts in the training data, even in training sets that have been meticulously screened for bias.
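Here's a toy sketch of that failure mode with synthetic data and scikit-learn (entirely hypothetical, not the dermatology study's actual setup):

```python
# Hypothetical sketch of shortcut learning: an artifact feature that tracks
# the labels in training (think: rulers photographed next to malignant
# lesions) dominates the classifier, which then collapses without it.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

y_train = rng.integers(0, 2, n)
signal = y_train + rng.normal(0, 2.0, n)   # weak, noisy "real" feature
artifact = y_train.astype(float)           # artifact perfectly tracks labels
X_train = np.column_stack([signal, artifact])
clf = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))  # ~1.0

# Deploy on data where the artifact is absent: accuracy collapses toward
# what the weak real signal alone can support.
y_test = rng.integers(0, 2, n)
X_test = np.column_stack([y_test + rng.normal(0, 2.0, n), np.zeros(n)])
print("test accuracy:", clf.score(X_test, y_test))
```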
It seems that an important thing missing from evaluations of AI is a theory of intelligence. Without it we will be accruing countless epicycles upon epicycles. This seems to me a crucial aspect. Imagine doing physics just gathering data without a theory to make sense of it.
Maybe another way to say this: gather together the inputs and results from many, many physics experiments, use them to train a really large machine-learning model, and publish that you have a model that understands the physical world. Maybe even use this model to teach physics!
I agree!
Thanks for referencing my favorite AI paper of all time: Drew's "Artificial Intelligence Meets Natural Stupidity".
It's sad that Drew, and Marvin, and Roger are all gone: they all understood how amazing human intelligence is and how far we are from that in our programs. One of Roger's last statements was "There's no such thing as AI", which remains spot on: more people ought to be saying that.
Good points here, no doubt.
There is one mistake, though: you wrote "And AI systems have exceeded humans on all these benchmarks." This isn't true: the graph shows that on Visual Commonsense Reasoning tasks, AI is still clearly below the human baseline. Commonsense reasoning is particularly tough for machines, which is why tests like HellaSwag and WinoGrande are easier for humans than for machines.
Thanks, I fixed it!
Really great point, and the core message of this article! Overgeneralizing from very limited examples and tests can be a way of sensationalizing the content (in effect, a more sophisticated form of click-bait).
Dear Professor Mitchell,
If I may: oftentimes Nature's content is beyond the pale, talking in tongues, playing oracle, issuing pronouncements to be believed and demanding faith (https://www.nature.com/articles/529437a). In this context, Nature needs to declare that this lavish praise is not commercial propaganda paid for by Google DeepMind, and that it is indeed an Editorial as labeled (failing to disclose one for-profit, Nature, working for another for-profit, Google DeepMind, along with a misleading label, is potentially a crime, if not a serious violation of ethical norms). On a related crime-watch note, Nature Human Behaviour has been labeling pieces as Correspondence (https://www.nature.com/articles/s41562-023-01716-4), which gives the impression that they are author-initiated (possibly in response to something published). I was shocked to find that this intentionally mislabeled "Correspondence" is an Invited Contribution, which is a different category of article.
Regarding mathematics, I'd be happy to see AI abstract the architecture of mathematics (https://philpapers.org/rec/RAYCAA-2). The universal mapping property definitions of mathematical objects and operations (equivalently, mathematical definitions in terms of 'good for', which can be thought of as a refinement of the method of functional definition) readily lend themselves to statistical abstraction (e.g., https://www.youtube.com/watch?v=A-rfmuduGyY). If anything, I'm surprised that AI hasn't already statistically abstracted the architecture of mathematics. I'll talk to ChatGPT once I finalize my long overdue proposal (Nirakara AND Nirguna: The Holy Grail of Mathematics).
In the meantime, please note that:
Structure : Architecture :: Function : Good for
On a related note, mathematical objects are not structures; but they can be represented/modeled, with respect to a given doctrine (https://zenodo.org/records/7087851), as structures (on an almost structureless background category of sets; https://zenodo.org/records/7087938).
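For concreteness, here is the standard textbook example of a universal mapping property, i.e., a definition of an object in terms of what it is good for: the product A × B is exactly what is good for pairing maps into A and into B.

```latex
% Universal mapping property of the product A x B (with projections
% \pi_A and \pi_B): it is "good for" pairing maps into A and into B.
\[
\forall\, f\colon X \to A,\ g\colon X \to B,\quad
\exists!\, \langle f, g \rangle\colon X \to A \times B
\ \text{ such that }\
\pi_A \circ \langle f, g \rangle = f
\ \text{ and }\
\pi_B \circ \langle f, g \rangle = g.
\]
```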
Thanking you, yours truly, posina
And don't forget that benchmark numbers are often reported for multi-shot, not zero-shot, evaluation. See https://ea.rna.nl/2023/12/08/state-of-the-art-gemini-gpt-and-friends-take-a-shot-at-learning/ for this element of "lies, big lies, statistics, benchmarks". And they're engineering the hell around the limitations; see https://ea.rna.nl/2024/02/07/the-department-of-engineering-the-hell-out-of-ai/
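For anyone unfamiliar with the jargon, "n-shot" just describes how many worked examples get packed into the prompt ahead of the test item. A minimal sketch (made-up items, and actually querying a model is out of scope):

```python
# Minimal sketch of zero-shot vs. few-shot evaluation: the same test item,
# wrapped with 0 or k worked examples. Items are made up for illustration.
DEMOS = [
    ("Q: Is the Moon larger than the Earth?", "A: No"),
    ("Q: Do triangles have three sides?", "A: Yes"),
]

def build_prompt(question: str, shots: int) -> str:
    """Prepend `shots` worked examples to the test question."""
    parts = [f"{q}\n{a}" for q, a in DEMOS[:shots]]
    parts.append(f"{question}\nA:")
    return "\n".join(parts)

print(build_prompt("Q: Is ice colder than steam?", shots=0))  # zero-shot
print(build_prompt("Q: Is ice colder than steam?", shots=2))  # two-shot
```

Same test item, very different task difficulty, and the headline number rarely says which one was measured.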
Well worth the two minutes it took me to read this!
I bet it wouldn't be hard to come up with special versions of the standardized tests, specifically designed to head-fake LLMs, that would make them fail horribly while still working just fine for humans.
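A sketch of what such head-fake variants might look like, assuming simple meaning-preserving surface edits (renamed entities, shuffled answer order) that shouldn't faze a human test-taker:

```python
# Hypothetical sketch of a meaning-preserving perturbation: rename entities
# to rare tokens and shuffle the answer options. A human reads the same
# problem; a model leaning on surface statistics may not.
import random

def perturb_item(question: str, options: list, seed: int = 0):
    """Return a surface-perturbed but logically identical test item."""
    rng = random.Random(seed)
    options = list(options)                   # don't mutate the caller's list
    swaps = {"Alice": "Zorv", "Bob": "Quux"}  # arbitrary rare-token names
    for old, new in swaps.items():
        question = question.replace(old, new)
        options = [opt.replace(old, new) for opt in options]
    rng.shuffle(options)
    return question, options

q, opts = perturb_item(
    "Alice gives Bob three apples and keeps two. Who has more apples?",
    ["Alice", "Bob", "They have the same number"],
)
print(q)    # "Zorv gives Quux three apples and keeps two. ..."
print(opts)
```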
It’s good to remember some caveats like the ones you mentioned when extrapolating benchmark findings into sensationalist claims, fair enough.
That said, this article strikes me as less well done than some of your prior ones. Not just the provocative title, and not just the multiple typos, like “sentence compression” (comprehension), “large language modes” (models), or “that AI that performance”. It feels like the focus here is less on the outcome (AI beats humans on performance benchmarks) and more on the fact that we don’t fully understand the underlying mechanisms (in either humans or AI), and hence perhaps AI is only “simulating understanding”, not really understanding like humans do?
Typos fixed!
Giving the impression of general ability on a particularized task is not the same as genuinely outperforming humans on that task across the range of conditions found in an open environment. The latter requirement makes most of today's systems still useless because, as you also pointed out, they are highly sensitive to small perturbations of their test data relative to their training data. We are still far from advanced AI systems, and very far from AGI.
I would rewrite that Nature title as "AI beats humans at making more stupid mistakes on basic tasks". Not to mention that it can't explain any of it. The dog ate my gradient descent, perhaps?
It was not «one of the world’s most prestigious journals»: it was 'Nature News', the science news outlet of the Springer Nature group, which includes more than 200 journals with the prefix 'Nature'.
(Other than that, 👏👏👏)