Dr. Mitchell, great article and as usual, your accomplishments here are understated.
Your perturbation results suggest that analogy in current systems is conditionally evoked rather than structurally grounded, which helps explain why accuracy alone systematically overstates competence. That distinction feels like the missing invariant in much of the benchmark discourse, and your analysis locates empirically where that boundary actually sits.
Narrow benchmarks themselves aren’t the problem; systems improve along the dimensions that are incentivized, so the behavior we observe ultimately reflects which benchmarks are rewarded.
In my own work, I’ve seen a similar boundary in what I think of as the Alex-the-parrot regime: fluent symbolic behavior without an abstraction layer that survives representational re-encoding. Coherence improves markedly when systems are designed around architectural constraints that distinguish surface competence from structurally grounded capacity.
I always enjoy your articles. This is a great read and the recorded keynote presentation is wonderful to watch.
I remember from your 2019 book a fantastic chapter called "Metaphors We Live By," which you noted was based on the eponymous 1980 book by Lakoff & Johnson. Early in that book, they point out that the "argument is war" metaphor could be replaced by a different one: "argument is a dance."
So regarding your sixth principle and the push to publish more negative results: I wonder if we need a new metaphor to fundamentally change the discussion?
It seems to me that the singular focus on novelty and positive results in the science publication industry is a version of the "argument is war" metaphor. But real progress only happens, as you point out, when we operate instead under the metaphor that "argument is a dance."
Just thinking out loud :) Anyway, thanks for your great writing!
Cheers.
Twenty or more years ago, there was (and maybe still is) a saying in the IT industry that went "there are lies, damn lies, and benchmarks," paraphrasing something Churchill purportedly said about statistics.
Worthy, indeed, of a Keynote. Everyone in the AI space should read it!
As you point out, there are many reasons that benchmarks are not a good indicator of how a model will perform outside of the benchmarks. I would add that benchmarks are not experiments; they are not logically valid as measures of what the model is doing to achieve the result, which is one of the reasons they are not predictive.
Reliance on benchmarks is an example of the fallacy of affirming the consequent. "If the model reasons, then it should be able to solve this problem. It solves the problem; therefore it reasons" is fallacious reasoning. The cookies are missing from the cookie jar, therefore Melanie took them. Put another way, many states require a driver to pass a vision test. In California, the test consists of reading letters from a board. A driver could pass the test by having adequate vision or by memorizing the letters in each row. The latter person may not be able to see well enough to drive, but both of them have passed the test. Which one do you want driving near your kids' playground?
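To spell that out schematically (this is just the textbook form of the fallacy, with P standing for "the model reasons" and Q for "the model solves this problem"):

\[
\frac{P \rightarrow Q \qquad Q}{\therefore\ P} \quad \text{(invalid: affirming the consequent)}
\qquad \text{vs.} \qquad
\frac{P \rightarrow Q \qquad P}{\therefore\ Q} \quad \text{(valid: modus ponens)}
\]

A benchmark score only gives us Q; the jump from Q back to P is exactly the step that does not follow.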
You outline some careful experiments that attempt to control for alternative explanations. This kind of experimentation is sorely lacking in AI research. There is a long history of careful experimenters asking the same questions that AI research faces today (including in animal language research and concept learning). As you point out, there is also a lot of work on human cognition, much of it from around the 1980s, dedicated to finding out whether behaviorist accounts of human behavior (ones that hold it is not necessary to talk about the mind) are sufficient. In short, there is a large amount of thinking that can be tapped and reassessed in the context of today's AI issues.
Thank you for raising these issues. The field will be better if the participants pay attention to what you are telling them.
Here's a link to some related ideas: https://online.fliphtml5.com/ReliathAI/hepq/#p=1,
https://herbertroitblat.substack.com/p/what-in-the-world-is-a-world-model
Thank YOU.
Just FYI, your talk is available to me via the link you gave even though I am NOT registered at NeurIPS 2025. Thanks.
As someone whose background is in neither computers nor psychology, this was clear and comprehensible--and fascinating! As an educator, I appreciate your call for more creativity in designing evaluations for AI tools; I believe 'creative thinking' is one of those 'soft skills' that is going to become increasingly valuable in our future.
Side note: my goodness that conference room is intimidating!!! I've mostly overcome my own stage fright, but speaking in front of that many people would definitely cause me some anxiety!
This is a great article; I’m always happy when I see proper cross-disciplinary thinking with cognitive science in AI. Ever since "Attention Is All You Need," AI/ML has completely sidelined that aspect, and the field has been the poorer for it. I also agree with your contention that current benchmarking has little connection to real-world cognition in practical use.
Very cool read and excitingly pragmatic. I find value in your encouragement to embrace negative results; taken poorly, they otherwise dampen curiosity.
thanks 🙂
This was a great read, thank you. It does seem the key takeaway is less commercial hype and more scientific method.
I’ve had similar views on benchmarks and frustration with the lack of engagement on the gulf between benchmarks and the real world, but you expressed it a lot better than I could have.
Really valuable framework
Great presentation!
Many thanks for this fascinating material. I find it very persuasive and feel very confident that scientific, and then societal, progress in AI (well beyond the current bubble and hype fest) will happen having incorporated some form of your principles.
Why are babies considered an “alien” intelligence?
See https://osf.io/preprints/psyarxiv/uacjm_v1