Great article. I posit that reluctance to try to reproduce results and examine failure is why true progress in AI research gets sidelined in favor of the elusive search for clever new ways to hack benchmarks. There is no percentage in asking, "Did this model get the correct results for the wrong reasons?" if you work for one of the frontier labs. It's a common-sense question, and over a beer I'm sure a lot of researchers would agree it's a worthwhile thought experiment. In practice, you'd find research teams reluctant to carry it out, precisely BECAUSE of the notion that everything is moving so fast, which is to say, a culture of manic FOMO: they don't want to be slowed down by what they'd frame as "side issues," when those systemic eval flaws are actually central to ongoing performance failure. IMO, one reason frontier labs avoid digging into failure modes too much is that doing so would point to architecture issues that are not easily fixed. All of which comes down to the old saying, "It is hard to get somebody to understand something (e.g., the need for curiosity about both success and failure) when his job depends on NOT understanding it." I also suspect the data contamination that quite conveniently allows models to score higher on benchmarks is not accidental, but a kind of contamination that is avidly sought for training data. Meta admitted as much.
I always enjoy your articles. This is a great read and the recorded keynote presentation is wonderful to watch.
I remember from your 2019 book a fantastic chapter called "Metaphors We Live By," which you noted was based on the eponymous 1980 book by Lakoff & Johnson. They say at the beginning of their book that the "argument is war" metaphor can also be exchanged for a different metaphor: "argument is a dance."
So regarding your sixth principle and trying to promote publication of more negative results - I wonder if we need a new metaphor to fundamentally change the discussion?
It seems to me that the singular focus on novelty and positive results in the science publication industry is a version of the "argument is war" metaphor. But real progress only happens, as you point out, when we have a figurative situation where "argument is a dance."
Just thinking out loud :) Anyway, thanks for your great writing!
Cheers.
Twenty or more years ago, there was (and maybe still is) a saying in the IT industry that went "there are lies, damned lies, and benchmarks," paraphrasing something Churchill purportedly said about statistics.
Dr. Mitchell, great article and as usual, your accomplishments here are understated.
Your perturbation results suggest that analogy in current systems is conditionally evoked rather than structurally grounded, which helps explain why accuracy alone systematically overstates competence. That distinction feels like the missing invariant in much of benchmark discourse. What your analysis achieves is an empirical location of where that boundary actually sits.
Narrow benchmarks themselves aren’t the problem; systems improve along the dimensions that are incentivized, so the behavior we observe ultimately reflects which benchmarks are rewarded.
In my own work, I’ve seen a similar boundary in what I think of as the Alex-the-parrot regime: fluent symbolic behavior without an abstraction layer that survives representational re-encoding, and where coherence improves markedly when systems are designed around architectural constraints that distinguish surface competence from structurally grounded capacity.
You had me at 'clever Hans' :-) But that great illustration aside, your concept of 'approximate retrieval' crystallizes something I've been wrestling with since your Othello post. The 8x8→10x10 brittleness showed memorized structure, not abstracted causality. Now I see the mechanism more clearly: LLMs aren't building world models; they're interpolating within a vast, static space of training patterns.
What strikes me is the contrast in efficiency. The brain uses 20 watts precisely because it can't afford to store everything; it's forced to allocate attention dynamically and build causal models on the fly. The constraints aren't bugs; they're what drove modularity and plasticity. LLMs, unconstrained by energy or memory limits, took a different path: scale the interpolation space until it approximates understanding.
This suggests benchmarks might need to test not just performance but efficiency under constraint: can the system maintain capability when we deliberately limit its resources? That's closer to how evolution tested intelligence.
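To make that concrete, here is a rough sketch of what I mean; the function names and budget values are placeholders, not any real benchmark's API.

```python
# Sketch: re-run the same benchmark items under progressively tighter resource
# budgets and watch how accuracy degrades. `run_item` is a hypothetical stand-in
# for whatever harness actually evaluates one item under a hard token limit.

def run_item(item, max_tokens: int) -> bool:
    """Hypothetical: solve one benchmark item under a token budget;
    return True if the answer is correct."""
    raise NotImplementedError

def accuracy_under_budgets(items, budgets=(2048, 512, 128)):
    results = {}
    for budget in budgets:
        correct = sum(run_item(item, max_tokens=budget) for item in items)
        results[budget] = correct / len(items)
    # A system with robust competence should degrade gracefully as the budget
    # shrinks; one leaning on long, brittle retrieval chains may fall off a cliff.
    return results
```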
I agree completely!
Has there been research that developed approaches where a system asks the question “how do I know the result is correct” or some other type of meta reasoning?
Not that I am aware of. It raises an interesting and potentially illuminating question - why not?
This is total conjecture on my part, but I wonder whether doing so might be undesirable for the model owners. They might see it as undermining the value, though ironically I believe it would be a point of differentiation. Furthermore (the conjecture continues), perhaps this is harder than we might believe given the current architecture. If that is true, and as more people ask the question you raise, then they have a much bigger problem: their models have a massive and potentially catastrophic blind spot ... which most definitely isn't going in any marketing or promotional copy.
Again, I want to be clear that I don't know these things to be true, but they are, I hope, educated guesses.
Melanie - do you have any insight here?
Good points. Seems to me that the meta-question "how do I know the result is correct?" may be harder than coming up with the original results.
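For concreteness, here is roughly the shape such a meta-reasoning wrapper could take. This is a hypothetical sketch; `ask_model` is a placeholder for whatever LLM API one has access to, not a real library call.

```python
# Sketch of a self-check loop: get an answer, then explicitly pose the
# meta-question "how do I know this is correct?" and return both.

def ask_model(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API."""
    raise NotImplementedError

def answer_with_self_check(question: str) -> dict:
    # Step 1: get a candidate answer.
    answer = ask_model(question)

    # Step 2: ask the meta-question with the candidate answer in context.
    critique = ask_model(
        f"Question: {question}\nProposed answer: {answer}\n"
        "How do you know this answer is correct? List the assumptions it "
        "depends on and any ways it could be wrong."
    )

    # Step 3: return both, so a human or a further check can weigh them.
    return {"answer": answer, "critique": critique}
```

Of course, this only restates the question to the same system, which is exactly why it may be harder than it looks.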
Clever Hans never passed a bar exam.
Y'all are just SO obvious with your snowflake cope.
> This last principle says that one of the best things you can do is look at the 12% of items that your model got wrong. Analyzing the types of errors made by a system can be one of the best ways to get insight into how the model works.
Another surprise to me is that someone needed to say this out loud. During my software career I transitioned from ML/DataSci to security engineering. It’s *obvious* that the false negatives a system produces are more valuable for improving it than a raft of true (or even false) positive detections.
Thanks for the link to the ConceptARC paper.
This is a great article. I strongly agree that AI needs to be taken off the “innovation-driven” shelf and placed firmly on the “cognitive science” shelf. Treating AI primarily as a scientific object of study, rather than just a technological product, seems essential if we want to make real progress in understanding what these systems can (and cannot) actually do.
Absolutely. We are just now understanding that these GenAI systems can do more than even their designers and architects thought.
This is a great article; I'm always happy when I see proper cross-disciplinary thinking with cognitive science in AI. Ever since "Attention Is All You Need," AI/ML has completely sidelined that aspect, and the field has been the poorer for it. I also agree with your contention that current benchmarking has little connection to real-world cognition in practical use.
Very cool read and excitingly pragmatic. I find value in your encouragement to embrace negative results, since otherwise, taken poorly, they dampen curiosity.
thanks 🙂
Worthy, indeed, of a Keynote. Everyone in the AI space should read it!
As you point out, there are many reasons that benchmarks are not a good indicator of how a model will perform outside of the benchmarks. I would add that benchmarks are not experiments: they are not logical as measures of what the model is doing to achieve the result, which is one of the reasons they are not predictive.
Reliance on benchmarks is an example of the fallacy of affirming the consequent: "If the model reasons, then it should be able to solve this problem. It solves the problem; therefore it reasons" is fallacious reasoning. The cookies are missing from the cookie jar; therefore Melanie took them. Put another way, many states require a driver to pass a vision test. In California, the test consists of reading letters from a board. A driver could pass the test by having adequate vision or by memorizing the letters in each row. The latter person may not be able to see well enough to drive, but both of them have passed the test. Which one do you want driving near your kids' playground?
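In bare logical form, with \(P\) = "the model reasons" and \(Q\) = "the model solves the problem" (my schematic gloss of the point above):

\[
\frac{P \rightarrow Q \qquad Q}{P}\ \text{(invalid: affirming the consequent)}
\qquad
\frac{P \rightarrow Q \qquad \neg Q}{\neg P}\ \text{(valid: modus tollens)}
\]

A benchmark score only hands us \(Q\); failures, analyzed carefully, are what license the valid inference, which is part of why error analysis is more informative than tallying successes.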
You outline some careful experiments that attempt to control for alternative explanations. This kind of experimentation is sorely lacking in AI research. There is a huge history of careful experimenters asking the same question as AI research is facing today (including animal language research, concept learning). As you point out, there is also a lot of work on human cognition, much of it in the 1980s or so dedicated to finding out whether behaviorist (it is not necessary to talk about mind) accounts of human behavior are sufficient. In short, there is a large amount of thinking that can be tapped and reassessed in the context of today's AI issues.
Thank you for raising these issues. The field will be better if the participants pay attention to what you are telling them.
Here's a link to some related ideas: https://online.fliphtml5.com/ReliathAI/hepq/#p=1,
https://herbertroitblat.substack.com/p/what-in-the-world-is-a-world-model
Thank YOU.
Just FYI, your talk is available to me via the link you gave even though I am NOT registered at NeurIPS 2025. Thanks.
The real problem with applying human-designed cognitive tests like the WAIS intelligence test to LLMs isn't anthropomorphism per se; it's that the normative framework is meaningless. A WAIS score means something because of what it predicts in a human life: educational outcomes, occupational functioning, adaptive capacity. Applying that metric to an LLM produces a number with no referent. An LLM may well score at the 99th percentile, but that tells us nothing about its capacities on their own terms, because the score was never designed to predict anything beyond human functioning.
Cool to see you in the NYT piece. It's great that you say it's a myth that AI has "magic" or "emergent" properties. I think the word "emergent" has come to be used in science overall as some sort of magical invocation... which to me is a sign that our current science is limited and simply can't cope with many real-world phenomena.
Thank you for sharing and synthesizing this, Melanie. I do not consider myself a 'killjoy' scientist, but I have to say I am increasingly unhappy not to be able to explain when something is working (let alone when it is not). Working on evolutionary algorithms does this to you -- you reach for mechanisms (why, why not). I remember a very distinct experience at a conference last year. An engineer from Google (I hope I am attributing this right) was making the argument that LLMs are solving graph-related problems. He gave his keynote and all were wowed, as usual. I raised my hand and asked whether he could understand what was going on. He immediately went to the negative results, offering the usual 'oh, we did not investigate these but will ... in future work' (which, future work being what it is, will probably never happen). I had to interrupt him: "No, my apologies. On the stuff that worked, did you ask yourself why?" He had no answer. It left a distinct taste, like having overpaid for a bad cup of coffee.
Dear Professor Mitchell, this is a really good article, thank you for writing it.
Please could you link your substack to your web page? I found this article a while ago by chance and wanted to refer to it today - I spent quite a while hunting for it on your website!
Anyway - great stuff, keep on keeping the fire alive!
There's a link already there -- maybe too hard to see?
Excellent as always. I've mailed a copy of our Crochet Coral Reef book to you at SFI. Looking forward to seeing you when I move there.
Got the books -- thanks so much!!