26 Comments
Josh Brake

Great paper. And I love this practice of writing about the science in a blog post, in a more publicly accessible style!

Renaud Gaudron

I 100% agree!

jazzbox35

“Inside vs. outside”, “top vs. bottom”, “same vs. different”... this is metaphysics to me. I've long believed that these basic first principles are the way to start AGI. In fact, "same and different" appear in Aristotle's Metaphysics, Book V, chapter 9. But different philosophers favored different structures, of course.

Paul Soldera

This is a great article. Thanks so much for summarizing the paper; reading a summary first makes it so much more accessible! What strikes me is how hard it must be to form any sense of an abstract concept from text input alone, given the way an LLM is structured. Human input is so different. We create meaning by interacting in a 3D world where relative position is baked into how we navigate things. So many of our abstract concepts are spatial or visual.

Do you think a neural network with some type of reinforcement learning system, but whose only input was visual (through a camera), would create visual and spatial abstractions to help it navigate? Or do we simply have some other type of learning system that isn't just a neural net?

Melanie Mitchell

Indeed, this is a question we are pursuing. I'm not sure reinforcement learning is enough. There might need to be some architectural bias towards spatial representations (as we seem to have in our brains).

Matt Hawthorn

I agree. It seems natural that we would have evolved very strong priors toward "objectness" and spatial primitives in our neural architectures, just as translation invariance is baked into the visual cortex via something like convolution. There simply aren't enough resources, including time, for each new generation to induce these priors from raw experience alone, the way we're expecting these models to do. Current training regimes expend far more energy and require far more data to reach the point where models can begin to abstract than it takes to get a human infant through the developmental stages to a similar level.
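
For illustration, a minimal NumPy sketch of the kind of architectural prior being described: a convolution's shared weights give translation equivariance for free, while a generic dense layer offers no such guarantee. The filter and input values here are arbitrary, made up for the demo.

```python
import numpy as np

def conv1d_valid(x, w):
    """Slide one shared filter w across signal x (no padding): weight sharing is the prior."""
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

rng = np.random.default_rng(0)

x = np.zeros(12)
x[3:6] = [1.0, 2.0, 1.0]                 # a small "object"
x_shifted = np.roll(x, 2)                # the same object, moved 2 cells right

w = np.array([1.0, -1.0, 0.5])           # one shared filter (values arbitrary)

# Convolution: the response to the shifted object is just the shifted response.
y, y_shifted = conv1d_valid(x, w), conv1d_valid(x_shifted, w)
print(np.allclose(y_shifted[2:], y[:-2]))             # True: equivariance comes for free

# A dense layer with unconstrained weights makes no such promise.
W = rng.normal(size=(10, 12))
print(np.allclose(W @ x_shifted, np.roll(W @ x, 2)))  # False in general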

Simon Crase

I wonder whether visual input is enough. People knew about hallucinations long before we had AI.

Kevin

It seems so hard to do research on these things. o3 and Sonnet 4 aren't even available any more. (Or maybe I can get them through some API? I can't seem to get them through the regular consumer interface.)

Either way, I am convinced that the LLMs measured here are clearly subhuman on this sort of spatial reasoning task. There's also the matter of input formatting. It seems like the LLM companies still have not "hooked together" their best general reasoning and their best image processing. You can't just paste in an image and ask questions like this; you have to provide the input in textual form.

In some sense, though, it feels like... can it really be that hard to connect the models? For spatial reasoning specifically, it seems like the LLMs are likely to improve a lot in the near future.

Andreas Stolcke

Great study. Given just two samples, there are many ways to overfit the "training" data and come up with counter-intuitive generalizations. Humans are primed to prefer certain generalizations as much more "natural" or "parsimonious" based on a lifetime of dealing with visual stimuli. For example, when deciding how to draw the blue line (vertically or horizontally), any number of features could discriminate the two demonstrations. The demonstration itself primes us to pay attention to linear orientation, so it is more natural to also look for a discriminating feature having to do with orientation. The question is whether prompting or pretraining with some kind of "life experience" can predispose generative models to show rule priors similar to humans'.

Alain Dauron

It's funny how those AI models sometimes detect the right rule, but are unable to apply it. Why not offer them a symbolic extension 🤣?!

Simon Crase

Alain, isn't that what people do, too? How often do we know the right rule, but do something else?

Scott Francis

I think that difference in abstraction (objects vs. color values) is significant. Interesting research!

Reuben Adams

Very interesting! I think this work points to an unfortunate trade-off. To go beyond accuracy, this kind of research requires open benchmarks. But to avoid training on the test set, we need closed benchmarks.

Ian Varley

Since LLMs appear (to me at least) to have strong role-playing capabilities, I wonder if performance would change substantially if the models were given a prompt like “Approach this task the way a human would, applying visual object identification priors”, or something like that. It may be that more powerful and human-like reasoning capabilities are available in latent space, but aren’t strongly triggered by the experimental setup?
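
For concreteness, a minimal sketch of what such a prompt wrapper might look like; the preamble wording and the `ask_model` call are hypothetical placeholders, not anything from the paper.

```python
# Hypothetical wrapper: prepend a role-playing instruction to the task prompt.
HUMAN_PRIOR_PREAMBLE = (
    "Approach this task the way a human would: treat contiguous same-colored "
    "cells as objects, and prefer simple spatial relations (inside/outside, "
    "top/bottom, same/different) over rules about raw cell values.\n\n"
)

def with_human_priors(task_prompt: str) -> str:
    """Return the task prompt with the human-priors preamble prepended."""
    return HUMAN_PRIOR_PREAMBLE + task_prompt

# response = ask_model(with_human_priors(task_prompt))  # ask_model: stand-in for the API call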

Melanie Mitchell

I hope someone will try this!

Matt Hawthorn

There's another measure of correctness I'm curious about: consistency between explanations and predictions, i.e. between the inductive and deductive operation of the models. We know from other studies (I remember one from Anthropic) that LLMs' explanations of how they accomplish certain tasks (e.g. arithmetic) differ greatly from how we observe them actually doing the same tasks in mechanistic interpretability studies. I think you alluded to this in the visual input case by noting that models were more often correct at induction (describing a correct rule) than deduction (producing a correct inference). In those cases, the induced rule as described in natural language has to be inconsistent with the deductions of specific grid outputs.

I'm curious how often this kind of inconsistency arises, and how it's affected by the experimental context. For instance, is there more inconsistency when you ask the same model to predict a grid and describe a rule in separate runs, vs when you ask the model to do both in the same output, or ask it to do one and then provide its output as context when asking it to do the other? E.g. "given the above problem description and your solution, describe the rule you used to produce it" or "given the problem description and the rule you induced from it, apply the rule to this example".
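
A rough sketch of those three conditions, with `ask_model` as a hypothetical stand-in for whatever model API the experiment would actually use:

```python
# Hypothetical protocol for measuring induction/deduction consistency.
def ask_model(prompt: str) -> str:
    """Stand-in for the real model call (e.g. an API request)."""
    raise NotImplementedError

def separate_runs(problem: str, test_input: str):
    """Condition 1: induce the rule and predict the grid in independent runs."""
    rule = ask_model(f"{problem}\nDescribe the transformation rule in words.")
    grid = ask_model(f"{problem}\nApply the transformation to:\n{test_input}")
    return rule, grid

def joint_run(problem: str, test_input: str) -> str:
    """Condition 2: ask for the rule and the prediction in a single output."""
    return ask_model(
        f"{problem}\nFirst describe the rule in words, then apply it to:\n{test_input}"
    )

def chained_runs(problem: str, test_input: str):
    """Condition 3: do one step, then feed its output back as context for the other."""
    grid = ask_model(f"{problem}\nApply the transformation to:\n{test_input}")
    rule = ask_model(
        f"{problem}\nYour solution was:\n{grid}\n"
        "Describe the rule you used to produce it."
    )
    return rule, grid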

AlexT

Does the footnote sort of imply the tests might have been part of the training data, or am I misunderstanding?

Melanie Mitchell

It's clear that these models (especially o3) were trained on some versions of ARC. I'm not sure about our specific benchmark. Data contamination is always a problem unless you have a private test set, and then if you give the private test set to a proprietary model, it is no longer private!

AlexT

Yeah, the way I'd want to do it is to constantly make up new tests, similar but differentiated enough that training on the old set provides no advantage, or even actively hinders the mindless cheater.
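
As a toy illustration of that idea, a sketch of a generator that invents a fresh instance of a simple ARC-style rule on every call; the rule and parameters are made up for the demo, so memorizing old instances gives a solver no advantage.

```python
import random

def make_task(rng: random.Random, size: int = 8):
    """Toy generator: draw a rectangle outline in a random color at a random
    position; the target grid fills its interior. Every call samples fresh
    parameters, so only inducing the rule itself helps."""
    grid = [[0] * size for _ in range(size)]
    h, w = rng.randint(3, 5), rng.randint(3, 5)
    top, left = rng.randint(0, size - h), rng.randint(0, size - w)
    color = rng.randint(1, 9)

    for r in range(top, top + h):
        for c in range(left, left + w):
            if r in (top, top + h - 1) or c in (left, left + w - 1):
                grid[r][c] = color           # rectangle outline

    target = [row[:] for row in grid]
    for r in range(top + 1, top + h - 1):
        for c in range(left + 1, left + w - 1):
            target[r][c] = color             # rule: fill the interior
    return grid, target

fresh_input, fresh_target = make_task(random.Random())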

Preston Cole Johnson

Yes. I’ve done it myself.

Renaud Gaudron

Great article! It would be really interesting to see how your findings extend to the newer ARC-AGI-2, where the level of abstraction is even higher.

loix

Thanks for making both the summary and the article available! I wish more AI research were like this, i.e., focused on teasing out different capabilities instead of obsessing over metrics alone. I also appreciated that the summary included the demos, which I could independently test out without the "crib notes" and answer keys.

As a serious user who chats with all three AIs daily, mostly about cerebral topics (but not an AI researcher), I found it puzzling that Gemini Pro was the only advanced model tested (as opposed to GPT-5 or Opus 4), since Gemini Pro has the most accurate visual perception of the three. It would have been helpful and economical (saving the researchers extra turns and tokens) to start each round of testing by having the models describe the input, since if the perception is incorrect, any result downstream of that is meaningless.
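
A minimal sketch of that perception-first protocol; `ask_model` and `descriptions_match` are hypothetical stand-ins, not anything from the paper's setup.

```python
# Hypothetical perception check: verify the model "sees" the input before
# spending tokens on (and scoring) its reasoning.
def ask_model(prompt: str) -> str:
    """Stand-in for the real model call."""
    raise NotImplementedError

def descriptions_match(described: str, expected: str) -> bool:
    """Stand-in for however the description would be checked (scripted or by hand)."""
    raise NotImplementedError

def run_round(task_input: str, expected_description: str):
    described = ask_model(f"Describe this input grid:\n{task_input}")
    if not descriptions_match(described, expected_description):
        return None          # perception failed; downstream results would be meaningless
    return ask_model(f"{task_input}\nNow solve the task.")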

Simon Crase

Today I noticed a 2-year-old boy studying fluid dynamics in the shopping plaza. He was playing with a fountain that randomly varied between one and a dozen jets, bashing the jets to see what happened. IMHO objects are difficult: the toddler had to process visual, tactile, and audio information. And, of course, his mother was there, making sure nothing bad happened. I wonder how an AI can be expected to infer objectness.

SorenJ

The website https://arcprize.org/leaderboard seems to imply that o3 is just under human performance? And that is only with o3 (long), a configuration that was never made publicly available and uses an absurd amount of compute?

Anyway... Setting aside the question of whether their absolute performance right now is better or worse than human (which, given my quibbles above, I would still say is subhuman), the more interesting thing to me is that you can clearly observe the cost per task decreasing over time, the accuracy per task increasing over time, and the accuracy per task increasing with compute/token scaling. So, assuming the trends hold, at some point in the not-too-distant future we should expect this benchmark to be saturated.

What then?

Does that imply that AI reasoning models would then abstract and reason like humans, but before they didn't? There's a weakness in that framing. I don't think there will be a point in time at which they "reason like humans." Instead, there is just a single feature called "intelligence," and right now humans are more intelligent overall (especially for tasks like this; not so much for trivia), and human reasoning just looks better because it is more "intelligent." So as AI systems get smarter, they will start to look like they reason more and more like humans. Eventually they may surpass us, and our reasoning will look shallow compared to theirs.

AlexT

If the trend holds, sure. That's the rub, though, right? It's slowing down pretty badly, AIUI.

SorenJ

I don’t think it’s actually slowing down. That’s certainly a popular narrative, though. But on objective measures (like the METR benchmark and other benchmarks), that hasn’t been shown to be true.
