29 Comments
Josh Brake

Great paper. And I love this practice of writing about the science in a blog post and a more publicly accessible style!

Renaud Gaudron

I 100% agree!

jazzbox35

“inside vs. outside”, “top vs. bottom”, “same vs. different”... this is metaphysics to me. I've believed that these basic first principles are the way to start AGI. In fact, "same and different" are in Aristotle's Metaphysics, chapter five, #9. But different philosophers liked different structures of course.

Paul Soldera

This is a great article. Thanks so much for summarizing the paper; reading a summary first makes it so much more accessible! What strikes me is how hard it must be to form any sense of an abstract concept from text input alone, given the way an LLM is structured. Human input is so different. We create meaning by interacting in a 3D world where relative position is baked into how we navigate things. So many of our abstract concepts are spatial or visual.

Do you think a neural network with some type of reinforcement learning system, whose only input was visual (through a camera), would create visual and spatial abstractions to help it navigate? Or is there simply some other type of learning system we have that isn't just a neural net?

Melanie Mitchell

Indeed, this is a question we are pursuing. I'm not sure reinforcement learning is enough. There might need to be some architectural bias towards spatial representations (as we seem to have in our brains).

Matt Hawthorn

I agree. It seems natural that we would have evolved very strong priors toward "objectness" and spatial primitives in our neural architectures, just as translation invariance is baked into the visual cortex via something like convolution. There simply aren't enough resources, including time, for each new generation to induce these priors from raw experience alone, the way we're expecting these models to do. Current model training regimes expend far more energy and require far more data to reach the point where they can begin to abstract than it takes to get a human infant through the developmental stages to a similar level.
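To make that point concrete with a toy example of my own (not anything from the paper): a convolution applies the same small filter at every position, so a pattern that moves in the input produces a correspondingly moved response rather than something the network has to relearn for every location. A minimal NumPy sketch:

```python
# Toy illustration: the spatial prior is built into the convolution
# operation itself rather than learned from data.
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation with 'valid' padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

grid = np.zeros((8, 8))
grid[1:3, 1:3] = 1.0                       # a small "object"
shifted = np.roll(grid, shift=2, axis=1)   # same object, moved 2 cells right

edge_filter = np.array([[1.0, -1.0]])      # responds to vertical edges

a = conv2d_valid(grid, edge_filter)
b = conv2d_valid(shifted, edge_filter)

# The response to the shifted object is just the original response, shifted.
print(np.allclose(b[:, 2:], a[:, :-2]))    # True
```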

Simon Crase

I wonder whether visual input is enough. People knew about hallucinations long before we had AI.

Kevin

It seems so hard to do research on these things. o3 and Sonnet 4 aren't even available any more. (Or maybe I can get them through some API? I can't seem to get them through the regular consumer interface.)

Either way, I am convinced that the LLMs measured here are clearly subhuman on this sort of spatial reasoning task. There's also the matter of the input formatting. It seems like the LLM companies still have not "hooked together" their best general reasoning and their best image processing. You can't just paste in an image and ask questions like this; you have to provide the task in textual form.

In some sense, though, it feels like: can it really be that hard to connect the models? For spatial reasoning specifically, it seems likely that LLMs will improve a lot in the near future.
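To illustrate what "provide the task in textual form" amounts to, here is a minimal sketch (not necessarily the exact format used in the paper) of serializing an ARC-style grid into a text-only prompt:

```python
# Each cell is an integer color code; the grid is flattened into one line
# per row so it can be pasted into a text-only prompt.
def grid_to_text(grid):
    """Serialize a 2D list of color codes into a line-per-row string."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

example_input = [
    [0, 0, 0, 0],
    [0, 3, 3, 0],
    [0, 3, 3, 0],
    [0, 0, 0, 0],
]

prompt = (
    "Here is the input grid (0 = black, 3 = green):\n"
    + grid_to_text(example_input)
    + "\nDescribe the transformation rule and produce the output grid."
)
print(prompt)
```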

Andreas Stolcke

Great study. Given just two samples, there are many ways to overfit the "training" data and come up with counter-intuitive generalizations. Humans are primed to prefer certain generalizations as much more "natural" or "parsimonious" based on a lifetime of dealing with visual stimuli. For example, when deciding how to draw the blue line (vertically or horizontally), any number of features could discriminate the two demonstrations. The demonstration itself primes us to pay attention to linear orientation, so it is more natural to also look for a discriminating feature having to do with orientation. The question is whether prompting or pretraining with some kind of "life experience" can predispose generative models to show similar priors for rules as humans do.

Alain Dauron

It's funny how those AI models sometimes detect the right rule, but are unable to apply it. Why not offer them a symbolic extension 🤣?!

Simon Crase

Alain, isn't that what people do, too? How often do we know the right rule, but do something else?

Scott Francis

I think that difference between abstracting over objects vs. color values is significant. Interesting research!

Alison Gerber

I'm writing a dissertation on AI and Preaching. I cannot tell you how much your writing and thinking is helping me. Thank you so much! Please keep writing!

Reuben Adams

Very interesting! I think this work points to an unfortunate trade-off. To go beyond accuracy, this kind of research requires open benchmarks. But to avoid training on the test set, we need closed benchmarks.

Ian Varley

Since LLMs appear (to me at least) to have strong role-playing capabilities, I wonder if performance would change substantially if the models were given a prompt like “Approach this task the way a human would, applying visual object identification priors”, or something like that. It may be that more powerful and human-like reasoning capabilities are available in latent space, but aren’t strongly triggered by the experimental setup?
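Concretely, such an experiment might look something like the sketch below; `query_model` and `score` are hypothetical stand-ins for whatever API client and grading code a real study would use:

```python
# Rough sketch: run each task under a baseline prompt and under a
# "human priors" role-play prompt, then compare accuracy.
PRIMING_PREAMBLE = (
    "Approach this task the way a human would: look for coherent objects, "
    "their shapes, positions, and spatial relationships, before reasoning "
    "about colors or individual cells.\n\n"
)

def run_condition(tasks, query_model, score, primed=False):
    """Return mean accuracy over tasks for one prompting condition."""
    total = 0.0
    for task in tasks:
        prompt = (PRIMING_PREAMBLE if primed else "") + task["prompt"]
        answer = query_model(prompt)
        total += score(answer, task["target"])
    return total / len(tasks)

# baseline = run_condition(tasks, query_model, score, primed=False)
# primed   = run_condition(tasks, query_model, score, primed=True)
```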

Melanie Mitchell

I hope someone will try this!

AlexT

Does the footnote sort of imply the tests might have been part of the training data, or am I misunderstanding?

Melanie Mitchell

It's clear that these models (especially o3) were trained on some versions of ARC. I'm not sure about our specific benchmark. Data contamination is always a problem unless you have a private test set, and then if you give the private test set to a proprietary model, it is no longer private!

AlexT

Yeah, the way I'd want to do it is to constantly make up new tests, similar but differentiated enough that training on the old set provides no advantage, or even actively hinders the mindless cheater.
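For example, a generator along these lines could mint fresh instances of an abstract rule with randomized surface details; the "fill the rectangle interior" rule here is just an illustration, not one of the paper's tasks:

```python
# Sketch: regenerate new instances of the same abstract rule with random
# colors, sizes, and positions, so memorizing a published instance helps
# no one.
import random

def make_fill_task(rng):
    """One random instance of a 'fill the rectangle interior' task."""
    size = rng.randint(8, 12)
    color = rng.randint(1, 9)
    top, left = rng.randint(0, size - 5), rng.randint(0, size - 5)
    h, w = rng.randint(4, 5), rng.randint(4, 5)

    grid = [[0] * size for _ in range(size)]
    for r in range(top, top + h):
        for c in range(left, left + w):
            on_border = r in (top, top + h - 1) or c in (left, left + w - 1)
            grid[r][c] = color if on_border else 0

    target = [row[:] for row in grid]
    for r in range(top + 1, top + h - 1):
        for c in range(left + 1, left + w - 1):
            target[r][c] = color
    return {"input": grid, "output": target}

rng = random.Random(0)
fresh_test_set = [make_fill_task(rng) for _ in range(100)]
```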

Renaud Gaudron

Great article! It would be really interesting to see how your findings extend to the newer ARC-AGI-2, where the level of abstraction is even higher.

Matt Hawthorn

There's another measure of correctness I'm curious about: consistency between explanations and predictions, i.e., between the inductive and deductive operation of the models. We know from other studies (I remember one from Anthropic) that LLMs' explanations of how they accomplish certain tasks (e.g., arithmetic) differ greatly from how we observe them actually doing those tasks in mechanistic interpretability studies. I think you alluded to this in the visual-input case by noting that models were more often correct at induction (describing a correct rule) than at deduction (producing a correct inference). In those cases, the induced rule as described in natural language has to be inconsistent with the specific grid outputs the model deduced.

I'm curious how often this kind of inconsistency arises, and how it's affected by the experimental context. For instance, is there more inconsistency when you ask the same model to predict a grid and describe a rule in separate runs, vs when you ask the model to do both in the same output, or ask it to do one and then provide its output as context when asking it to do the other? E.g. "given the above problem description and your solution, describe the rule you used to produce it" or "given the problem description and the rule you induced from it, apply the rule to this example".
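Sketched out, those conditions might look like the following, with `ask` standing in for whatever model API a real experiment would call; nothing here is from the paper, it just makes the setup concrete:

```python
# Sketch of the three probing conditions described above.
def consistency_probe(task, ask):
    # Condition 1: predict the grid and describe the rule in separate runs.
    separate_grid = ask(task + "\nProduce the output grid only.")
    separate_rule = ask(task + "\nDescribe the transformation rule only.")

    # Condition 2: ask for both in a single response.
    joint = ask(task + "\nDescribe the rule, then produce the output grid.")

    # Condition 3a: grid first, then explain it with that output as context.
    rule_given_grid = ask(
        task + "\nYour solution was:\n" + separate_grid
        + "\nDescribe the rule you used to produce it."
    )
    # Condition 3b: rule first, then apply it with that rule as context.
    grid_given_rule = ask(
        task + "\nThe rule you induced was:\n" + separate_rule
        + "\nApply that rule to the test input and produce the output grid."
    )
    return separate_grid, separate_rule, joint, rule_given_grid, grid_given_rule
```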

Preston Cole Johnson

Yes. I’ve done it myself.

Naina Chaturvedi

++ Good post. Also, start here: 100+ Most Asked ML System Design Case Studies and LLM System Design

https://open.substack.com/pub/naina0405/p/bookmark-most-asked-ml-system-design?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

Oleg Alexandrov

This is very impressive work, and I agree that it is very important to know just how deep the understanding of these models goes.

In practice, one can't just assume the models magically generalize or that they have the full visual understanding people have.
