26 Comments
Josh Brake

Great paper. And I love this practice of writing about the science in a blog post, in a more publicly accessible style!

Renaud Gaudron

I 100% agree!

jazzbox35

“Inside vs. outside”, “top vs. bottom”, “same vs. different”... this is metaphysics to me. I've long believed that these basic first principles are the way to start AGI. In fact, "same and different" appear in Aristotle's Metaphysics, Book V, chapter 9. But different philosophers favored different structures, of course.

Paul Soldera

This is a great article. Thanks so much for summarizing the paper; reading a summary first makes it so much more accessible! What strikes me is how hard it must be to form any sense of an abstract concept from text input alone, given the way an LLM is structured. Human input is so different. We create meaning by interacting in a 3D world where relative position is baked into how we navigate things. So many of our abstract concepts are spatial or visual.

Do you think a neural network with some type of reinforcement learning system, but whose only input was visual (through a camera), would create visual and spatial abstractions to help it navigate? Or do we simply have some other type of learning system that isn't just a neural net?

Melanie Mitchell

Indeed, this is a question we are pursuing. I'm not sure reinforcement learning is enough. There might need to be some architectural bias towards spatial representations (as we seem to have in our brains).

Matt Hawthorn

I agree. It seems natural that we would have evolved very strong priors toward "objectness" and spatial primitives in our neural architectures, just as translation invariance is baked into the visual cortex via something like convolution. There simply aren't enough resources, including time, for each new generation to induce these priors from raw experience alone, the way we're expecting these models to do. Current training regimes expend far more energy and require far more data to reach the point where models can begin to abstract than it takes to get a human infant through the developmental stages to a similar level.
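
For illustration, a minimal NumPy sketch of the kind of architectural prior being described: a convolution's shared weights give translation equivariance for free, while a generic dense layer offers no such guarantee. The filter and input values here are arbitrary, made up for the demo.

```python
import numpy as np

def conv1d_valid(x, w):
    """Slide one shared filter w across signal x (no padding): weight sharing is the prior."""
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

rng = np.random.default_rng(0)

x = np.zeros(12)
x[3:6] = [1.0, 2.0, 1.0]                 # a small "object"
x_shifted = np.roll(x, 2)                # the same object, moved 2 cells right

w = np.array([1.0, -1.0, 0.5])           # one shared filter (values arbitrary)

# Convolution: the response to the shifted object is just the shifted response.
y, y_shifted = conv1d_valid(x, w), conv1d_valid(x_shifted, w)
print(np.allclose(y_shifted[2:], y[:-2]))             # True: equivariance comes for free

# A dense layer with unconstrained weights makes no such promise.
W = rng.normal(size=(10, 12))
print(np.allclose(W @ x_shifted, np.roll(W @ x, 2)))  # False in general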

Simon Crase

I wonder whether visual input is enough. People knew about hallucinations long before we had AI.

Kevin

It seems so hard to do research on these things. o3 and Sonnet 4 aren't even available any more. (Or maybe I can get them through some API? I can't seem to get them through the regular consumer interface.)

Either way, I am convinced that the LLMs measured here are clearly subhuman on this sort of spatial reasoning task. There's also the matter of input formatting. It seems like the LLM companies still have not "hooked together" their best general reasoning and their best image processing. You can't just paste in an image and ask questions like this; you have to provide the input in textual form.

In some sense, though, it feels like... can it really be that hard to connect the models? For spatial reasoning specifically, it seems like the LLMs are likely to improve a lot in the near future.

Andreas Stolcke

Great study. Given just two samples, there are many ways to overfit the "training" data and come up with counter-intuitive generalizations. Humans are primed to prefer certain generalizations as much more "natural" or "parsimonious" based on a lifetime of dealing with visual stimuli. For example, when deciding how to draw the blue line (vertically or horizontally), any number of features could discriminate the two demonstrations. The demonstration itself primes us to pay attention to linear orientation, so it is more natural to also look for a discriminating feature having to do with orientation. The question is whether prompting or pretraining with some kind of "life experience" can predispose generative models to show rule priors similar to humans'.

Alain Dauron

It's funny how those AI models sometimes detect the right rule, but are unable to apply it. Why not offer them a symbolic extension 🤣?!

Simon Crase

Alain, isn't that what people do, too? How often do we know the right rule, but do something else?

Scott Francis

I think that difference in abstraction (objects vs. color values) is significant. Interesting research!

Reuben Adams

Very interesting! I think this work points to an unfortunate trade-off. To go beyond accuracy, this kind of research requires open benchmarks. But to avoid training on the test set, we need closed benchmarks.

Ian Varley

Since LLMs appear (to me at least) to have strong role-playing capabilities, I wonder if performance would change substantially if the models were given a prompt like “Approach this task the way a human would, applying visual object identification priors”, or something like that. It may be that more powerful and human-like reasoning capabilities are available in latent space, but aren’t strongly triggered by the experimental setup?
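
For concreteness, a minimal sketch of what such a prompt wrapper might look like; the preamble wording and the `ask_model` call are hypothetical placeholders, not anything from the paper.

```python
# Hypothetical wrapper: prepend a role-playing instruction to the task prompt.
HUMAN_PRIOR_PREAMBLE = (
    "Approach this task the way a human would: treat contiguous same-colored "
    "cells as objects, and prefer simple spatial relations (inside/outside, "
    "top/bottom, same/different) over rules about raw cell values.\n\n"
)

def with_human_priors(task_prompt: str) -> str:
    """Return the task prompt with the human-priors preamble prepended."""
    return HUMAN_PRIOR_PREAMBLE + task_prompt

# response = ask_model(with_human_priors(task_prompt))  # ask_model: stand-in for the API call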

Melanie Mitchell

I hope someone will try this!

Matt Hawthorn

There's another measure of correctness I'm curious about: consistency between explanations and predictions, i.e. between the inductive and deductive operation of the models. We know from other studies (I remember one from Anthropic) that LLMs' explanations of how they accomplish certain tasks (e.g. arithmetic) differ greatly from how we observe them actually doing the same tasks in mechanistic interpretability studies. I think you alluded to this in the visual input case by noting that models were more often correct at induction (describing a correct rule) than deduction (producing a correct inference). In those cases, the induced rule as described in natural language has to be inconsistent with the deductions of specific grid outputs.

I'm curious how often this kind of inconsistency arises, and how it's affected by the experimental context. For instance, is there more inconsistency when you ask the same model to predict a grid and describe a rule in separate runs, vs when you ask the model to do both in the same output, or ask it to do one and then provide its output as context when asking it to do the other? E.g. "given the above problem description and your solution, describe the rule you used to produce it" or "given the problem description and the rule you induced from it, apply the rule to this example".
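
A rough sketch of those three conditions, with `ask_model` as a hypothetical stand-in for whatever model API the experiment would actually use:

```python
# Hypothetical protocol for measuring induction/deduction consistency.
def ask_model(prompt: str) -> str:
    """Stand-in for the real model call (e.g. an API request)."""
    raise NotImplementedError

def separate_runs(problem: str, test_input: str):
    """Condition 1: induce the rule and predict the grid in independent runs."""
    rule = ask_model(f"{problem}\nDescribe the transformation rule in words.")
    grid = ask_model(f"{problem}\nApply the transformation to:\n{test_input}")
    return rule, grid

def joint_run(problem: str, test_input: str) -> str:
    """Condition 2: ask for the rule and the prediction in a single output."""
    return ask_model(
        f"{problem}\nFirst describe the rule in words, then apply it to:\n{test_input}"
    )

def chained_runs(problem: str, test_input: str):
    """Condition 3: do one step, then feed its output back as context for the other."""
    grid = ask_model(f"{problem}\nApply the transformation to:\n{test_input}")
    rule = ask_model(
        f"{problem}\nYour solution was:\n{grid}\n"
        "Describe the rule you used to produce it."
    )
    return rule, grid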

AlexT

Does the footnote sort of imply the tests might have been part of the training data, or am I misunderstanding?

Melanie Mitchell

It's clear that these models (especially o3) were trained on some versions of ARC. I'm not sure about our specific benchmark. Data contamination is always a problem unless you have a private test set, and then if you give the private test set to a proprietary model, it is no longer private!

AlexT

Yeah, the way I'd want to do it is to constantly make up new tests, similar but differentiated enough that training on the old set provides no advantage, or even actively hinders the mindless cheater.
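
As a toy illustration of that idea, a sketch of a generator that invents a fresh instance of a simple ARC-style rule on every call; the rule and parameters are made up for the demo, so memorizing old instances gives a solver no advantage.

```python
import random

def make_task(rng: random.Random, size: int = 8):
    """Toy generator: draw a rectangle outline in a random color at a random
    position; the target grid fills its interior. Every call samples fresh
    parameters, so only inducing the rule itself helps."""
    grid = [[0] * size for _ in range(size)]
    h, w = rng.randint(3, 5), rng.randint(3, 5)
    top, left = rng.randint(0, size - h), rng.randint(0, size - w)
    color = rng.randint(1, 9)

    for r in range(top, top + h):
        for c in range(left, left + w):
            if r in (top, top + h - 1) or c in (left, left + w - 1):
                grid[r][c] = color           # rectangle outline

    target = [row[:] for row in grid]
    for r in range(top + 1, top + h - 1):
        for c in range(left + 1, left + w - 1):
            target[r][c] = color             # rule: fill the interior
    return grid, target

fresh_input, fresh_target = make_task(random.Random())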

Preston Cole Johnson

Yes. I’ve done it myself.

Renaud Gaudron

Great article! It would be really interesting to see how your findings extend to the newer ARC-AGI-2, where the level of abstraction is even higher.

loix

Thanks for making both the summary and the article available! I wish more AI research were like this, i.e., focused on teasing out different capabilities instead of obsessing over metrics alone. I also appreciated that the summary included the demos, which I could independently test out without the "crib notes" and answer keys.

As a serious user who chats with all three AIs daily, mostly about cerebral topics (but not an AI researcher), I found it puzzling that Gemini Pro was the only advanced model tested (as opposed to GPT-5 or Opus 4), since Gemini Pro has the most accurate visual perception of the three. It would have been helpful and economical (saving the researchers extra turns and tokens) to start each round of testing by having the models describe the input, since if the perception is incorrect, any result downstream of that is meaningless.
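
A minimal sketch of that perception-first protocol; `ask_model` and `descriptions_match` are hypothetical stand-ins, not anything from the paper's setup.

```python
# Hypothetical perception check: verify the model "sees" the input before
# spending tokens on (and scoring) its reasoning.
def ask_model(prompt: str) -> str:
    """Stand-in for the real model call."""
    raise NotImplementedError

def descriptions_match(described: str, expected: str) -> bool:
    """Stand-in for however the description would be checked (scripted or by hand)."""
    raise NotImplementedError

def run_round(task_input: str, expected_description: str):
    described = ask_model(f"Describe this input grid:\n{task_input}")
    if not descriptions_match(described, expected_description):
        return None          # perception failed; downstream results would be meaningless
    return ask_model(f"{task_input}\nNow solve the task.")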

Simon Crase

Today I noticed a 2-year-old boy studying fluid dynamics in the shopping plaza. He was playing with a fountain that randomly varied between one and a dozen jets, bashing the jets to see what happened. IMHO objects are difficult: the toddler had to process visual, tactile, and audio information. And, of course, his mother was there, making sure nothing bad happened. I wonder how an AI can be expected to infer objectness.

SorenJ

The website https://arcprize.org/leaderboard seems to imply that o3 is just under human performance? And that is only with o3 (long), a configuration that was never made publicly available and uses an absurd amount of compute?

Anyway... Setting aside the question of whether their absolute performance right now is better or worse than human (which, given my quibbles above, I would still say is subhuman), the more interesting thing to me is that you can clearly observe the cost per task decreasing over time, the accuracy per task increasing over time, and the accuracy per task increasing with compute/token scaling. So, assuming the trends hold, at some point in the not-too-distant future we should expect this benchmark to be saturated.

What then?

Does that imply that AI reasoning models would then abstract and reason like humans, but before they didn't? There's a weakness in that framing. I don't think there will be a point in time at which they "reason like humans." Instead, there is just a single feature called "intelligence," and right now humans are more intelligent overall (especially for tasks like this; not so much for trivia), and human reasoning just looks better because it is more "intelligent." So as AI systems get smarter, they will start to look like they reason more and more like humans. Eventually they may surpass us, and our reasoning will look shallow compared to theirs.

AlexT

If the trend holds, sure. That's the rub, though, right? It's slowing down pretty badly, AIUI.

SorenJ

I don’t think it’s actually slowing down. That’s certainly a popular narrative, though. But on objective measures (like the METR benchmark and other benchmarks), that hasn’t been shown to be true.
