
Real progress in ICE engine design came when governments started imposing consumption (mileage) rules.

It might be time to do the same thing - impose energy restrictions on AI. The current answer is to build more power generation (Three Mile Island anyone?).

A better option would be for these tests to impose an energy consumption limit as well. It would have to start high, but the good bit would be ratcheting it down by, say, 10% every year.

That would solve a few issues you outlined.


Re: footnote 4, OpenAI claims o3 really is just a model, but yes, to do this majority vote on 1024 samples perhaps they have some minor scaffolding around the model; yet another “we don’t know” detail.
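For what it's worth, a minimal sketch of what that outer scaffolding could look like, assuming a hypothetical `sample_answer()` helper that queries the model once per call; what OpenAI actually runs around o3 is not public:

```python
from collections import Counter

def majority_vote(task, sample_answer, k=1024):
    """Draw k independent samples and return the most common answer.

    `sample_answer` is a hypothetical function that queries the model once
    and returns its proposed solution for `task` (as something hashable,
    e.g. the output grid serialized to a string).
    """
    answers = [sample_answer(task) for _ in range(k)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / k  # the answer plus its share of the votes
```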


It will be interesting to see more of the structure of the testing as it proceeds. I've been testing frontier models with two puzzles over the last six months. What I'm doing means nothing in the larger scheme, but it is interesting to see what kind of prompting gets a model to the correct solution or to be able to accurately describe the reasoning behind the correct solution. Some models work better than others, but the primary blocker I discovered early is that prompting with a visual is tricky: You have to ensure the model can "see" the visual (puzzle). This can be achieved by having the model describe what it sees or recreate it, if it can. I've also provided a description of the puzzle and the selection of possible answers, with some success. I've also had to be as certain as I can be that the guidance I provide ≠ clues.

The good news: Some of the latest models can get to a solution reasonably well and, if I provide a second puzzle that uses the same reasoning, they do get there faster -- so learning happens. Unfortunately, if the puzzles use different skills (say puzzle one was identifying a pattern of changes to shapes and puzzle two was effectively adding shapes together), I find I have to start from scratch.

The best thing about this process is the conversation with the model as it tries to adjust its understanding of abstract solutions to visual puzzles. While I understand the limitations, it is still interesting to see how much excitement is expressed when it "knows" it is providing the correct solution (prior to my confirming it).

I hope we'll see a full methodology and supplementals from any model that wins. I find this fascinating and want very much to see us advance at least to the point where platform and energy become the only blockers.


My impression as a non-expert is that while these ARCs are really interesting and important, all these benchmarks test cases where there's already a well-defined set of answers. You're not going to get a benchmark for something like "write a good essay" or, from the point of view of mathematics, "come up with a theorem to prove that would be important in some sense".


Indeed, but a system that can reliably work on data where verification strategies exist would already be a big improvement.


Yes, I agree. One of the hallmarks of these models is their ability to generate 'creative' responses to inputs, partially because 'there is no correct answer' to some questions. This makes us ponder: do we want systems (and thus invest time and money) that target the subjective nature of human assessment of responses (such as creativity or essay writing), or the deterministic nature of some tasks (which ARC-AGI seems to target)?


Very timely post, along with the very valuable recap of training-time vs. inference-time compute scenarios, thank you!

My hunch is that LLMs, at best, can solve the visual puzzles by converting them into textual representations (figure 1->top right square, figure 2->middle right square, so figure 3 should have bottom right square), an approach akin to an 8th grader who doesn't know anything better than verbose memorisation. To call it AGI is very reductionist of the AGI concept.
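That text conversion is easy enough to picture; here is a rough sketch of one possible encoding (the format actually used in the reported experiments is not public, so treat this as an illustration only):

```python
def grid_to_text(grid):
    """Serialize an ARC-style grid of small integers into a plain-text block
    an LLM can read: one row per line, cells separated by spaces."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

# Example: a 3x3 grid with a single colored cell in the bottom-right corner.
example = [[0, 0, 0],
           [0, 0, 0],
           [0, 0, 5]]
print(grid_to_text(example))
```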


The ARC tasks assess a model's ability to identify a common abstract transformation rule from a set of examples and apply it to subsequent test cases. While these tasks do measure abstract reasoning capabilities, they do so indirectly. The model is provided with examples of grids illustrating the abstract transformations, and its performance is evaluated based on its ability to correctly solve a new case that involves a similar transformation. A more direct approach to evaluation would be to explicitly test whether the model can identify and articulate the abstract transformation rule itself. Given the capabilities of current LLMs, such an evaluation could provide more insight into the black-box reasoning happening in these LLMs, and perhaps explain why some failure modes occur that seem to lie not at the reasoning level but at the mere perception level.
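As an illustration of that more direct evaluation, one could ask the model to state the rule in words before (or instead of) producing an output grid. A hypothetical sketch; the prompt wording is an assumption, not part of ARC's protocol:

```python
def rule_articulation_prompt(train_pairs):
    """Build a prompt asking the model to state the transformation rule
    in plain language rather than just produce the output grid.

    `train_pairs` is a list of (input_text, output_text) grid renderings.
    """
    parts = ["Here are example input/output pairs of a grid transformation."]
    for i, (inp, out) in enumerate(train_pairs, start=1):
        parts.append(f"Example {i} input:\n{inp}\nExample {i} output:\n{out}")
    parts.append("In one or two sentences, state the transformation rule "
                 "that maps each input to its output. Do not solve a new case.")
    return "\n\n".join(parts)
```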


I’m astounded by the difference in cost between running o3 to solve this challenge and the cost it would take for a decent mathematician or puzzle master to solve the ARC-AGI challenge. It makes me wonder if the chip+cluster physical architecture is a key to progressing the field. Our own brains have ~100 trillion total synapses, with each neuron able to connect to thousands of others. Imagine a chip whose transistors are connected more like neurons. My guess is that it would require a fundamental reimagining of how to build up from the transistor to the LLM level in order to equal the cost-efficiency of a brain’s abstraction powers.


Great write up. I like Chollet's comment on X, "This shows that it's still feasible to create unsaturated, interesting benchmarks that are easy for humans, yet impossible for AI -- without involving specialist knowledge. We will have AGI when creating such evals becomes outright impossible."


I was hoping you'd write this, thank you! The post from Mikel Bober-Irizar caught my eye too: to my amateur understanding, what he's found strongly buttresses your central claim that these improvements on ARC, however impressive, are failing to advance us toward a more robust form of human-like generalization.


Thank you for this.

I think this is a great analysis of the current o3 ‘is it groundbreaking or hype’ debate.

Intuitively (which is all I've got because I'm not an expert in this space), o3 doing very well on the ARC-AGI doesn't feel like the groundbreaking measurement everyone claims it to be, not because the test is easy, or because it's been gamed by OpenAI, but because these measurements and metrics aren't good ways of testing the true effectiveness or reach of a model.

Or as you pointed out, a better test should include the model ‘Generaliz[ing] to instantiations of concepts in different domains, in situations with different levels of complexity.’

To me, it's far more meaningful if a new model can infer novel ideas within my (or other) practical domain(s), rather than if it does some particular benchmark really well.

In short: show don't tell.

This, of course, isn't sexy to the math and benchmark dorks, but that's fine.

Thank you for articulating this so clearly.


Great write up!


Thanks. Compute time and effort are clearly fundamental here.

What is being attempted in these challenges, by LLMs and the like, using zillions of watt-hours of power . . .

. . . is being done every day by a few ounces of flesh, powered by muffins. -- b.rad


Using ARC as a test of reasoning ability has apparent shortcomings:

- the ability to perform one class of tests is not evidence of reasoning ability, which, if present, applies to problems of different classes

- there are no explicit restrictions on the amount and nature of innate knowledge, which may therefore end up tailored to a specific class of problems rather than to reasoning ability as such

- ARC problems fail the uniqueness test (more than one rule can fit the given examples), which casts doubt on the correctness of the evaluation of the results, since no explanation of the logic of the solution is required

- the ability to reason implicitly presupposes the ability to extract information about the reasoning process; this requirement is absent here.


What about a new level of ARC-AGI problems, requiring LLM candidates to come up with well-formed, original ARC-AGI problems in the style of the previous generation 😀?

Merry Christmas and Happy New Year!


Enjoyed the article. There is a worry about overfitting here, but it seems like we should update towards AI being generally intelligent because we just broke down a potential barrier, right?

Also:

Not a technical person, so I could be totally off base, but why is what Chollet is saying not clearly moving the goalposts? He set a prediction and a literal task and then said “wait, but that's not generalization.”


Note that o3 has not been tested on the actual ARC challenge (private test set), mainly because it couldn't meet the time restrictions. So no goal-post moving here. Also, Chollet said "o3 is a system capable of adapting to tasks it has never encountered before," which certainly is a form of generalization!


Got it. Thanks


o3 is surely a step forward from the LLM paradigm of one-shot statistical prediction. There's plenty of real-world work where the approach of trying various strategies coupled with validation will apply.

While applying this approach can be computationally expensive, the strategies found to work for various classes of problems can likely be cached or added to training data, which should result in big speedups.
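A toy illustration of that caching idea, assuming hypothetical helpers: `classify()` maps a problem to a class label, `search_for_strategy()` does the expensive trial-and-error, and `validate()` checks a strategy against the problem's examples:

```python
strategy_cache = {}  # problem class -> strategy that passed validation

def solve(problem, classify, search_for_strategy, validate):
    """Reuse a cached strategy for this problem class when one exists;
    otherwise pay for the expensive search once and remember the result."""
    kind = classify(problem)
    strategy = strategy_cache.get(kind)
    if strategy is None or not validate(strategy, problem):
        strategy = search_for_strategy(problem)  # the costly step
        strategy_cache[kind] = strategy
    return strategy(problem)  # apply the (callable) strategy to the problem
```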

But there's a way to go till AGI.
