21 Comments

Real progress in ICE engine design came when governments started imposing consumption (mileage) rules.

It might be time to do the same thing - impose energy restrictions on AI. The current answer is to build more power generation (Three Mile Island anyone?).

A better option would be for these tests to impose an energy consumption limit as well. The limit would have to start high, but the good bit would be ratcheting it down by, say, 10% every year.

That would solve a few issues you outlined.

re: footnote 4, OpenAI claims o3 really is just a model, but yes, to do this majority vote over 1024 samples perhaps they have some minor scaffolding around the model; yet another “we don’t know” detail.
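
For what it's worth, a rough sketch of what that kind of majority vote could look like; `sample_model` and the 1024-sample count are just stand-ins, since OpenAI hasn't published the actual scaffolding:

```python
from collections import Counter

def majority_vote(sample_model, task, n_samples=1024):
    """Sample the model many times on the same task and return the most
    common answer. `sample_model` is a hypothetical callable that returns
    one candidate answer (e.g., an output grid) per call."""
    answers = [sample_model(task) for _ in range(n_samples)]
    # Counter needs hashable keys, so compare serialized answers.
    counts = Counter(str(a) for a in answers)
    best_serialized, _ = counts.most_common(1)[0]
    # Return the first original answer whose serialization matches the winner.
    return next(a for a in answers if str(a) == best_serialized)
```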

My impression as a non-expert is that while these ARC tasks are really interesting and important, all these benchmarks test cases where there's already a well-defined set of answers. You're not going to get a benchmark for something like "write a good essay" or, from the point of view of mathematics, "come up with a theorem whose proof would be important in some sense".

Indeed, but a system that can reliably work on data where verification strategies exist would already be a big improvement.

It will be interesting to see more of the structure of the testing as it proceeds. I've been testing frontier models with two puzzles over the last six months. What I'm doing means nothing in the larger scheme, but it is interesting to see what kind of prompting gets a model to the correct solution or to be able to accurately describe the reasoning behind the correct solution. Some models work better than others, but the primary blocker I discovered early is that prompting with a visual is tricky: You have to ensure the model can "see" the visual (puzzle). This can be achieved by having the model describe what it sees or recreate it, if it can. I've also provided a description of the puzzle and the selection of possible answers with some luck. I've also had to be as certain as I can be that the guidance I provide ≠ clues.
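
A minimal sketch of that "can the model see it?" check, with `ask_model` as a made-up stand-in for whatever vision-capable chat API is being used:

```python
def model_can_see(ask_model, image, true_grid):
    """Crude sanity check: ask the model to recreate the puzzle grid from
    the image and compare it with the ground truth before doing any real
    prompting. `ask_model` is a hypothetical wrapper that returns the
    model's reconstruction as a list of rows."""
    reconstruction = ask_model(
        "Recreate this puzzle as a grid of numbers, one row per line.", image)
    return reconstruction == true_grid
```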

The good news: Some of the latest models can get to a solution reasonably well and, if I provide a second puzzle that uses the same reasoning, they do get there faster -- so learning happens. Unfortunately, if the puzzles use different skills (say puzzle one was identifying a pattern of changes to shapes and puzzle two was effectively adding shapes together), I find I have to start from scratch.

The best thing about this process is the conversation with the model as it tries to adjust its understanding of abstract solutions to visual puzzles. While I understand the limitations, it is still interesting to see how much excitement is expressed when it "knows" it is providing the correct solution (prior to my confirming it).

I hope we'll see a full methodology and supplementals from whichever model wins. I find this fascinating and want very much to see us advance at least to the point where platform and energy become the only blockers.

Very timely post, along with the very valuable recap of training-time vs. inference-time compute scenarios, thank you!

My hunch is that LLMs, at best, can solve the visual puzzles by converting them into textual representations (figure 1->top right square, figure 2->middle right square, so figure 3 should have bottom right square), an approach akin to an 8th grader who doesn't know anything better than verbose memorisation. To call that AGI is a very reductionist take on the AGI concept.
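
If that hunch is right, the conversion step itself is almost trivial; here is a rough sketch of how an ARC-style grid might be flattened into text for a prompt (the row-by-row wording is just one arbitrary choice):

```python
def grid_to_text(grid):
    """Flatten a 2D ARC-style grid of color codes (ints 0-9) into a plain
    textual description that can be pasted into an LLM prompt."""
    lines = []
    for r, row in enumerate(grid):
        cells = " ".join(str(c) for c in row)
        lines.append(f"row {r}: {cells}")
    return "\n".join(lines)

# Example: a 2x3 grid
print(grid_to_text([[0, 1, 0], [2, 2, 0]]))
# row 0: 0 1 0
# row 1: 2 2 0
```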

What about a new level of ARC-AGI problems, requiring LLM candidates to come up with well-formed, original ARC-AGI problems of the previous generation 😀?

Merry Christmas and Happy New Year!

o3 is surely a step forward from the LLM paradigm of one-shot statistical prediction. There's plenty of real-world work where the approach of trying various strategies coupled with validation will apply.

While applying this approach can be computationally expensive, the strategies found to work for various classes of problems can likely then be cached or added to training data, which should result in big speedups.
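
A toy sketch of that caching idea, with `classify_problem` and `expensive_search` as made-up stand-ins for whatever a real system would use:

```python
# Cache the first strategy found to work for each problem class, so later
# problems of the same class can skip the expensive search.
strategy_cache = {}

def solve(problem, classify_problem, expensive_search):
    """`classify_problem` maps a problem to a class label; `expensive_search`
    returns a callable strategy that solves problems of that class.
    Both are hypothetical placeholders."""
    key = classify_problem(problem)
    if key not in strategy_cache:
        strategy_cache[key] = expensive_search(problem)  # slow path, run once
    return strategy_cache[key](problem)  # reuse the cached strategy
```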

But there's a way to go till AGI.

"(2) At test (inference) time, to solve a task, augment the task’s demonstrations to fine-tune a LLM on that specific task." (One of the three strategies used by existing solution attempts)

I'm missing something. How is it possible to augment the task's demonstrations without first solving the task?

Each of the 'training pairs' has a horizontally-flipped version that is likely valid.

Or (if no pixels are touching the edges) one could likely create a new version of an input pair with an extra edge of border pixels.

Each of these 'fake training pairs' carries some risk of being invalid, but each augmentation method can be tested by running it against the hidden test set on previous days of the Kaggle competition.
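
To make that concrete, here is a minimal sketch of both augmentations, assuming the demonstration pairs are plain (input, output) tuples of lists of color codes; whether a given flip or border actually preserves a task's rule still has to be checked, as noted above:

```python
def hflip_pair(pair):
    """Horizontally flip both grids of an (input, output) demonstration pair."""
    inp, out = pair
    return ([row[::-1] for row in inp], [row[::-1] for row in out])

def add_border_pair(pair, color=0):
    """Wrap both grids of a pair in a one-cell border of `color`
    (only sensible if no non-background pixels touch the edges)."""
    def pad(grid):
        width = len(grid[0]) + 2
        return ([[color] * width] +
                [[color] + list(row) + [color] for row in grid] +
                [[color] * width])
    inp, out = pair
    return (pad(inp), pad(out))
```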

The winning score was actually from 2nd place, because the team with the highest score was able to opt out (post-competition) of the open-sourcing requirement. This is something Chollet says will be addressed in next year's event (as NOW they understand that some people just need the benchmark as "official recognition" of ability for funding reasons, despite the number of times I raised the issue with Knoop). That established brands, such as OpenAI, can get special access to the benchmark outside the competition window is another sticking point. And as a third thing, I don't think OpenAI did much different than regular entries, with the exception of the amount of compute they could bring to bear. If other entries had access to a million dollars in compute time, would their scores have been just as high as o3's? Everyone assumes a level playing field. It wasn't.

I think you were right to use the term brute force. And why is OpenAI willing to spend so much money to win the prize? They mistakenly think they can claim AGI, and Saltman is ready to cash out and break free from his Microsoft shackles. This is all smoke and mirrors, and at the end of the day they've spent billions just to create a mimicking toy that mocks human creativity and empowers oligarchs to automate jobs away. So far there's no net benefit to humanity, which makes the whole endeavor seem irresponsible and nihilistic.

Awesome explanation as always!

Thank you!

Great write up!

Thanks. Compute time and effort are clearly fundamental here.

What is being attempted in these challenges, by LLMs and the like, using zillions of watt-hours of power . . .

. . . is being done every day by a few ounces of flesh, powered by muffins. -- b.rad

About footnote 4: o3 is a CoT prompting agent, not a model. We already know that GPT-5 is disappointing relative to its cost, with significant delays, for that reason.

Using ARC as a test of reasoning ability has apparent shortcomings:

- the ability to solve one class of tests is not evidence of reasoning ability, which, if present, should apply to problems of different classes

- there are no explicit restrictions on the amount or nature of innate knowledge, which can therefore be tailored to a specific class of problems rather than to reasoning ability as such

- ARC problems fail the uniqueness test, which casts doubt on the correctness of the evaluation of results, since no explanation of the logic behind a solution is required

- the ability to reason implicitly presupposes the ability to extract information about the reasoning process; that requirement is absent here.

Thank you for the great insights. I would like to get your opinion on something if you have time.

I am not sure what it means, in terms of intelligence, that o3 can solve these puzzles.

The ambiguity of some solutions (https://x.com/voooooogel/status/1870344012194009296?s=46&t=QMK1b8Sf8WVB_J0b9pjIUg) revealed that this test is measuring the model's similarity to human abstract reasoning. Although there are many possible solutions to the puzzles, it is testing for the reasoning that is closest to ours. If o3 is trained on human-labeled chain-of-thought data (the evaluator you mentioned), it is no surprise to me that it can produce similar reasoning behaviour (still pretty cool, though). Instead of fitting the model to the problems and their answers, it is fitted to the reasoning process itself.

Although that does not sound like AGI, I am confused about what AGI looks like or what we are expecting these models to do. Without a definition of generalization or intelligence, we are creating benchmarks. When these big models crack those benchmarks in a way different from my expectations, I can't say how close they are to AGI or what is missing. It is quite possible that machine intelligence will be different from ours, because it won't be limited by brain-sized memory and computing power.
