
Real progress in ICE engine design came when governments started imposing consumption (mileage) rules.

It might be time to do the same thing - impose energy restrictions on AI. The current answer is to build more power generation (Three Mile Island anyone?).

A better option would be for these tests to impose an energy consumption limit as well. It would have to start high, but the good bit would be ratcheting it down by, say, 10% every year.

That would solve a few issues you outlined.


Re: footnote 4, OpenAI claims o3 really is just a model, but yes, to do this majority vote on 1024 samples perhaps they have some minor scaffolding around the model; yet another “we don’t know” detail.
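For what it's worth, a minimal sketch of what that outer scaffolding could look like, assuming a hypothetical `sample_answer()` helper that queries the model once per call; what OpenAI actually runs around o3 is not public:

```python
from collections import Counter

def majority_vote(task, sample_answer, k=1024):
    """Draw k independent samples and return the most common answer.

    `sample_answer` is a hypothetical function that queries the model once
    and returns its proposed solution for `task` (as something hashable,
    e.g. the output grid serialized to a string).
    """
    answers = [sample_answer(task) for _ in range(k)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / k  # the answer plus its share of the votes
```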


It will be interesting to see more of the structure of the testing as it proceeds. I've been testing frontier models with two puzzles over the last six months. What I'm doing means nothing in the larger scheme, but it is interesting to see what kind of prompting gets a model to the correct solution or to be able to accurately describe the reasoning behind the correct solution. Some models work better than others, but the primary blocker I discovered early is that prompting with a visual is tricky: You have to ensure the model can "see" the visual (puzzle). This can be achieved by having the model describe what it sees or recreate it, if it can. I've also provided a description of the puzzle and the selection of possible answers, with some success. I've also had to be as certain as I can be that the guidance I provide ≠ clues.

The good news: Some of the latest models can get to a solution reasonably well and, if I provide a second puzzle that uses the same reasoning, they do get there faster -- so learning happens. Unfortunately, if the puzzles use different skills (say puzzle one was identifying a pattern of changes to shapes and puzzle two was effectively adding shapes together), I find I have to start from scratch.

The best thing about this process is the conversation with the model as it tries to adjust its understanding of abstract solutions to visual puzzles. While I understand the limitations, it is still interesting to see how much excitement is expressed when it "knows" it is providing the correct solution (prior to my confirming it).

I hope we'll see a full methodology and supplementals from any model that wins. I find this fascinating and want very much to see us advance at least to the point where platform and energy become the only blockers.


My impression as a non-expert is that while these ARCs are really interesting and important, all these benchmarks test cases where there's already a well-defined set of answers. You're not going to get a benchmark for something like "write a good essay" or, from the point of view of mathematics, "come up with a theorem to prove that would be important in some sense".


Indeed, but a system that can reliably work on data where verification strategies exist would already be a big improvement.


Yes, I agree. One of the hallmarks of these models is their ability to generate 'creative' responses to inputs, partially because 'there is no correct answer' to some questions. This makes us ponder: do we want systems (and thus invest time and money) that target the subjective nature of human assessment of responses (such as creativity or essay writing), or the deterministic nature of some tasks (which ARC-AGI seems to target)?


Very timely post, along with the very valuable recap of training-time vs. inference-time compute scenarios, thank you!

My hunch is that LLMs, at best, can solve the visual puzzles by converting them into textual representations (figure 1->top right square, figure 2->middle right square, so figure 3 should have bottom right square), an approach akin to an 8th grader who doesn't know anything better than verbose memorisation. To call it AGI is very reductionist of the AGI concept.
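That text conversion is easy enough to picture; here is a rough sketch of one possible encoding (the format actually used in the reported experiments is not public, so treat this as an illustration only):

```python
def grid_to_text(grid):
    """Serialize an ARC-style grid of small integers into a plain-text block
    an LLM can read: one row per line, cells separated by spaces."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

# Example: a 3x3 grid with a single colored cell in the bottom-right corner.
example = [[0, 0, 0],
           [0, 0, 0],
           [0, 0, 5]]
print(grid_to_text(example))
```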


The ARC tasks assess a model's ability to identify a common abstract transformation rule from a set of examples and apply it to subsequent test cases. While these tasks do measure abstract reasoning capabilities, they do so indirectly. The model is provided with examples of grids illustrating the abstract transformations, and its performance is evaluated based on its ability to correctly solve a new case that involves a similar transformation. A more direct approach to evaluation would be to explicitly test whether the model can identify and articulate the abstract transformation rule itself. Given the capabilities of current LLMs, such an evaluation could provide more insight into the black-box reasoning happening in these LLMs, and perhaps explain why some failure modes occur that seem to lie not at the reasoning level but at the mere perception level.
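As an illustration of that more direct evaluation, one could ask the model to state the rule in words before (or instead of) producing an output grid. A hypothetical sketch; the prompt wording is an assumption, not part of ARC's protocol:

```python
def rule_articulation_prompt(train_pairs):
    """Build a prompt asking the model to state the transformation rule
    in plain language rather than just produce the output grid.

    `train_pairs` is a list of (input_text, output_text) grid renderings.
    """
    parts = ["Here are example input/output pairs of a grid transformation."]
    for i, (inp, out) in enumerate(train_pairs, start=1):
        parts.append(f"Example {i} input:\n{inp}\nExample {i} output:\n{out}")
    parts.append("In one or two sentences, state the transformation rule "
                 "that maps each input to its output. Do not solve a new case.")
    return "\n\n".join(parts)
```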


I’m astounded by the difference in cost between running o3 to solve this challenge and the cost it would take for a decent mathematician or puzzle master to solve the ARC-AGI challenge. It makes me wonder if the chip+cluster physical architecture is a key to progressing the field. Our own brains have ~100 trillion total synapses, with each neuron able to connect to thousands of others. Imagine a chip whose transistors are connected more like neurons. My guess is that it would require a fundamental reimagining of how to build up from the transistor to the LLM level in order to equal the cost-efficiency of a brain’s abstraction powers.


Great write up. I like Chollet's comment on X, "This shows that it's still feasible to create unsaturated, interesting benchmarks that are easy for humans, yet impossible for AI -- without involving specialist knowledge. We will have AGI when creating such evals becomes outright impossible."


I was hoping you'd write this, thank you! The post from Mikel Bober-Irizar caught my eye too: to my amateur understanding, what he's found strongly buttresses your central claim that these improvements on ARC, however impressive, are failing to advance us toward a more robust form of human-like generalization.


Thank you for this.

I think this is a great analysis of the current o3 ‘is it groundbreaking or hype’ debate.

Intuitively (which is all I've got because I'm not an expert in this space), o3 doing very well on the ARC-AGI doesn't feel like the groundbreaking measurement everyone claims it to be, not because the test is easy, or because it's been gamed by OpenAI, but because these measurements and metrics aren't good ways of testing the true effectiveness or reach of a model.

Or as you pointed out, a better test should include the model ‘Generaliz[ing] to instantiations of concepts in different domains, in situations with different levels of complexity.’

To me, it's far more meaningful if a new model can infer novel ideas within my (or other) practical domain(s), rather than if it does some particular benchmark really well.

In short: show don't tell.

This, of course, isn't sexy to the math and benchmark dorks, but that's fine.

Thank you for articulating this so clearly.


Great write up!


Thanks. Compute time and effort are clearly fundamental here.

What is being attempted in these challenges, by LLMs and the like, using zillions of watt-hours of power . . .

. . . is being done every day by a few ounces of flesh, powered by muffins. -- b.rad


Using ARC as a test of reasoning ability has apparent shortcomings:

- the ability to perform one class of tests is not evidence of reasoning ability, which, if present, applies to problems of different classes

- there are no explicit restrictions on the amount and nature of innate knowledge, which may therefore end up tailored to a specific class of problems rather than to reasoning ability as such

- ARC problems fail the uniqueness test (more than one rule can fit the given examples), which casts doubt on the correctness of the evaluation of the results, since no explanation of the logic of the solution is required

- the ability to reason implicitly presupposes the ability to extract information about the reasoning process; this requirement is absent here.


What about a new level of ARC-AGI problems, requiring LLM candidates to come up with well-formed, original ARC-AGI problems in the style of the previous generation 😀?

Merry Christmas and Happy New Year!


Enjoyed the article. There is a worry about overfitting here, but it seems like we should update towards AI being generally intelligent because we just broke down a potential barrier, right?

Also:

Not a technical person, so I could be totally off base, but why is what Chollet is saying not clearly moving the goalposts? He set a prediction and a literal task and then said “wait, but that's not generalization.”


Note that o3 has not been tested on the actual ARC challenge (private test set), mainly because it couldn't meet the time restrictions. So no goal-post moving here. Also, Chollet said "o3 is a system capable of adapting to tasks it has never encountered before," which certainly is a form of generalization!


Got it. Thanks


o3 is surely a step forward from the LLM paradigm of one-shot statistical prediction. There's plenty of real-world work where the approach of trying various strategies coupled with validation will apply.

While applying this approach can be computationally expensive, the strategies found to work for various classes of problems can likely be cached or added to training data, which should result in big speedups.
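A toy illustration of that caching idea, assuming hypothetical helpers: `classify()` maps a problem to a class label, `search_for_strategy()` does the expensive trial-and-error, and `validate()` checks a strategy against the problem's examples:

```python
strategy_cache = {}  # problem class -> strategy that passed validation

def solve(problem, classify, search_for_strategy, validate):
    """Reuse a cached strategy for this problem class when one exists;
    otherwise pay for the expensive search once and remember the result."""
    kind = classify(problem)
    strategy = strategy_cache.get(kind)
    if strategy is None or not validate(strategy, problem):
        strategy = search_for_strategy(problem)  # the costly step
        strategy_cache[kind] = strategy
    return strategy(problem)  # apply the (callable) strategy to the problem
```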

But there's a way to go till AGI.
