34 Comments

For me, the most important part is this: "when people focus on a specific target (e.g., top score on the ARC private evaluation set), they sometimes lose sight of what the benchmark is actually trying to measure (e.g., capacities for few-shot abstraction based on core knowledge concepts), and the methods developed to hit the target might miss the original motivation altogether."

The benchmark has "Abstract" in the title for a good reason. It is not supposed to be beaten through brute-force methods and sampling multiple answers, hoping one of them will work.

Pretty easy to avoid these pitfalls in my future evaluation projects.

1. Add $1 million in prize money

2. Name it AGI

Hah. Nice piece!

In Chollet's defence, they had to build popularity for the competition and the challenge, hence most likely the AGI naming and the promotion as "a measure of intelligence". Chollet himself has pointed out that ARC is a necessary but not sufficient measure of intelligence, that it is certainly not a perfect test, and that it can therefore very likely be solved through some sort of hacking or brute force. Their plan does indeed seem to be to develop the test further and continue with the annual competitions.

> if ARC can be solved largely via brute-force domain-specific search rather than via more general and humanlike abstraction

I expect that >85% on the semi-private evaluation set of ARC will first be achieved by a near-SOTA multimodal LLM using some domain-specific search (perhaps similar to my approach, but possibly doing search over some object other than programs, e.g. English-language descriptions of the rule or some DSL).

(That is, I expect this is how this happens if it happens in the next 4 years with around 50% probability.)

Then, some of the credit will go to the language model being better at determining the rule (good enough to get the hardest problems perhaps 1/1,000 times or 1/10,000 times) and some credit will go to the part doing searching and checking.

Probably at this point, the language model will be good enough that you'll be able to get it to get around maybe 30 to 50% of the questions right on one pass through using relatively general purpose methods (e.g. general purpose agent scaffolding like what would have to be used for this sort of benchmark: https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/).
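
To make the "1/1,000 times" figure concrete, here is a rough sketch of how per-sample accuracy and the number of samples trade off. It assumes the samples are independent and that the checking step reliably recognizes a correct answer; neither is guaranteed in practice, and the function name `p_at_least_one_correct` is just illustrative.

```python
def p_at_least_one_correct(p_single: float, n_samples: int) -> float:
    """Chance that at least one of n independent samples is correct.

    Assumption: every sample succeeds with the same probability p_single,
    independently of the others.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability in [0, 1]")
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    # Complement of "every single sample is wrong".
    return 1.0 - (1.0 - p_single) ** n_samples


if __name__ == "__main__":
    for p in (1 / 1_000, 1 / 10_000):
        for n in (1_000, 10_000):
            print(f"p={p:g}, samples={n}: {p_at_least_one_correct(p, n):.3f}")
```

Under these assumptions, a model that finds the rule once in 1,000 tries needs about 1,000 samples to reach roughly a 63% chance of producing at least one correct candidate, which is why the search-and-check part gets a share of the credit.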

Very good post, thank you. As an aside, Charles Goodhart and I are friendly; we discuss central banking from time to time, and he even blurbed one of my books, Out of Crisis: Rethinking Our Financial Markets. He will be delighted to see this use of Goodhart's Law. I will let him know. Keep up the good work!

Nice post, Mel! Out of interest, has anyone used the training set for Captchas?

"My interest in ARC is its supposed ability to test humanlike abstraction (and understanding of core knowledge concepts), and it’s not clear to me that this method actually does any abstraction."

I think you can be much firmer in your position.

There is absolutely no abstraction shown at all by the creation of thousands of different Python programs. There is no reasoning by the LLM.

This approach was purely an exercise in what used to be called genetic engineering. Produce vast numbers of slightly different models and slowly develop something new that works.

However, it doesn't show reasoning by the LLM, only enormous amounts of prompt engineering and effort by a human to force the recalcitrant LLM into some weird actions. Pointless and a con.

> genetic engineering

Do you by any chance mean "genetic programming"? (I.e., the field of AI where we create programs using evolution-inspired algorithms.) I think GP is a bit different from the data-augmentation + LLM approaches described above. But yes, typically GP does involve a lot of generate-and-test.

>if ARC can be solved largely via brute-force domain-specific search rather than via more general and humanlike abstraction

I don't get how ARC can be solved via brute-force search. You would need to be able to recognize a solution when you had it, and a non-solution when you didn't. The arcprize website says the solutions are unique, but it does not give a mathematical definition (or a "testing oracle") for solutions. Rather, the unique solutions are undefined (unless I miss something).

If you see how ARC can be solved by brute force could you please explain the testing oracle, or something else I am missing? Thanks!

One way it might be "solved" via brute-force search is (similar to the method of Greenblatt described in the post) to use an LLM to generate huge numbers of Python programs that transform an input grid to an output grid, use some human-designed heuristics for prompting the LLM and for searching this program space, test the programs on the demonstration transformations to see if they work, and then choose the program that both works on the demonstrations and fulfills some heuristics (e.g., shorter program is better), and run that program on the test input. Such an approach might eventually get to 85%, which is the ARC prize definition of "solved", but would not be a general approach to "abstraction and reasoning".
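
A toy sketch of that generate-test-select loop, to show where the "oracle" comes from: the demonstration pairs act as a partial check, and a heuristic breaks ties. Everything here is illustrative and assumed, not Greenblatt's actual code: the candidate strings stand in for LLM-generated programs, the grids are made up, and the only heuristic is "shortest source wins".

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Demo = Tuple[Grid, Grid]  # (input grid, expected output grid)

# Hypothetical stand-ins for programs an LLM might propose.
CANDIDATE_SOURCES = [
    "lambda g: g",                                    # identity
    "lambda g: [list(col) for col in zip(*g)]",       # transpose
    "lambda g: [list(reversed(row)) for row in g]",   # mirror (verbose)
    "lambda g: [row[::-1] for row in g]",             # mirror (short)
    "lambda g: g[",                                   # broken program
]

# Made-up demonstration pairs: each row is mirrored left to right.
DEMOS: List[Demo] = [
    ([[1, 2], [3, 4]], [[2, 1], [4, 3]]),
    ([[5, 0, 0]], [[0, 0, 5]]),
]


def compile_candidate(source: str) -> Optional[Callable[[Grid], Grid]]:
    """Turn source text into a callable; None if it does not compile.

    eval is used only for this toy; real generated code needs sandboxing.
    """
    try:
        fn = eval(source)
    except Exception:
        return None
    return fn if callable(fn) else None


def fits_demonstrations(fn: Callable[[Grid], Grid], demos: List[Demo]) -> bool:
    """True if fn reproduces every demonstration output exactly."""
    for grid_in, grid_out in demos:
        try:
            if fn(grid_in) != grid_out:
                return False
        except Exception:
            return False  # a crashing program counts as a failure
    return True


def solve(sources: List[str], demos: List[Demo], test_input: Grid) -> Optional[Grid]:
    """Run the shortest candidate that fits all demos on the test input."""
    working = []
    for source in sources:
        fn = compile_candidate(source)
        if fn is not None and fits_demonstrations(fn, demos):
            working.append((len(source), source, fn))
    if not working:
        return None
    # Heuristic: prefer the shortest source; ties broken alphabetically.
    _, _, best = min(working, key=lambda item: item[:2])
    return best(test_input)


if __name__ == "__main__":
    print(solve(CANDIDATE_SOURCES, DEMOS, [[7, 8, 9]]))
```

Note that passing the demonstrations does not prove a program matches the intended rule; two programs can agree on the demos and disagree on the test input, which is why the selection heuristics matter.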

This is a fantastic and thought-provoking post tying together a lot of interesting ideas and theories. Thank you for compiling all this intel in one place. We seem to be quite a ways away from AGI. An important reminder that human methodology can influence testing as much as the bleeding edge of research can.

Yes, the contestants may well lose the actual point of intelligence, which is actually dealing with concepts. NNs still don't as far as I'm concerned. https://davidhsing.substack.com/p/llms-and-generative-ai-dont-deal

we’re going to be playing the game where, like in Scooby-Doo, we pull off the “reasoning” mask and reveal “tree search hardcoded into model architecture” for a while yet, I think.

> It turned out that Shlegeris and Greenblatt were confusing the public evaluation set with the private one.

We weren't confusing it exactly, though I was mistaken about the relative difficulty. (I was certainly aware that these are different sets.) As I noted with a footnote in the original post:

> Also, prior results are evaluated on a private test set, though this shouldn’t matter much since it’s supposed to be basically IID with the public test set.

I just assumed these sets were IID or practically IID, as that is what the ARC Prize website seemed to imply. This has since been clarified.

Further context is here: https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50-sota-on-arc-agi-with-gpt-4o?commentId=XheMa2FNjjjdQGSqJ

Thanks for replying. My comment about "confusing the public and private evaluation sets" came from Buck's tweet on the results: "Test set: 51% vs prior SoTA of 34% (human baseline is unknown)". Of course, the 34% was on the *private* set, so it was not the right comparison. The same goes for calling 50% "SoTA". But I see what you mean about assuming they should have approximately the same difficulty. I'm not sure that Chollet intended the public and private test sets to be "IID"; in any case, I'm not sure what IID would even mean for this challenge.

> in any case, I'm not sure what IID would even mean for this challenge.

If problems can be made highly diverse, then generating 600 problems via the same process and then randomly dividing between public/private should be fine. (And this is what I think benchmark creators should typically do: have a public split and private split which are IID.)

In some cases, it is hard/costly to make a large number of highly diverse problems (some problems would be too similar such that you are worried about effective contamination). In this case, one reasonable approach is to group problems into "families" of similar problems where each family is sufficiently distinct from the other families that you aren't worried about contamination, but the problems within the family would have too much overlap. Then, you randomly divide these families between public/private.

(There is probably a standard term for this task families thing in the statistics literature, but I'm not familiar enough to know it.)

One way to imagine this is to imagine that we randomly sample N problem families from the space of task families. But, if we only made one problem per family, our measurement of performance would be too high variance (or possibly we wouldn't have enough training data). So, we instead make many problems per family to reduce variance further as this is cheaper than making more totally distinct families.
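
A minimal sketch of the family-level split, assuming each problem already carries a family label (how to assign families is the hard part and is not shown). The names `split_by_family`, `public_fraction`, and the IDs are made up for illustration; the idea resembles what scikit-learn calls grouped splitting (e.g. `GroupShuffleSplit`), though only the standard library is used here.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_family(
    problems: List[Tuple[str, str]],  # (problem_id, family_id) pairs
    public_fraction: float = 0.5,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Randomly assign whole families to the public or private split.

    No family ever straddles the two splits, so near-duplicate problems
    cannot leak from the public set into the private one.
    """
    families: Dict[str, List[str]] = defaultdict(list)
    for problem_id, family_id in problems:
        families[family_id].append(problem_id)

    # Sort before shuffling so the result depends only on the seed.
    family_ids = sorted(families)
    random.Random(seed).shuffle(family_ids)

    n_public = round(len(family_ids) * public_fraction)
    public = [p for f in family_ids[:n_public] for p in families[f]]
    private = [p for f in family_ids[n_public:] for p in families[f]]
    return public, private


if __name__ == "__main__":
    # Twelve hypothetical problems in four families of three.
    problems = [(f"p{i}", f"fam{i // 3}") for i in range(12)]
    public, private = split_by_family(problems)
    print("public: ", public)
    print("private:", private)
```

With one problem per family this reduces to the plain random split described in the first paragraph.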

Yep, I agree the tweet is incorrect (or at the very least, misleading). Sorry about that.

Goodhart's Law. Educators take note: it applies equally to the practice of teaching to the tests.

The word intelligent is misleading. What is being pursued is thinking. Thinking isn’t taking something that already exists and rearranging it in new or different or better ways. These are all examples of changes. Everything in the universe changes all the time. Rocks change. Given even an infinite amount of time, rocks are not going to change into thinking rocks. No amount of time and change will produce thinking. Conflating change with thinking keeps this illusion of a pursuit going.

Now of course there’s nothing wrong with change. Change will go on with or without humanity. On a human scale change is required.

What all the hubris about imitating an act of creation comes down to is recognition that change and creation are distinct. Creation comes from nothing while change is a rearrangement of something that already exists, that already was brought into existence.

Computing and the math behind it are like clever arrangements of falling dominos. You can learn much about the created by playing with it. What you can never learn from playing with it is how it came to be in the first place.

The problem of thinking about thinking is that it’s not about change but rather about bringing forth from nothing.

The problem with thinking about nothing is that nothing as real nothing is unthinkable. Real Nothing is ineffable. Is beyond languaging.

What most people call nothing is actually a something. Shoving stuff aside to make a hole isn’t nothing. That’s a nothing that is actually a something. Real nothing is unthinkable. Good luck creating a math around actual nothing and getting it to run on anything. Good luck even thinking about thinking about that.

What’s clear is the lack of clarity about what constitutes thinking. We’re devoting many resources and aiming at… Gee …, … hmmm …nothing?

We humans take ourselves much too seriously and not nearly seriously enough.

> The better one is at abstraction, the less search one has to do.

I think I understand this intuitively, but is the idea that being able to abstract somehow constrains the search space? e.g. "a new thing x looks like y which I've seen before, so I only have to search the smaller space related to y"?

Yes, or "I recognize that new thing x as being a kind of y", or "x is a y-situation". E.g., I'm driving and recognize the complicated scene in front of me as a "road construction" situation, which I have some prior knowledge how to deal with.

> The better one is at abstraction, the less search one has to do.

I have difficulty agreeing with this. I don’t think you do less search; you just do it faster, either because you are more efficient at searching or because you managed to “cache” a larger set of working heuristics in your memory (which is another way of being more efficient).

It is not surprising that LLMs cannot solve these challenges. An LLM has a totally different architecture: it lacks any kind of spatial skills or reasoning abilities. LLMs function best when solving problems by imitation, and even then they will likely get the details wrong without additional modeling or error checking.
