In this post I’m going to go into the weeds, describing how some people are trying to win a big $$$ prize for solving a still-wide-open AI challenge, the “Abstraction and Reasoning Corpus,” and what it all means.
For me, the most important part is this: "when people focus on a specific target (e.g., top score on the ARC private evaluation set), they sometimes lose sight of what the benchmark is actually trying to measure (e.g., capacities for few-shot abstraction based on core knowledge concepts), and the methods developed to hit the target might miss the original motivation altogether."
The benchmark has "Abstraction" in the title for a good reason. It is not supposed to be beaten through brute-force methods and sampling multiple answers, hoping one of them will work.
Pretty easy to avoid these pitfalls in my future evaluation projects.
1. Add $1million in prize money
2. Name it AGI
Hah. Nice piece!
In Chollet's defence, they had to gain popularity for the competition and the challenge, which most likely explains the AGI naming and the promotion as "a measure of intelligence". Chollet himself has pointed out that ARC is a necessary but not sufficient measure of intelligence, that it is certainly not a perfect test, and that it can very likely be solved through some sort of hacking or brute force. Their plan does seem to be to develop the test further and continue with the annual competitions.
> if ARC can be solved largely via brute-force domain-specific search rather than via more general and humanlike abstraction
I expect that >85% on the semi-private evaluation set of ARC will first be achieved by a near-SOTA multimodal LLM using some domain-specific search (perhaps similar to my approach, but possibly doing search over some object other than programs, e.g. English-language descriptions of the rule or some DSL).
(That is, I expect this is how this happens if it happens in the next 4 years with around 50% probability.)
Then, some of the credit will go to the language model being better at determining the rule (good enough to get the hardest problems perhaps 1/1,000 times or 1/10,000 times) and some credit will go to the part doing searching and checking.
Probably at this point, the language model will be good enough that it will get maybe 30 to 50% of the questions right in one pass using relatively general-purpose methods (e.g. general-purpose agent scaffolding like what would have to be used for this sort of benchmark: https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/).
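To make the "searching and checking" part concrete, here is a minimal sketch of generate-and-check on ARC-style grids. Everything in it is an assumption for illustration: the tiny fixed list of candidate transformations stands in for programs that a language model would propose, and the names `CANDIDATES` and `search_and_check` are hypothetical, not taken from any actual submission.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]
Program = Callable[[Grid], Grid]


def identity(g: Grid) -> Grid:
    return [list(row) for row in g]


def flip_horizontal(g: Grid) -> Grid:
    # Mirror each row left-to-right.
    return [list(reversed(row)) for row in g]


def flip_vertical(g: Grid) -> Grid:
    # Reverse the order of the rows.
    return [list(row) for row in reversed(g)]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


# Hypothetical stand-in for "candidate programs sampled from an LLM".
CANDIDATES: Dict[str, Program] = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}


def search_and_check(
    train_pairs: List[Tuple[Grid, Grid]], candidates: Dict[str, Program]
) -> List[str]:
    """Keep only the candidates that reproduce every training output."""
    return [
        name
        for name, program in candidates.items()
        if all(program(x) == y for x, y in train_pairs)
    ]


# Two demonstration pairs that share one hidden rule.
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5, 6]], [[6, 5, 4]]),
]

survivors = search_and_check(train_pairs, CANDIDATES)
test_input = [[7, 8], [9, 0]]
prediction = CANDIDATES[survivors[0]](test_input) if survivors else None
print(survivors, prediction)
```

In this toy setup the checker does all the filtering; a better proposer simply means fewer candidates are needed before a correct one shows up, which is the credit split described above.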
Very good post, thank you. As an aside, Charles Goodhart and I are friendly; we discuss central banking from time to time, and he even blurbed one of my books, Out of Crisis: Rethinking Our Financial Markets. He will be delighted to see this use of Goodhart's Law. I will let him know. Keep up the good work!
Nice post, Mel! Out of interest, has anyone used the training set for Captchas?
"My interest in ARC is its supposed ability to test humanlike abstraction (and understanding of core knowledge concepts), and it’s not clear to me that this method actually does any abstraction."
I think you can be much firmer in your position.
There is absolutely no abstraction shown at all by the creation of thousands of different Python programs. There is no reasoning by the LLM.
This approach was purely an exercise in what used to be called genetic engineering. Produce vast numbers of slightly different models and slowly develop something new that works.
However, it doesn't show reasoning by the LLM, only enormous amounts of prompt engineering and effort by a human to force the recalcitrant LLM into some weird actions. Pointless and a con.
> genetic engineering
Do you by any chance mean "genetic programming"? (i.e., the field of AI where we create programs using evolution-inspired algorithms.) I think GP is a bit different from the data-augmentation + LLM approaches described above. But yes, typically GP does involve a lot of generate-and-test.
This is a fantastic and thought-provoking post tying together a lot of interesting ideas and theories. Thank you for compiling all this intel in one place. We seem to be quite a ways away from AGI. An important reminder that human methodology can influence testing as much as the bleeding edge of research can.
Yes, the contestants may well lose the actual point of intelligence, which is actually dealing with concepts. NNs still don't as far as I'm concerned. https://davidhsing.substack.com/p/llms-and-generative-ai-dont-deal
we’re going to be playing the game where, like in Scooby-Doo, we pull off the “reasoning” mask and reveal “tree search hardcoded into model architecture” for a while yet, I think.
> It turned out that Shlegeris and Greenblatt were confusing the public evaluation set with the private one.
We weren't confusing it exactly, though I was mistaken about the relative difficulty. (I was certainly aware that these are different sets.) As I noted with a footnote in the original post:
> Also, prior results are evaluated on a private test set, though this shouldn’t matter much since it’s supposed to be basically IID with the public test set.
I just assumed these sets were IID or practically IID, as that is what the ARC Prize website seemed to imply. This has since been clarified.
Further context is here: https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50-sota-on-arc-agi-with-gpt-4o?commentId=XheMa2FNjjjdQGSqJ
Thanks for replying. My comment about "confusing the public and private evaluation sets" came from Buck's tweet on the results: "Test set: 51% vs prior SoTA of 34% (human baseline is unknown)". Of course, the 34% was on the *private* set, so it is not the right comparison. The same goes for calling 50% "SoTA". But I see what you mean about assuming they should have approximately the same difficulty. I'm not sure that Chollet intended the public and private test sets to be "IID"; in any case, I'm not sure what IID would even mean for this challenge.
> in any case, I'm not sure what IID would even mean for this challenge.
If problems can be made highly diverse, then generating 600 problems via the same process and then randomly dividing between public/private should be fine. (And this is what I think benchmark creators should typically do: have a public split and private split which are IID.)
In some cases, it is hard/costly to make a large number of highly diverse problems (some problems would be too similar such that you are worried about effective contamination). In this case, one reasonable approach is to group problems into "families" of similar problems where each family is sufficiently distinct from the other families that you aren't worried about contamination, but the problems within the family would have too much overlap. Then, you randomly divide these families between public/private.
(There is probably a standard term for this task families thing in the statistics literature, but I'm not familiar enough to know it.)
One way to picture this is that we randomly sample N problem families from the space of task families. But if we only made one problem per family, our measurement of performance would have too much variance (or possibly we wouldn't have enough training data). So we instead make many problems per family to reduce variance further, as this is cheaper than making more totally distinct families.
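A minimal sketch of the family-level split described above, under stated assumptions: each problem already carries a family label, and the function name `split_by_family`, the `public_fraction` parameter, and the example IDs are all hypothetical. The point is only that the random division happens over families, so no family straddles the public/private boundary.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_family(
    problems: List[Tuple[str, str]],
    public_fraction: float = 0.5,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split (problem_id, family_id) pairs into public and private sets.

    Whole families are assigned to one side, so similar problems
    never end up on both sides of the split.
    """
    families: Dict[str, List[str]] = defaultdict(list)
    for problem_id, family_id in problems:
        families[family_id].append(problem_id)

    # Shuffle family IDs (not individual problems) with a fixed seed.
    family_ids = sorted(families)
    random.Random(seed).shuffle(family_ids)

    n_public = round(len(family_ids) * public_fraction)
    public = [p for f in family_ids[:n_public] for p in families[f]]
    private = [p for f in family_ids[n_public:] for p in families[f]]
    return public, private


# Illustrative data: 6 families with 3 problems each.
problems = [(f"fam{f}-prob{i}", f"fam{f}") for f in range(6) for i in range(3)]
public, private = split_by_family(problems)

family_of = dict(problems)
public_families = {family_of[p] for p in public}
private_families = {family_of[p] for p in private}
print(len(public), len(private), public_families & private_families)
```

If problems were instead shuffled individually, near-duplicates from the same family could land on both sides, which is the contamination worry raised above.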
Yep, I agree the tweet is incorrect (or at the very least, misleading). Sorry about that.
Goodhart's Law. Educators take note: it applies equally to the practice of teaching to the tests.
The word intelligent is misleading. What is being pursued is thinking. Thinking isn’t taking something that already exists and rearranging it in new or different or better ways. These are all examples of changes. Everything in the universe changes all the time. Rocks change. Given even an infinite amount of time, rocks are not going to change into thinking rocks. No amount of time and change will produce thinking. Conflating change with thinking keeps this illusion of a pursuit going.
Now of course there’s nothing wrong with change. Change will go on with or without humanity. On a human scale change is required.
What all the hubris about imitating an act of creation comes down to is recognition that change and creation are distinct. Creation comes from nothing while change is a rearrangement of something that already exists, that already was brought into existence.
Computing and the math behind it are like clever arrangements of falling dominos. You can learn much about the created by playing with it. What you can never learn from playing with it is how it came to be in the first place.
The problem of thinking about thinking is that it’s not about change but rather about bringing forth from nothing.
The problem with thinking about nothing is that nothing as real nothing is unthinkable. Real Nothing is ineffable. Is beyond languaging.
What most people call nothing is actually a something. Shoving stuff aside to make a hole isn’t nothing. That’s a nothing that is actually a something. Real nothing is unthinkable. Good luck creating a math around actual nothing and getting it to run on anything. Good luck even thinking about thinking about that.
What’s clear is the lack of clarity about what constitutes thinking. We’re devoting a lot of resources and aiming at… Gee… hmmm… nothing?
We humans take ourselves much too seriously and not nearly seriously enough.
> The better one is at abstraction, the less search one has to do.
I think I understand this intuitively, but is the idea that being able to abstract somehow constrains the search space? e.g. "a new thing x looks like y which I've seen before, so I only have to search the smaller space related to y"?
Yes, or "I recognize that new thing x as being a kind of y", or "x is a y-situation". E.g., I'm driving and recognize the complicated scene in front of me as a "road construction" situation, which I have some prior knowledge of how to deal with.
> The better one is at abstraction, the less search one has to do.
I have difficulty agreeing with this. I don’t think you do less search, you just do it faster, either because you are more efficient in searching, or because you managed to “cache” a larger set of working heuristics in your memory (which is another way of being more efficient).
It is not surprising that LLMs cannot solve these challenges. An LLM has a totally different architecture. It lacks any kind of spatial skills or reasoning abilities. An LLM functions best when solving problems by imitation, and even for those it will likely get the details wrong without additional modeling or error checking.
This is a great post; your explanation really helped to distill down Greenblatt's post, which I found somewhat opaque.
On the point around comparing compute-heavy approaches to ARC with chess models, I generally agree with the sentiment, but if we assume that because of hardware, software, and model optimizations, the cost of compute for GPT-4 capable models continues to go down, the same results can be achieved with less.
While the model may still not be abstracting in the same way humans do, if this type of "learning" can be replicated across many abstract domains, doesn't that get to a place close to what people would label AGI? Presumably, it's not enough for reaching something where models are creating novel research, but it seems like this scaled up across many domains has the potential to make a model that's generally good at this type of problem-solving?