In a previous post I wrote about the Abstraction and Reasoning Corpus (ARC), an idealized domain created by François Chollet for evaluating abstraction and analogy abilities in machines and humans. My collaborators—Arseny Moskvichev and Victor Odouard—and I have been working on approaches to ARC, and have just put out a paper on some work we have done. (If you aren’t already familiar with ARC, I recommend reading the previous post before reading this one.)
The motivation of this work was to address two issues we found with the original ARC dataset. First, the tasks can be quite difficult, even for humans, and indeed might be too difficult to reveal real progress on abstraction and reasoning in machines. Second, while the tasks were designed to focus on humans’ “core knowledge” (e.g., objects and their interactions, and basic spatial and numerical concepts), they did not systematically test for understanding of these basic concepts. That is, if a program (or human) could solve the task shown below, it’s not obvious that it “understands” basic concepts of numerosity or counting in any deeper way.
It’s possible that the system is using some other strategy to solve this particular task, one that might not generalize to other counting tasks. Indeed, machine learning systems are well known to use “shortcuts” to solve tasks, and such shortcuts typically result in failure to generalize to variations on those tasks.
In a post on the Connectionists mailing list, the computer scientist Tom Dietterich made a useful distinction between “pointwise” understanding—“providing appropriate responses to individual queries”—and “systematic” understanding—“providing appropriate responses across an entire range of queries or situations.” Dietterich noted that “When people complain that an AI system doesn’t ‘truly’ understand, I think they are often saying that while the system can correctly handle many questions/contexts, it fails on very similar questions/contexts. Such a system cannot be trusted to produce the right behavior, in general.”
Or, as François Chollet put it more succinctly in a tweet:
Our paper describes a new benchmark we created in the ARC domain that assesses such systematic understanding of concepts. This benchmark, called ConceptARC, consists of ARC tasks specifically created to instantiate particular concepts. We chose 16 concepts that appear in many of the original ARC tasks, and for each concept we created new tasks that focus specifically on that concept. For example, the picture below illustrates three tasks that are focused on the concept “sameness”.
In (a) the rule is to preserve objects of the same shape; in (b) the rule is to preserve lines of the same orientation; and in (c) each grid is divided into two halves (separated by a gray line): if the halves are identical, the entire grid is reproduced, otherwise only the left half is reproduced (a small code sketch of this last rule appears after the list of concepts below). These examples give the flavor of the kinds of variations that can be made on a given concept. The ARC domain is open-ended enough that an infinite number of possible variations could be made on any given concept, but to either generate or solve a range of interesting and diverse variations, one needs to understand the concept in a humanlike way. Thus, like Chollet with ARC, we manually created all the tasks in this benchmark. The 16 concepts we chose for this benchmark are:
Above and Below
Center
Clean Up
Complete Shape
Copy
Count
Extend to Boundary
Extract Objects
Filled and Not Filled
Horizontal and Vertical
Inside and Outside
Move to Boundary
Order
Same and Different
Top and Bottom 2D
Top and Bottom 3D
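As a concrete illustration of how one of these variations can be pinned down as a rule, here is a minimal sketch of the rule in example (c) above. This is my own illustration, not code from the benchmark: it assumes a grid is a list of lists of integer color codes, that gray is encoded as 5 (ARC’s usual convention), and that the separator is a single vertical gray column.

```python
# A minimal sketch of the rule in example (c), under the assumptions above.

GRAY = 5  # gray's color code in ARC grids

def solve_sameness_c(grid):
    """Return the whole grid if its two halves match, otherwise just the left half."""
    width = len(grid[0])
    # Find the column that is entirely gray -- the separator line.
    sep = next(c for c in range(width) if all(row[c] == GRAY for row in grid))
    left = [row[:sep] for row in grid]
    right = [row[sep + 1:] for row in grid]
    return grid if left == right else left

# Toy example: identical 2x2 halves on either side of a gray column,
# so the rule reproduces the entire grid.
example = [[1, 0, 5, 1, 0],
           [0, 2, 5, 0, 2]]
assert solve_sameness_c(example) == example
```

Solving the actual ConceptARC tasks, of course, requires discovering such a rule from only a few demonstration pairs, not having it handed to you.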
For each of these concepts we created 10 new tasks, each of which had three different test inputs, for a total of 30 test inputs per “concept group”.
Using the crowdsourcing platforms Amazon Mechanical Turk and Prolific, we tested people on the tasks in this benchmark and looked at their accuracy over the variations in a given concept group. Overall, people did very well: on the majority of concepts, over 90% of people solved all of the tests, showing (not surprisingly) that they could generalize well over different variations of a given concept.
We also tested three programs on these tasks: the two highest-scoring programs from the ARC-Kaggle competition (described in my previous post), and OpenAI’s GPT-4 (the publicly available language-only version). The two ARC-Kaggle programs were given text-based descriptions of tasks as input, and GPT-4 was given prompts adapted from these text-based descriptions. (Note that while GPT-4 wasn’t specifically designed or trained for ARC tasks, Webb et al. showed that large language models can solve some problems from idealized analogy domains without any such training.)
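For readers curious what such a text-based description might look like, here is a rough sketch of one plausible way to serialize an ARC task into a prompt. The format below is an assumption for illustration; the exact encodings used by the ARC-Kaggle programs and in our GPT-4 prompts may differ.

```python
# A rough sketch of one plausible text serialization of an ARC task.
# The format here is an assumption for illustration only.

def grid_to_text(grid):
    """Render a grid of integer color codes as rows of space-separated digits."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def task_to_prompt(train_pairs, test_input):
    """Assemble the demonstration input/output pairs plus one test input."""
    parts = []
    for i, (inp, out) in enumerate(train_pairs, start=1):
        parts.append(f"Example {i} input:\n{grid_to_text(inp)}")
        parts.append(f"Example {i} output:\n{grid_to_text(out)}")
    parts.append(f"Test input:\n{grid_to_text(test_input)}")
    parts.append("Test output:")
    return "\n\n".join(parts)

# Tiny made-up task with a single demonstration pair.
train = [([[1, 0], [0, 1]], [[1, 1], [1, 1]])]
print(task_to_prompt(train, [[0, 1], [1, 0]]))
```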
None of these programs comes anywhere near human-level performance on these tasks, and even when a program does solve a particular task in a concept group, it typically cannot generalize well to other tasks instantiating the same concept. In Dietterich’s words, these programs have a degree of “pointwise understanding,” but they lack the kind of systematic understanding required for robust generalization.
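One way to make this distinction concrete (a toy metric of my own, not the analysis from the paper): compare a solver’s accuracy pooled over all individual test inputs with the fraction of concept groups in which it clears a high per-concept bar.

```python
# A toy illustration (not the analysis from the paper) of the difference
# between pointwise and systematic performance.

def pointwise_accuracy(results):
    """Fraction of individual test inputs solved, pooled across concepts."""
    flat = [solved for group in results.values() for solved in group]
    return sum(flat) / len(flat)

def systematic_accuracy(results, threshold=0.8):
    """Fraction of concept groups in which accuracy reaches the threshold."""
    cleared = [sum(group) / len(group) >= threshold for group in results.values()]
    return sum(cleared) / len(cleared)

# Hypothetical results: concept name -> solved/unsolved flags for its test inputs.
results = {
    "Same and Different": [True, True, False, True],
    "Count": [True, False, False, False],
}
print(pointwise_accuracy(results))   # 0.625 -- a respectable pointwise score
print(systematic_accuracy(results))  # 0.0   -- but no concept is handled reliably
```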
It’s important to point out, however, that when solving a task in the ARC domain, a human brings not only their core knowledge about the world but also their highly evolved visual system. The latter is lacking in the various programs we tested. It will be interesting to test the multimodal version of GPT-4 on ARC tasks, once it is made publicly available, but I predict that, while it might perform better on “pointwise understanding”, it still won’t have the combination of visual routines and robust core concepts necessary to solve these tasks in any general way.
If all this has sparked your interest, please take a look at our paper, which gives links to our new dataset of tasks as well as to all the data we collected from humans and machines.