My assumption is that mental imagery is a critical capability that humans use for many of the problems you describe. As long as LLMs only process language, their only hope is to find associations to language descriptions of similar problems and fake it. One would hope that someday these systems would be extended to have mental imagery components (as found in cognitive architectures), not to mention episodic memory!
Strong agree!
In my symbolic multilingual model (t.me/thematrixcom) I just use images in other languages to describe everything.
I enjoyed this article a lot, thank you! Since GPT-4 was released after its publication, I thought I'd try the stack-of-items prompt – and unlike the GPT-3.x models, GPT-4 does in fact pass. I won't try to draw far-ranging conclusions from that one data point, but it seems a meaningful one nonetheless.
Q: Imagine a stack of items, arranged from bottom to top: cat, laptop, television, apple. The apple is now moved to the bottom. What items is on top of the stack now?
A: If the apple is moved to the bottom of the stack, the new arrangement from bottom to top would be: apple, cat, laptop, television. The item on top of the stack now is the television.
(Bard gets it hilariously wrong...ish? "The items on top of the stack now are the cat and the laptop. The apple was originally on the top of the stack, but it was moved to the bottom. This means that the cat and the laptop are now on top of the stack.")
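For reference, the stack prompt above reduces to a small list operation. Here is a minimal Python sketch of it; the function name `move_to_bottom` is purely illustrative and not taken from the article, and the list is assumed to be ordered bottom to top as in the prompt.

```python
def move_to_bottom(stack, item):
    """Return a new bottom-to-top stack with `item` moved to the bottom."""
    if item not in stack:
        raise ValueError(f"{item!r} is not in the stack")
    # Keep the relative order of everything else unchanged.
    rest = [x for x in stack if x != item]
    return [item] + rest


# Stack from the prompt, listed from bottom to top.
stack = ["cat", "laptop", "television", "apple"]
new_stack = move_to_bottom(stack, "apple")
top = new_stack[-1]  # last element is the top of the stack

print(new_stack)
print(top)
```

This agrees with GPT-4's answer (television) and not with Bard's.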
Just came here and wrote the same :)
Good post!
"Metaphors We Live By" discussed the use of concepts like "on top of". The authors come very close to saying that metaphors serve to handle abstract concepts in terms of more concrete ones.
In my model of intelligence (described here - https://alexandernaumenko.substack.com/), I call for switching focus from objects and actions to properties - measurable and comparable. It is because spatial organization is so easy to handle that we apply "on top of" to other situations where spatial organization, strictly speaking, is not applicable - people are not physically on top of each other in hierarchies.
I don't do perception and I am still working out how to recognize analogies. But my model is already capable of making some noise. The only thing left is to implement, train, and test it. Note that the training is different from that of traditional ML models. It is more similar to training kids.
Hi Melanie! Having a body and using it to non-symbolically perceive the pattern/shape/arrangement is what helps us do it, no? We don't need to form arrays of the sample shapes and reason that the array is reversed.
And that comes from our comparison with a self-organising right structure of any entity that is integrated in it to be existing and living - a universal principle of life cycle or transformation of energy that works everywhere.
Nice!
In your newsletter, you pointed out noteworthy limitations of AI, especially when it is compared with how humans reason. You posited that machines have been made to process language, solve problems, and learn just like we do. This is tremendous progress by the AI community, and improvements continue to be churned out every month.
I agree that machines do not fully have the intelligence that humans have. This is especially true in forming concepts and abstractions. This is largely attributable to the programmed nature of AI, but it is understandable given that AI is still learning.
But, despite its immense mimicking of humans, AI is not yet intelligent enough to scan an environment and respond with the highest degree of precision and specificity. Humans often do this with ease.
However, AI does not have the flexibility of humans in situational thinking. This is what I encountered recently. Check my take on AI and how I interacted with it.
https://open.substack.com/pub/thestartupglobal/p/my-encounter-with-ai-assisted-chatbot?utm_source=direct&r=m5mq1&utm_campaign=post&utm_medium=web
I think without a world model there is no real way for the AI to get the "on top of" example right. Because it's the understanding of the world that allows humans to nuance that.
Same with other abstractions. There is so much sensory input that evolution needed us to abstract information in a way we can then use and learn from. AI as it stands right now seems limited in how much progress it can make here.
It's an interesting thought, but the GPT4 output seems to prove this is wrong.
Great article.
In humans (and animals), the concept of "on the top of" is formed non-verbally; language is used primarily to operate with existing concepts and relationships between them to exchange information with other subjects.
Thank you - this is intriguing. But I (and others) have observed a surprising ability of ChatGPT to get spatial concepts right. Here is an example:
----------
Human: Please write the source code for a simple svg that shows a television on top of a cat.
ChatGPT: As an AI language model, I don't have the capability to generate images or create SVGs on my own. However, I can provide you with an example SVG code for a television on top of a cat. You can use this code as a starting point and modify it as per your needs.
<svg viewBox="0 0 500 500" xmlns="http://www.w3.org/2000/svg">
<rect x="0" y="0" width="500" height="500" fill="#fff" />
<rect x="100" y="100" width="300" height="250" fill="#000" />
<rect x="120" y="120" width="260" height="210" fill="#fff" />
<rect x="140" y="140" width="220" height="170" fill="#000" />
<rect x="160" y="160" width="180" height="130" fill="#fff" />
<rect x="200" y="200" width="100" height="50" fill="#000" />
<circle cx="250" cy="375" r="75" fill="#f60" />
<rect x="225" y="325" width="50" height="50" fill="#000" />
</svg>
In this code, the rectangle shapes are used to create the television, while the circle and rectangle shapes are used to create the cat. You can further customize the SVG by changing the colors, sizes, and shapes of the different elements to match your requirements.
-----
You can copy the SVG code into a text document, save it with the extension .svg, and most browsers should be able to display a set of nested boxes (TV) on top of an orange circle (tabby cat?) with a black head. That's the thing. When you repeat it, it comes up with different solutions - but the TV is on top of the cat.
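If you'd rather check the layout than eyeball it, here is a rough Python sketch that parses the SVG above and compares vertical extents. Which shapes belong to the TV and which to the cat is my assumption, based on ChatGPT's own description (the first rectangle is the white background, the next five are the TV, the circle and last rectangle are the cat). Note that in SVG the y coordinate grows downward, so "above" means a smaller y.

```python
import xml.etree.ElementTree as ET

SVG = """<svg viewBox="0 0 500 500" xmlns="http://www.w3.org/2000/svg">
<rect x="0" y="0" width="500" height="500" fill="#fff" />
<rect x="100" y="100" width="300" height="250" fill="#000" />
<rect x="120" y="120" width="260" height="210" fill="#fff" />
<rect x="140" y="140" width="220" height="170" fill="#000" />
<rect x="160" y="160" width="180" height="130" fill="#fff" />
<rect x="200" y="200" width="100" height="50" fill="#000" />
<circle cx="250" cy="375" r="75" fill="#f60" />
<rect x="225" y="325" width="50" height="50" fill="#000" />
</svg>"""

NS = "{http://www.w3.org/2000/svg}"


def vertical_extent(el):
    """Return (top, bottom) of a rect or circle in SVG user units."""
    tag = el.tag.replace(NS, "")
    if tag == "rect":
        y, h = float(el.get("y")), float(el.get("height"))
        return (y, y + h)
    if tag == "circle":
        cy, r = float(el.get("cy")), float(el.get("r"))
        return (cy - r, cy + r)
    raise ValueError(f"unsupported shape: {tag}")


def group_extent(elements):
    """Return the (top, bottom) span covering all given shapes."""
    extents = [vertical_extent(e) for e in elements]
    return (min(t for t, _ in extents), max(b for _, b in extents))


def center(extent):
    return (extent[0] + extent[1]) / 2


shapes = list(ET.fromstring(SVG))
# Assumed grouping: shapes[0] background, shapes[1:6] TV, shapes[6:8] cat.
tv = group_extent(shapes[1:6])
cat = group_extent(shapes[6:8])

# Smaller y is higher on screen, so the TV is above if its center has smaller y.
tv_is_above = center(tv) < center(cat)
# How many units the two spans overlap vertically (0 if they do not touch).
overlap = max(0.0, min(tv[1], cat[1]) - max(tv[0], cat[0]))

print(tv, cat, tv_is_above, overlap)
```

Under that grouping the TV spans y = 100 to 350 and the cat y = 300 to 450, so the TV sits above the cat and overlaps its top by 50 units.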
You can easily try similar experiments yourself. There is a poster on Mastodon who has produced quite a few other examples of stylized images; I myself have had ChatGPT draw a boat on the ocean with a flock of birds and clouds in front of the sun. The brown boat floats on the blue rectangle of the ocean, partly immersed. Sun, clouds and birds are where they should be - in the sky. I can get you the code if you like, and the exact prompt so you can try it for yourself.
Why are such scenes of interest? I would argue that ANY spatially reasonable SVG ought to be beyond the abilities of the AI. An SVG is a linear representation that is not parsed when it is used as training data for an LLM. It may have learned how to produce spatial relationships just as it learned other ways of programming - but abstracting and sequencing the task is something it appears to be capable of.
So here we are, it draws, and the images it produces are different every time, i.e. not themselves learned.
As a molecular biologist in my other life (when I don't write on AI in academia on Substack), I would argue that the models are so large that it would be much more surprising if we did NOT find emergent abilities than that we are seeing things that are hard to reconcile with the contents of the training data. I hope this makes sense; I'd be happy to elaborate otherwise.
Very interesting!!
I'm glad to see this, ARC also seems fundamentally interesting to me, and I've been vaguely disappointed that more "serious research groups" haven't dug into it. Curious to see your approach!
Many thanks for this.
While reading one of your more recent articles, I found this interesting article from over 8 months ago! One thing I like to do is to test the latest popular versions of LLMs on problems that earlier LLMs had failed at, to see what progress has occurred, if any.
So I tested some of your example problems on the current Bard (from Google) and Bing Chat (which is based on OpenAI’s GPT-4). Further, to prevent them from simply reporting an answer they may have found by reading your paper and others on the internet, I generally changed the names used in the problem. (Although I noted you were careful not to say what the answers were.)
Figure 1 problem: I changed the prompt to “draw a computer on a dog”. Eight months later, Bing Chat, which uses GPT-4, is able to use the new DALL-E 3 to create pictures. However, it still refused to put a big physical object on a pet. It created 4 pictures with a dog sitting next to or on top of a laptop. In my second prompt I highlighted that I had asked for “a computer ON TOP of a dog”. It apologized and tried again, but all 4 pictures still had the dog on top of the laptop keyboard. I suspect this is because that is what DALL-E found to be most common in its training data.
I tried a third time, changing the prompt to be more explicit: “Draw a computer sitting on the back of a dog.” This time 2 out of the 4 pictures showed a laptop on the back of a dog. Interestingly, this time all the pictures were cartoonish, unlike the previous ones, which had shown realistic images of dogs. Also, the dogs were shown “smiling”, unlike the realistic dogs from before. I am guessing that it wants to convey that even though there is a laptop on its back, the dog is OK with that. So it appears we may be seeing the overwhelming power of its previous training and also of its safeguards.
Bard does not yet have the ability to create beautiful pictures. However, it did use text symbols like dashes and slashes to make this cryptic picture:
(__)
(oo) /------\/ / \
/------\/ / | || * *
* / | || ~~ ~~ ||===||
~~ ~~ ||===|| ||
Plus it responded with this text: “The computer is placed on the dog's back, with the screen facing the camera. The dog is sitting down, looking up at the computer screen. It is a cartoonish drawing, with exaggerated features and bright colors.”
Ha! At least it got the idea right that the computer is physically sitting on the dog.
Figure 2 problem: I changed the specific objects that were arranged from bottom to top, but the problem was the same - I moved an item to the bottom and asked which item is on top.
Bing Chat got it right the first time without even bothering to show me its work!
Bard tried to show its work but showed the stacks upside down. So it got the answer wrong when it came to saying what the top item was. I told it that its answer was wrong and to try again. It apologized and gave me the right answer the second time.
Figure 3 problem: I gave the picture from your article with the prompt “Can you describe what you see in this picture?”
Bing Chat gave me a good overall description of what it saw and guessed that it was a puzzle or game. I told it that it was a puzzle and asked it to determine what the figure on the right would look like based on the transformations on the left. Bing Chat guessed that the rule was flipping the pattern along the vertical axis. I told it that was wrong and to try again. But this time it replied “without being able to visually analyze the image, it’s challenging for me to determine the exact transformation rule applied.” I told it that the rule was rotating the pattern clockwise 90 degrees. It replied “without being able to visually depict this, it’s challenging to describe the exact outcome”.
So it is interesting that it appears to at least know its limitations.
Bard gave me a description which did not make much sense to me. It tried to answer the test, but its answer did not make any sense to me either.
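To make the two candidate rules concrete, here is a small Python sketch contrasting a flip along the vertical axis (Bing Chat's first guess) with a 90-degree clockwise rotation (the rule I gave it). The 3x3 grid is a made-up example, not the pattern from the article, and the function names are only illustrative.

```python
def flip_vertical_axis(grid):
    """Mirror the grid left to right (flip along the vertical axis)."""
    return [row[::-1] for row in grid]


def rotate_cw(grid):
    """Rotate the grid 90 degrees clockwise."""
    # Reverse the row order, then read off columns as the new rows.
    return [list(col) for col in zip(*grid[::-1])]


# Hypothetical pattern: 1 = filled cell, 0 = empty cell.
grid = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]

flipped = flip_vertical_axis(grid)
rotated = rotate_cw(grid)

for name, g in (("flipped", flipped), ("rotated", rotated)):
    print(name)
    for row in g:
        print(" ".join(str(cell) for cell in row))
```

For an asymmetric pattern like this one the two rules give different results, which is why the first guess counted as wrong.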
For the Figure 5 problem, the three paragraphs that described the problem were in a picture, not available in text format. So I first asked Bing Chat to write out the text from the picture, but it said it could only do a single small paragraph. I turned to Bard, which had no problem converting the entire picture to text for me! Then I edited the description, changing the color of the objects, before giving it back to Bing Chat and Bard.
Bing Chat once again did not bother showing me its work; it just produced the correct answer.
Bard produced lots of text as it tried to show its thoughts - some of it did not make sense - but it got the correct answer in the end.
So in summary, Bing Chat using GPT-4 had no problem doing the text problems. Bard was almost as good.
However, both struggle with problems involving pictures. The capability to take in pictures is new to both, and the capability to generate images is new to Bing Chat. I suspect we will see much improvement in future versions.
I find it both fun and scary to see the latest LLMs solve problems that the naysayers had previously argued demonstrate that LLMs are not intelligent.
GPT4 does fine on the bottom/top task:
```
Prompt: Imagine a stack of items, arranged from bottom to top: cat, laptop, television, apple. The apple is moved to the bottom. What item is on top of the stack now?
ChatGPT
Answer: After moving the apple to the bottom, the new arrangement of the stack from bottom to top would be: apple, cat, laptop, television. The item on top of the stack now is the television.
```
Oh, you're the founder of a Kubernetes application. Question answered 😂
Lol. Yo, Natan, there is a HUGE difference between understanding and regurgitation. How can you people not fathom the difference between statistical pattern recognition and thought? I mean, look at that prompt. It's a simple list manipulation operation. What did your parents do to you as a child? Were you ever let outdoors?
Dear Melanie, don't spend even one minute of your time on MLLM, this is just pure crap.
True AI researchers have known since the 90s that neural network architectures made of layers of neural gates, as deep as they can be, will never be able to generalize out of the training distribution, not to mention do on-the-fly AGI reasoning.
These deep networks are just trying to be plausible / realistic mirrors of reality, misleading people into thinking they have some reasoning capabilities whereas all they can do is mimic reality.
We can call them stochastic parrots.
What they get right, though, is that knowledge is completely broken down and split into small fragments.
I see that you still reason like a symbolic researcher by talking about concepts as if they were tangible objects in the brain, whereas they are not. What is even more misleading is that the example you give ('on top of') may plant the idea in the reader's mind that it is this very piece of information (<ON TOP OF>) that enters an intelligent system and is linked to different meanings. It is much more likely that this piece of information, along with its context, gets translated right away into internal circuits of information that trigger other internal circuits. In other words, there is no <ON TOP OF> object in the brain, in whatever form of encoding you want: only very distributed routing circuits, activated based on the context in which <ON TOP OF> is used.
My best advice to you, to link sparse, distributed, low-level forms of knowledge with higher abstract reasoning, is to reset your mindset and stop using the words 'concepts', 'objects', or any other old-style thought-blocking terms from the symbolic AI era.
What is needed is something very different, but I won't go any further here as you can imagine.