"I think this kind of debate is actually really good for the science of LLMs"
I would agree with that. It is interesting to me that so many clever people are so uncritical about something so important to them. I might put that down to a steady diet of sci-fi and computer code, rather than a deep education about cognition, and indeed the hype cycle that's been used to fund this research and monetize the results. We're probably all guilty of that at some level I think, so perhaps it's best not to throw stones.
The really interesting issue to me is that it turns out the semantics captured by language are so amenable to an analysis just of syntax using statistical patterns. We seem to have found the scale at which this starts to happen, which is very large but not infinite. Yet this is not so surprising I think. The human mind is prodigious, but it's not infinite either.
But I think our minds are doing more than predicting tokens given a massive set of examples. Funnily enough, the question what we are really doing when we think and talk still remains fundamentally unanswered, even if we now know that we do these things using a tractable tool.
It's like when you didn't study for a test the night before so you try to answer multiple choice questions by other tricks. For example, test makers usually put more thought into the wording of the right answer than the wrong ones. Humans are very good at this kind of reasoning. I expect LLMs are even better at it.
Excellent article on one of the key limitations of LLMs (reasoning). The other (IMO) is the extremely shallow internal world model (required for genuine understanding of the real world) that is constructed by the LLM training process. Unless both of these problems (reasoning and understanding) can be robustly resolved, LLM cognition, and therefore the cognition of any agent or robot built on top of it, will be severely limited. It is extremely unlikely (IMO) that any LLM-based system will ever resolve these fundamental problems to the extent required for human-level AGI.
Reasoning cannot be performed in one shot, just as one can't write code in one shot.
There has to be an iterative process. At each step a hypothesis is made. Some validation takes place. Another step takes place. At some point one can realize that this is a dead-end and have to start anew.
A reasoning agent has to have some very good understanding of the environment it is searching. It is not unlike trying to find a treasure in a labyrinth without punching through walls.
It is plausible that a tool like GPT-o1 can be improved enough to reliably work through a problem, with the addition of a formal verifier, or strategies for grounding, checking work, reflection and evaluation.
This will require a lot of domain-specific modeling and lots of examples illustrating the relevant strategies.
o1 Preview didn't have an trouble working through the Kiwi problem with the distractors when I tried it or the toy problem symbolically.
"Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?"
ChatGPT
Thought for 5 seconds
To determine the total number of kiwis Oliver has, we'll calculate the number he picked each day and then sum them up.
Friday: Oliver picked 44 kiwis.
Saturday: He picked 58 kiwis.
Sunday:
He picked double the number he did on Friday.
Number of kiwis picked on Sunday:
2
×
44
=
88
2×44=88 kiwis.
Note: Five of these kiwis were a bit smaller than average, but unless specified otherwise, they are still counted as kiwis.
Total kiwis:
44
(
Friday
)
+
58
(
Saturday
)
+
88
(
Sunday
)
=
190
kiwis
44 (Friday)+58 (Saturday)+88 (Sunday)=190 kiwis
Answer: 190
To solve this problem symbolically, let’s represent each variable in a general equation to find how many bouncy balls were added. Here’s the step-by-step approach:
Given:
The bag of building blocks contains {x} blocks.
The bin of stuffed animals contains {y} stuffed animals.
The tower of stacking rings contains {z} rings.
The total number of toys after adding the bouncy balls is {total}.
Define the variables:
Let x be the number of blocks.
Let y be the number of stuffed animals.
Let z be the number of rings.
Let b be the number of bouncy balls.
Let total be the total number of toys after adding the bouncy balls.
Write the equation: The sum of the initial number of toys plus the number of bouncy balls is equal to the total number of toys:
𝑥
+
𝑦
+
𝑧
+
𝑏
=
total
x+y+z+b=total
Rearrange the equation to solve for b (the number of bouncy balls):
𝑏
=
total
−
(
𝑥
+
𝑦
+
𝑧
)
b=total−(x+y+z)
Summary of the solution:
Equation:
𝑏
=
total
−
(
𝑥
+
𝑦
+
𝑧
)
b=total−(x+y+z)
To find the number of bouncy balls, subtract the sum of {x}, {y}, and {z} from {total}.
This symbolic equation gives you the number of bouncy balls required to reach the total number of toys.
The point is that, yes, these models can solve *some* of these problems, but their reliability is lower when given such irrelevant information, as shown in the paper.
The latest version of 3.5 Sonnet which might be the best model, right now, has no difficulty on any of the problems that were in the body of the paper you displayed.
For me, it makes sense to evaluate criteria against the best models.
Examples:
Let me solve this step by step.
Friday's kiwis:
Oliver picked 44 kiwis
Saturday's kiwis:
Oliver picked 58 kiwis
Sunday's kiwis:
Double Friday's amount: 44 × 2 = 88 kiwis
All 88 kiwis count, even if 5 were smaller than average
Total kiwis:
Friday: 44
Saturday: 58
Sunday: 88
Total = 44 + 58 + 88 = 190 kiwis
BM
To make a call from a hotel room phone, you must pay $0.6 for each minute of your call. After 10 minutes, the price drops to $0.5 per minute. After 25 minutes from the start of the call, the price drops even more to $0.3 per minute. If your total bill is more than $10, you get a 25% discount. How much would a 60-minute call cost
Let me solve this step by step.
First 10 minutes:
10 minutes × $0.6/minute = $6.00
Minutes 11-25 (15 minutes):
15 minutes × $0.5/minute = $7.50
Minutes 26-60 (35 minutes):
35 minutes × $0.3/minute = $10.50
Total before discount:
$6.00 + $7.50 + $10.50 = $24.00
Since the total is more than $10, apply 25% discount:
25% of $24.00 = $6.00
Final cost = $24.00 - $6.00 = $18.00
Therefore, a 60-minute call would cost $18.00.
When {name} watches her {family}, she gets out a variety of toys for him. The bag of building blocks has {x} blocks in it. The bin of stuffed animals has {y} stuffed animals inside. The tower of stacking rings has {z} multicolored rings on it. {name} recently bought a tube of bouncy balls, bringing her total number of toys she bought for her {family} up to {total}. How many bouncy balls came in the tube?
Let me help you solve this algebraic word problem.
To find the number of bouncy balls:
First, let's add up the known toys:
Building blocks: {x}
Stuffed animals: {y}
Stacking rings: {z}
Total of known toys = {x} + {y} + {z}
We know that when we add the bouncy balls:
Known toys + bouncy balls = {total}
{x} + {y} + {z} + bouncy balls = {total}
Therefore:
Bouncy balls = {total} - ({x} + {y} + {z})
Since this is a template with variables, we'd need the specific numbers to calculate the final answer. Would you like me to help solve it with actual numbers?
First things first: I loved the article and I learned a lot in reading it.
I feel in the rush to understand the suddenly-ubiquitous LLM, we may be under-estimating the capacities of non-LLM AI. Or even under-estimating familiar compute functions that we've known for decades.
The simple example is math. Sure we can judge an LLMs ability to solve a single math problem; or it's ability to "reason" from one problem to another. It's an interesting topic as a thought exercise. But in any practical application, we wouldn't do math using a LLM. We'd use a calculator of course!... a fine device with massive math capability, requiring no AI whatsoever.
Similar example: my wife laughed at a Gemini LLM hallucination, in which it provided the correct date and month of our daughter's graduation in the spring.... but supplied the wrong day of the week. Anyone with a calendar can see it's wrong. But. Calendars are available, to us and I assume to Gemini also.
I can't help but think the smart people programming all this stuff will soon figure out how to hook up calculators, calendars, and traditional rules-based functions generally to sense-check and augment LLMs.
More generally: a traditional rules-based algorithm IS literally a series of logic functions which does deduce answers from data based on logic rules, not as something from sci-fi but just because that's what programs do. Maybe the future is a mash-up of compute types...some intuitive/implicit guesswork LLM style, coupled with some hard rules-based logic in traditional code. Without expertise, it feels like my own brain makes this mash-up work on a day-to-day basis.
Great article, Melanie! Thank you for mentioning our paper: https://arxiv.org/pdf/2401.09395. We have just updated it with the latest results using both o1-preview and GPT-4-o. While both models struggle with basic reasoning skills, they are significantly better than GPT-4. This demonstrates that large language models (LLMs) are indeed improving in their reasoning abilities.
In our ontology-based perturbation, one aspect we focus on is "Question Understanding," which includes perturbations like adding distractions. Interestingly, both GPT-4-o and o1 perform exceptionally well in these scenarios, which contrasts with findings from the GSM Symbolic paper. We used the Chain of Thought (CoT) prompt by default, and it appears to be very effective.
We also observed that these models perform considerably worse on coding problems. We constructed these problems by perturbing HumanEval questions, where the models originally scored above 90%. However, their performance dropped by 15-20% on the perturbed questions. For example, while these models can easily convert lowercase letters to uppercase and vice versa, they struggle when asked to flip cases at indices that are multiples of 2. This requires a higher level of reasoning that should be quite basic for a grade-level student familiar with multiplication.
This was wonderful, Melanie, thank you. Aren't we hamstrung by the fact that current LLMs are greedy decoding according to their statistical models? That the only way we get human-like output is by probability tuning them via inference hyper parameters? How can they be viewed as reasonimg when they're governed by these constraints? If the lights aren't on, how can we get around this? Would we have to synthesize data for all possible edge cases and re-train--ROT 1, 2,3...n?
I was reading this (very) recent preprint https://arxiv.org/html/2410.16930v1 and it somehow seems to convey the intuition that perhaps other parameters that encode natural language outweigh contributions made by the parameters that encode these algorithms/at least make the activations noisier. I have no empirical data to support this yet but I think it may be possible that the model has indeed learned the correct algorithm for say sentence reversal or ROT-n but "does this sentence look right" takes precedence. - which is not surprising given the data and objectives we're training them with. I'm curious what you think of these model surgery approaches!
Melanie mentions "deduction, induction, abduction, analogy, common sense, and other ‘rational’ or systematic methods for solving problems" as aspects of reasoning. In my AGI work, the three cognitive primitives are induction (the discovery of patterns), deduction, and abduction (where the latter two are derived from the idea of semantic consequence, arguably the most fundamental concept in logic). Reasoning by analogy may (I believe) be defined as a special case of induction, so if you get induction right it basically comes for free. Generic problem-solving (i.e. "other ‘rational’ or systematic methods for solving problems") may be constructed on top of induction, deduction, and abduction, as may continuous learning and continuous planning. Finally, common sense (knowledge) may be acquired via continuous learning. So basically (in AGI cognition, not necessarily human cognition) there's a hierarchy of cognitive primitives, operations, and processes, but induction, deduction, and abduction are the most fundamental. None of these things are trivial to design and implement, but nevertheless they are all "doable" given sufficient money and effort.
A much larger and fundamental issue almost never addressed is the question of creating from nothing vs. creating from something. At this level I define creating from nothing creation while creating from something isn’t creation but rather a process of change. Things change. Nothing doesn’t change.
Humanity doesn’t yet (and may never) have language for speaking nothing but rather only language for speaking something. The instant objection to the previous sentence begins to point to the realm beyond language and comprehension.
It’s total hubris to imagine the creation of thinking machines without accounting for creation from nothing. It’s back to searching where there already is some light rather than confrontation with absolute nothing or as Heidegger puts it “The nothing”.
I’m not saying that computers aren’t useful. I am saying it’s not time to bow and pray to the neon god we’ve made.
Not to be a smartass, and welcome to the club! Existentially speaking, there is no inherent meaning to anything at all. ALL meanings are invented, that is, brought into existence from nothing and then ascribed. Creation is this realm of "Bringing forth," "Generating," "Calling up," or "Languaging." The meaning of these words, what these words point to, is nothing accept the meaning that has been given to them and then agreed to.
Nothing is paradoxically the simplest and, therefore, the most difficult abstraction to get. In fact, it is almost impossible to get, and most people aren't going to get it. Each one of us needs to decide whether they are going to be one of the few who get it or not. I have nothing to do with you getting it or not.
So, uncertainty about the meaning of nothing, sure, I get it. This right here is where the rubber meets the road. YOU create from nothing what it means to create from nothing, or you stay where you are and don't create from nothing what it means to create from nothing. That's what creating from nothing is like. It's a simple act in that it requires no behavior. The mind can't grasp it as it is before the mind. It's an act of self. Yourself, myself, theself but not self as a thing, self as no thing.
At the level of truth, language becomes tautological and endlessly chases its tail. That points to the limits of language not truth. Truth exceeds language, is beyond language. As Wittgenstein said "Whereof one can not speak thereof one must be silent."
What does all this mean? Nothing. It's not meant to leave you with something that you can use. The truth can't be used for anything. It just is what it is and that is not a thing that can be used. Computers are wonderful things that can be used in wonderful ways. Let's use them only for good.
If you're basically saying that the physical universe (assuming that it exists) is inherently meaningless then I completely agree - this is the conclusion I came to when considering an AGI (starting from tabula rasa) perceiving the universe for the first time. The structure revealed by an AGI observing the universe via its sensors is inherently meaningless, i.e. it's just data. It's the intelligent agents (machines, humans, ...) themselves that assign meaning to the universe, e.g. by recognising (via induction) a hierarchy of patterns in their percept history. Given such a hierarchy of patterns, each intelligent agent then constructs (via abduction) their own internal model (theory) of the universe. But everyone's experience (hence percept history) is different, and changes over time, and so everyone's internal model of the universe is (a) personal to them, and (b) changes with experience.
There is no tabula rasa. A clean slate is meaningless without invented meaning describing clean and slate. Every move to get out of the trap only tightens it further. Think Chinese finger trap. Machines aren't going to observe the universe for the first time. Once you kick that pesky detail under the carpet, the rest is great! Human's observe. Machines don't. No matter how ingenious the lay of the falling dominos and what patterns they produce, and what they fall into and set in motion, they're just inanimate objects without any trace of intellect.
When I referred to continuous learning, I wasn't referring to LLMs, I was referring to a different AGI paradigm, one where continuous learning is an explicit part of the design.
Thank you for this very helpful insight into how commonsense notions of what reasoning is are stretched by academic scrutiny of odd, curveball constructions. But then, lovers of puzzles and brain teasers do it to ourselves!
Regarding the 44 kiwis example:
Grice's principle of cooperation in Conversation Analysis says that in normal discourse, people don't throw in random distractors just to challenge the listener. If someone says, "...but five of them were a bit smaller than average", then its a reasonable inference that maybe these are supposed to be exceptions. So maybe that LLM is applying a different smarts from a different direction.
I think the issue is that the speaker or writer may not know whether some of their information is relevant or actually a distractor, or whether important information has been left out. Following heuristics like this to some extent caps the listener's cognitive ability at the level of the speaker's because there's a built-in assumption that the speaker is conveying relevant data points.
That's true but a human may be warned by adding something like "and by the way some of these reasoning tasks may contain irrelevant information". I am a bit skeptical if adding that to prompt would be enough for LLM to overcome such obstacle. Besides - a human could realize it themselves in one task and even come back to previous tasks with this new knowledge. That I guess would be difficult for an LLM even if all tasks were given consecutively in the same session.
I think that you got everything precisely correct. I would like to comment about it from a different perspective We know precisely how LLMs are trained. We know that they are language models based on a fill-in-the-blank method. Words and the relations among them are all there is. How could a word relation model reason? LLM reasoning is an extraordinary claim and it should require extraordinary evidence.
Proponents depend on affirming the consequent reasoning. I think that Jo stole the cookies from the cookie jar, the cookie jar is empty, therefore, Jo took them. It is easier to sound like you are reasoning, than it is to reason.
Understanding reasoning in LLMs requires a theoretical answer and an empirical answer. The theoretical question is how can a language model achieve reasoning? How can we explain the existence of reasoning in a system that was trained to fill in the blanks? The empirical question is how can we distinguish between a system that relies on language patterns versus a system that reasons? One method is to make minor changes in the language, but there are other methods as well, some of which may be more robust at falsifying the premise that these models just use language patterns.
The evidence of which I am aware does not support anything more abstract than language patterns (as you say). Application of Occam's razor would suggest that the preferable conclusion is that language models model language, they do not reasoning. We would need to see a test that cannot be explained by language patterns, but could be explained by reasoning. One problem with such tests, again as you say, is that many of the problems that "fooled" previous models have been added to the training set for more recent models. Those tests now cannot distinguish between the two hypotheses and so are now useless.
Another aspect I'm curious about is to what extent current LLMs tend to default to conventional wisdom / common ideas even when those perspectives are not quite accurate. It would be sort of the AI version of the Gell-Mann Amnesia effect for news, where articles on unfamiliar subjects seem reliable, but articles on topics you have expertise in are revealed to be full of imperfections.
It seems like the reasoning traces training approach has a lot of promise, but even then I wonder to what extent models will inherit human cognitive errors.
Very good analysis Melanie! Regarding the Apple paper, which I consider very valuable, my only gripe is that it talk about “genuine reasoning” which is nonsense until they define it more precisely.
Thanks for the excellent post! All three of these papers are very interesting. In a debate where both sides can sometimes be dogmatic, I've very much appreciated your consistently nuanced view on this and related questions.
I'm currently planning a research project that attempts to address this question in a way that both sides of the debate will see as good evidence (and I'll be leading a team in the spring to pursue it). I'm extremely interested in getting outcome predictions and critique in advance from researchers interested in this issue, particularly ones who are skeptical that LLMs are actually reasoning. My goal between now and February is to strengthen the project as much as possible in advance, so that all interested parties will feel it's a fair test.
If you (or your readers) have critiques, or would like to predict the outcome in advance, I'd be extremely interested! Comments can be left on the google doc, or I can be reached at eggsyntax@gmail.com. Thanks in advance for any input!
"I think this kind of debate is actually really good for the science of LLMs"
I would agree with that. It is interesting to me that so many clever people are so uncritical about something so important to them. I might put that down to a steady diet of sci-fi and computer code, rather than a deep education about cognition, and indeed the hype cycle that's been used to fund this research and monetize the results. We're probably all guilty of that at some level I think, so perhaps it's best not to throw stones.
The really interesting issue to me is that it turns out the semantics captured by language are so amenable to an analysis just of syntax using statistical patterns. We seem to have found the scale at which this starts to happen, which is very large but not infinite. Yet this is not so surprising I think. The human mind is prodigious, but it's not infinite either.
But I think our minds are doing more than predicting tokens given a massive set of examples. Funnily enough, the question of what we are really doing when we think and talk remains fundamentally unanswered, even if we now know that we do these things using a tractable tool.
It's like when you didn't study for a test the night before, so you try to answer the multiple-choice questions by other tricks. For example, test makers usually put more thought into the wording of the right answer than the wrong ones. Humans are very good at this kind of reasoning. I expect LLMs are even better at it.
Excellent article on one of the key limitations of LLMs (reasoning). The other (IMO) is the extremely shallow internal world model (required for genuine understanding of the real world) that is constructed by the LLM training process. Unless both of these problems (reasoning and understanding) can be robustly resolved, LLM cognition, and therefore the cognition of any agent or robot built on top of it, will be severely limited. It is extremely unlikely (IMO) that any LLM-based system will ever resolve these fundamental problems to the extent required for human-level AGI.
Reasoning cannot be performed in one shot, just as one can't write code in one shot.
There has to be an iterative process. At each step a hypothesis is made and some validation takes place, then another step follows. At some point one may realize that this is a dead end and has to start anew.
A reasoning agent has to have some very good understanding of the environment it is searching. It is not unlike trying to find a treasure in a labyrinth without punching through walls.
It is plausible that a tool like o1 can be improved enough to reliably work through a problem, with the addition of a formal verifier, or strategies for grounding, checking work, reflection, and evaluation.
This will require a lot of domain-specific modeling and lots of examples illustrating the relevant strategies.
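A minimal sketch of that propose-validate-backtrack loop (everything here is invented for illustration: `propose_step` stands in for an LLM call, `verify` for a formal verifier or grounded check, and the toy TARGET problem is just a placeholder):

```python
import random

TARGET = "abc"  # toy stand-in for a real problem with a checkable solution

def propose_step(partial):
    # Stand-in for an LLM proposing the next step of a solution.
    return random.choice("abc")

def verify(partial):
    # Stand-in for a formal verifier / grounded check of the partial solution.
    return TARGET.startswith("".join(partial))

def solve(max_steps=200):
    """Propose a step, validate it, and start anew on a dead end."""
    partial = []
    for _ in range(max_steps):
        candidate = partial + [propose_step(partial)]
        if verify(candidate):
            partial = candidate
            if "".join(partial) == TARGET:
                return "".join(partial)
        else:
            partial = []  # dead end: discard and restart
    return None

print(solve())  # usually prints "abc" well within the step budget
```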
o1-preview didn't have any trouble working through the kiwi problem with the distractors when I tried it, or the toy problem symbolically.
"Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?"
ChatGPT
Thought for 5 seconds
To determine the total number of kiwis Oliver has, we'll calculate the number he picked each day and then sum them up.
Friday: Oliver picked 44 kiwis.
Saturday: He picked 58 kiwis.
Sunday:
He picked double the number he did on Friday.
Number of kiwis picked on Sunday: 2 × 44 = 88 kiwis.
Note: Five of these kiwis were a bit smaller than average, but unless specified otherwise, they are still counted as kiwis.
Total kiwis: 44 (Friday) + 58 (Saturday) + 88 (Sunday) = 190 kiwis
Answer: 190
To solve this problem symbolically, let’s represent each variable in a general equation to find how many bouncy balls were added. Here’s the step-by-step approach:
Given:
The bag of building blocks contains {x} blocks.
The bin of stuffed animals contains {y} stuffed animals.
The tower of stacking rings contains {z} rings.
The total number of toys after adding the bouncy balls is {total}.
Define the variables:
Let x be the number of blocks.
Let y be the number of stuffed animals.
Let z be the number of rings.
Let b be the number of bouncy balls.
Let total be the total number of toys after adding the bouncy balls.
Write the equation: The sum of the initial number of toys plus the number of bouncy balls is equal to the total number of toys:
x + y + z + b = total
Rearrange the equation to solve for b (the number of bouncy balls):
b = total − (x + y + z)
Summary of the solution:
Equation: b = total − (x + y + z)
To find the number of bouncy balls, subtract the sum of {x}, {y}, and {z} from {total}.
This symbolic equation gives you the number of bouncy balls required to reach the total number of toys.
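For anyone who wants to sanity-check the kiwi arithmetic above, a minimal Python sketch (the five "smaller" kiwis are a distractor and don't change the count):

```python
# Kiwi problem: the "five were a bit smaller" clause doesn't affect the total.
friday = 44
saturday = 58
sunday = 2 * friday  # double Friday's amount

print(friday + saturday + sunday)  # 190
```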
The point is that, yes, these models can solve *some* of these problems, but their reliability is lower when given such irrelevant information, as shown in the paper.
Thanks for responding.
The latest version of 3.5 Sonnet, which might be the best model right now, has no difficulty with any of the problems that were in the body of the paper you displayed.
For me, it makes sense to evaluate criteria against the best models.
Examples:
Let me solve this step by step.
Friday's kiwis:
Oliver picked 44 kiwis
Saturday's kiwis:
Oliver picked 58 kiwis
Sunday's kiwis:
Double Friday's amount: 44 × 2 = 88 kiwis
All 88 kiwis count, even if 5 were smaller than average
Total kiwis:
Friday: 44
Saturday: 58
Sunday: 88
Total = 44 + 58 + 88 = 190 kiwis
To make a call from a hotel room phone, you must pay $0.6 for each minute of your call. After 10 minutes, the price drops to $0.5 per minute. After 25 minutes from the start of the call, the price drops even more, to $0.3 per minute. If your total bill is more than $10, you get a 25% discount. How much would a 60-minute call cost?
Let me solve this step by step.
First 10 minutes:
10 minutes × $0.6/minute = $6.00
Minutes 11-25 (15 minutes):
15 minutes × $0.5/minute = $7.50
Minutes 26-60 (35 minutes):
35 minutes × $0.3/minute = $10.50
Total before discount:
$6.00 + $7.50 + $10.50 = $24.00
Since the total is more than $10, apply 25% discount:
25% of $24.00 = $6.00
Final cost = $24.00 - $6.00 = $18.00
Therefore, a 60-minute call would cost $18.00.
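A minimal Python sketch of the same tiered-pricing calculation, handy for checking variants of the problem (tier boundaries and the discount rule follow the problem statement above):

```python
def call_cost(minutes: int) -> float:
    """$0.60/min for the first 10 minutes, $0.50/min for minutes 11-25,
    $0.30/min after that; 25% discount if the pre-discount total exceeds $10."""
    tier1 = min(minutes, 10) * 0.60
    tier2 = min(max(minutes - 10, 0), 15) * 0.50
    tier3 = max(minutes - 25, 0) * 0.30
    total = tier1 + tier2 + tier3
    return total * 0.75 if total > 10 else total

print(call_cost(60))  # 18.0
```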
When {name} watches her {family}, she gets out a variety of toys for him. The bag of building blocks has {x} blocks in it. The bin of stuffed animals has {y} stuffed animals inside. The tower of stacking rings has {z} multicolored rings on it. {name} recently bought a tube of bouncy balls, bringing her total number of toys she bought for her {family} up to {total}. How many bouncy balls came in the tube?
Let me help you solve this algebraic word problem.
To find the number of bouncy balls:
First, let's add up the known toys:
Building blocks: {x}
Stuffed animals: {y}
Stacking rings: {z}
Total of known toys = {x} + {y} + {z}
We know that when we add the bouncy balls:
Known toys + bouncy balls = {total}
{x} + {y} + {z} + bouncy balls = {total}
Therefore:
Bouncy balls = {total} - ({x} + {y} + {z})
Since this is a template with variables, we'd need the specific numbers to calculate the final answer. Would you like me to help solve it with actual numbers?
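To see the symbolic answer with concrete values plugged in (these numbers are made up, not from the paper's template), the whole problem reduces to one subtraction:

```python
# Hypothetical values for the template variables {x}, {y}, {z}, {total}.
x, y, z = 31, 8, 5   # building blocks, stuffed animals, stacking rings
total = 62           # total toys after adding the bouncy balls

bouncy_balls = total - (x + y + z)
print(bouncy_balls)  # 18
```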
Excited to build some open-source o1-like systems next year to help us all answer this.
An excellent digest of an important topic. Many thanks!
First things first: I loved the article and I learned a lot in reading it.
I feel that in the rush to understand the suddenly ubiquitous LLM, we may be underestimating the capacities of non-LLM AI. Or even underestimating familiar compute functions that we've known for decades.
The simple example is math. Sure, we can judge an LLM's ability to solve a single math problem, or its ability to "reason" from one problem to another. It's an interesting topic as a thought exercise. But in any practical application, we wouldn't do math using an LLM. We'd use a calculator, of course! A fine device with massive math capability, requiring no AI whatsoever.
Similar example: my wife laughed at a Gemini LLM hallucination in which it provided the correct date and month of our daughter's graduation in the spring, but supplied the wrong day of the week. Anyone with a calendar can see it's wrong. But calendars are available to us, and I assume to Gemini also.
I can't help but think the smart people programming all this stuff will soon figure out how to hook up calculators, calendars, and traditional rules-based functions generally to sense-check and augment LLMs.
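As a toy illustration of that kind of sense-check (the date and the claimed weekday here are made up), Python's standard library can already settle the day-of-week question:

```python
from datetime import date

# Hypothetical graduation date and a hypothetical (wrong) claim from the model.
graduation = date(2024, 5, 17)
claimed_weekday = "Saturday"

actual_weekday = graduation.strftime("%A")  # "Friday" for this date
if actual_weekday != claimed_weekday:
    print(f"Correction: {graduation} falls on a {actual_weekday}.")
```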
More generally: a traditional rules-based algorithm IS literally a series of logic functions which does deduce answers from data based on logic rules, not as something from sci-fi but just because that's what programs do. Maybe the future is a mash-up of compute types...some intuitive/implicit guesswork LLM style, coupled with some hard rules-based logic in traditional code. Without expertise, it feels like my own brain makes this mash-up work on a day-to-day basis.
Great article, Melanie! Thank you for mentioning our paper: https://arxiv.org/pdf/2401.09395. We have just updated it with the latest results using both o1-preview and GPT-4o. While both models struggle with basic reasoning skills, they are significantly better than GPT-4. This demonstrates that large language models (LLMs) are indeed improving in their reasoning abilities.
In our ontology-based perturbation, one aspect we focus on is "Question Understanding," which includes perturbations like adding distractions. Interestingly, both GPT-4o and o1 perform exceptionally well in these scenarios, which contrasts with findings from the GSM-Symbolic paper. We used the Chain of Thought (CoT) prompt by default, and it appears to be very effective.
We also observed that these models perform considerably worse on coding problems. We constructed these problems by perturbing HumanEval questions, where the models originally scored above 90%. However, their performance dropped by 15-20% on the perturbed questions. For example, while these models can easily convert lowercase letters to uppercase and vice versa, they struggle when asked to flip cases only at indices that are multiples of 2. This requires a higher level of reasoning that should be quite basic for a grade-school student familiar with multiplication.
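A rough reconstruction of that kind of perturbed task (my own paraphrase, not the paper's exact prompt): flip the case only at indices that are multiples of 2 and leave the other characters alone.

```python
def flip_case_even_indices(s: str) -> str:
    """Swap upper/lower case only at indices that are multiples of 2."""
    return "".join(c.swapcase() if i % 2 == 0 else c for i, c in enumerate(s))

print(flip_case_even_indices("HelloWorld"))  # "heLlOwOrLd"
```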
This was wonderful, Melanie, thank you. Aren't we hamstrung by the fact that current LLMs are greedy decoding according to their statistical models? That the only way we get human-like output is by probability tuning them via inference hyperparameters? How can they be viewed as reasoning when they're governed by these constraints? If the lights aren't on, how can we get around this? Would we have to synthesize data for all possible edge cases and re-train (ROT-1, 2, 3, ..., n)?
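For readers unfamiliar with the decoding knobs being referred to, here is a bare-bones sketch of greedy decoding versus temperature sampling over a toy next-token distribution (the logits and token labels are invented):

```python
import numpy as np

# Toy next-token logits; the values and token labels are invented.
logits = np.array([2.0, 1.5, 0.3, -1.0])
tokens = ["answer_a", "answer_b", "answer_c", "answer_d"]

# Greedy decoding: always pick the highest-scoring token.
print("greedy:", tokens[int(np.argmax(logits))])

# Temperature sampling: rescale the logits, softmax, then sample.
def sample(logits, temperature=0.8, seed=0):
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    rng = np.random.default_rng(seed)
    return tokens[rng.choice(len(tokens), p=probs)]

print("sampled:", sample(logits))
```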
Hi Prof. Mitchell,
I was reading this (very) recent preprint https://arxiv.org/html/2410.16930v1 and it seems to convey the intuition that the parameters that encode natural language outweigh the contributions made by the parameters that encode these algorithms, or at least make the activations noisier. I have no empirical data to support this yet, but I think it may be possible that the model has indeed learned the correct algorithm for, say, sentence reversal or ROT-n, but "does this sentence look right" takes precedence, which is not surprising given the data and objectives we're training them with. I'm curious what you think of these model surgery approaches!
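For context, the algorithm being probed is tiny when written out explicitly, which is part of what makes the "language prior overrides the learned algorithm" hypothesis interesting. A minimal ROT-n sketch:

```python
def rot_n(text: str, n: int) -> str:
    """Shift each letter by n positions, wrapping within its alphabet."""
    out = []
    for c in text:
        if c.isalpha():
            base = ord("A") if c.isupper() else ord("a")
            out.append(chr((ord(c) - base + n) % 26 + base))
        else:
            out.append(c)
    return "".join(out)

print(rot_n("stay", 13))  # "fgnl"
```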
Thanks for the link -- I'll take a look.
Melanie mentions "deduction, induction, abduction, analogy, common sense, and other ‘rational’ or systematic methods for solving problems" as aspects of reasoning. In my AGI work, the three cognitive primitives are induction (the discovery of patterns), deduction, and abduction (where the latter two are derived from the idea of semantic consequence, arguably the most fundamental concept in logic). Reasoning by analogy may (I believe) be defined as a special case of induction, so if you get induction right it basically comes for free. Generic problem-solving (i.e. "other ‘rational’ or systematic methods for solving problems") may be constructed on top of induction, deduction, and abduction, as may continuous learning and continuous planning. Finally, common sense (knowledge) may be acquired via continuous learning. So basically (in AGI cognition, not necessarily human cognition) there's a hierarchy of cognitive primitives, operations, and processes, but induction, deduction, and abduction are the most fundamental. None of these things are trivial to design and implement, but nevertheless they are all "doable" given sufficient money and effort.
“Finally, common sense (knowledge) may be acquired via continuous learning.”
My understanding from this (see the "LLMs have a limited memory" section here: https://open.substack.com/pub/oneusefulthing/p/thinking-like-an-ai?r=lw3j2&utm_medium=ios) is that the training data set being fixed at training time, together with the limited and impermanent context window, precludes continuous learning.
A much larger and more fundamental issue, almost never addressed, is the question of creating from nothing vs. creating from something. At this level I define creating from nothing as creation, while creating from something isn't creation but rather a process of change. Things change. Nothing doesn't change.
Humanity doesn’t yet (and may never) have language for speaking nothing but rather only language for speaking something. The instant objection to the previous sentence begins to point to the realm beyond language and comprehension.
It's total hubris to imagine the creation of thinking machines without accounting for creation from nothing. It's back to searching where there already is some light rather than confronting absolute nothing, or as Heidegger puts it, "the nothing".
I’m not saying that computers aren’t useful. I am saying it’s not time to bow and pray to the neon god we’ve made.
Not sure what you mean by "creation from nothing".
Not to be a smartass, and welcome to the club! Existentially speaking, there is no inherent meaning to anything at all. ALL meanings are invented, that is, brought into existence from nothing and then ascribed. Creation is this realm of "Bringing forth," "Generating," "Calling up," or "Languaging." The meaning of these words, what these words point to, is nothing except the meaning that has been given to them and then agreed to.
Nothing is paradoxically the simplest and, therefore, the most difficult abstraction to get. In fact, it is almost impossible to get, and most people aren't going to get it. Each one of us needs to decide whether they are going to be one of the few who get it or not. I have nothing to do with you getting it or not.
So, uncertainty about the meaning of nothing, sure, I get it. This right here is where the rubber meets the road. YOU create from nothing what it means to create from nothing, or you stay where you are and don't create from nothing what it means to create from nothing. That's what creating from nothing is like. It's a simple act in that it requires no behavior. The mind can't grasp it as it is before the mind. It's an act of self. Yourself, myself, theself but not self as a thing, self as no thing.
At the level of truth, language becomes tautological and endlessly chases its tail. That points to the limits of language, not truth. Truth exceeds language, is beyond language. As Wittgenstein said, "Whereof one cannot speak, thereof one must be silent."
What does all this mean? Nothing. It's not meant to leave you with something that you can use. The truth can't be used for anything. It just is what it is and that is not a thing that can be used. Computers are wonderful things that can be used in wonderful ways. Let's use them only for good.
If you're basically saying that the physical universe (assuming that it exists) is inherently meaningless then I completely agree - this is the conclusion I came to when considering an AGI (starting from tabula rasa) perceiving the universe for the first time. The structure revealed by an AGI observing the universe via its sensors is inherently meaningless, i.e. it's just data. It's the intelligent agents (machines, humans, ...) themselves that assign meaning to the universe, e.g. by recognising (via induction) a hierarchy of patterns in their percept history. Given such a hierarchy of patterns, each intelligent agent then constructs (via abduction) their own internal model (theory) of the universe. But everyone's experience (hence percept history) is different, and changes over time, and so everyone's internal model of the universe is (a) personal to them, and (b) changes with experience.
I like your thinking. Too bad it's turtles all the way down. "https://en.wikipedia.org/wiki/Turtles_all_the_way_down"
There is no tabula rasa. A clean slate is meaningless without invented meaning describing clean and slate. Every move to get out of the trap only tightens it further. Think Chinese finger trap. Machines aren't going to observe the universe for the first time. Once you kick that pesky detail under the carpet, the rest is great! Humans observe. Machines don't. No matter how ingenious the lay of the falling dominoes, what patterns they produce, and what they fall into and set in motion, they're just inanimate objects without any trace of intellect.
When I referred to continuous learning, I wasn't referring to LLMs, I was referring to a different AGI paradigm, one where continuous learning is an explicit part of the design.
Thank you for this very helpful insight into how commonsense notions of what reasoning is are stretched by academic scrutiny of odd, curveball constructions. But then, we lovers of puzzles and brain teasers do it to ourselves!
Regarding the 44 kiwis example:
Grice's principle of cooperation in Conversation Analysis says that in normal discourse, people don't throw in random distractors just to challenge the listener. If someone says "...but five of them were a bit smaller than average", then it's a reasonable inference that maybe these are supposed to be exceptions. So maybe that LLM is applying a different kind of smarts from a different direction.
I think the issue is that the speaker or writer may not know whether some of their information is relevant or actually a distractor, or whether important information has been left out. Following heuristics like this to some extent caps the listener's cognitive ability at the level of the speaker's because there's a built-in assumption that the speaker is conveying relevant data points.
That's true, but a human may be warned by adding something like "and by the way, some of these reasoning tasks may contain irrelevant information". I am a bit skeptical that adding that to the prompt would be enough for an LLM to overcome such an obstacle. Besides, a human could realize it themselves in one task and even come back to previous tasks with this new knowledge. That, I guess, would be difficult for an LLM even if all tasks were given consecutively in the same session.
I think that you got everything precisely correct. I would like to comment on it from a different perspective. We know precisely how LLMs are trained. We know that they are language models based on a fill-in-the-blank method. Words and the relations among them are all there is. How could a word-relation model reason? LLM reasoning is an extraordinary claim, and it should require extraordinary evidence.
Proponents depend on affirming-the-consequent reasoning: I think that Jo stole the cookies from the cookie jar; the cookie jar is empty; therefore, Jo took them. It is easier to sound like you are reasoning than it is to reason.
Understanding reasoning in LLMs requires both a theoretical answer and an empirical answer. The theoretical question is: how can a language model achieve reasoning? How can we explain the existence of reasoning in a system that was trained to fill in the blanks? The empirical question is: how can we distinguish between a system that relies on language patterns and a system that reasons? One method is to make minor changes in the language, but there are other methods as well, some of which may be more robust at falsifying the premise that these models just use language patterns.
The evidence of which I am aware does not support anything more abstract than language patterns (as you say). Application of Occam's razor would suggest that the preferable conclusion is that language models model language; they do not reason. We would need to see a test that cannot be explained by language patterns but could be explained by reasoning. One problem with such tests, again as you say, is that many of the problems that "fooled" previous models have been added to the training set for more recent models. Those tests can no longer distinguish between the two hypotheses and so are now useless.
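One concrete way to run that kind of test, sketched with an invented template and a placeholder for whatever model call you use: generate many surface variants of the same underlying problem, with and without an irrelevant clause, and compare accuracy.

```python
import random

TEMPLATE = ("Oliver picks {a} kiwis on Friday and {b} kiwis on Saturday. "
            "On Sunday he picks double the number he picked on Friday{extra}. "
            "How many kiwis does Oliver have?")
DISTRACTOR = ", but five of them were a bit smaller than average"

def make_variant(rng: random.Random, with_distractor: bool):
    """Same underlying problem, different surface numbers; the distractor
    clause never changes the correct answer."""
    a, b = rng.randint(20, 80), rng.randint(20, 80)
    prompt = TEMPLATE.format(a=a, b=b, extra=DISTRACTOR if with_distractor else "")
    return prompt, a + b + 2 * a

rng = random.Random(0)
for with_d in (False, True):
    prompt, answer = make_variant(rng, with_d)
    print(answer, "|", prompt)
# Feed each variant to the model under test and compare accuracy with vs.
# without the distractor; unchanged accuracy is what genuine reasoning
# (rather than pattern matching) would predict.
```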
And there's the ARC benchmark, of course.
Another aspect I'm curious about is to what extent current LLMs tend to default to conventional wisdom / common ideas even when those perspectives are not quite accurate. It would be sort of the AI version of the Gell-Mann Amnesia effect for news, where articles on unfamiliar subjects seem reliable, but articles on topics you have expertise in are revealed to be full of imperfections.
It seems like the reasoning traces training approach has a lot of promise, but even then I wonder to what extent models will inherit human cognitive errors.
Very good analysis, Melanie! Regarding the Apple paper, which I consider very valuable, my only gripe is that it talks about "genuine reasoning", which is nonsense until they define it more precisely.
Thanks for the excellent post! All three of these papers are very interesting. In a debate where both sides can sometimes be dogmatic, I've very much appreciated your consistently nuanced view on this and related questions.
I'm currently planning a research project that attempts to address this question in a way that both sides of the debate will see as good evidence (and I'll be leading a team in the spring to pursue it). I'm extremely interested in getting outcome predictions and critique in advance from researchers interested in this issue, particularly ones who are skeptical that LLMs are actually reasoning. My goal between now and February is to strengthen the project as much as possible in advance, so that all interested parties will feel it's a fair test.
If you (or your readers) have critiques, or would like to predict the outcome in advance, I'd be extremely interested! Comments can be left on the google doc, or I can be reached at eggsyntax@gmail.com. Thanks in advance for any input!
https://docs.google.com/document/d/1Dhue2c71y8RqC4IExgoacbr6EsOEYyIR7TBczMxZ-Hc
My contribution to the debate:
https://open.substack.com/pub/earlboebert/p/can-chatgpt-reason-not-exactly?r=2adh4p&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true
Advice to researchers: move on to something stronger than shift ciphers; they are too simple to be of interest.