What should we believe about the reasoning abilities of today’s large language models? As the headlines above illustrate, there’s a debate raging over whether these enormous pre-trained neural networks have achieved humanlike reasoning abilities, or whether their skills are in fact “a mirage.”
Thanks for this very clear exposition. Talking as if you are reasoning is different from reasoning. To claim that a system known to work by predicting the next word is reasoning is an extraordinary claim, and it should require extraordinary evidence; yet, as you point out, there is little critical evaluation. The public, governments, and some computer scientists are being bamboozled into thinking that these models do things they are incapable of doing, without considering the possibility that they are just following the statistical language patterns they have learned. This should be just basic science, but evidently it is not. Your critical thought is essential in the medium and long run. Thanks!
I have been focusing on music instead of AI lately :) But a while back I looked into the chain-of-thought reasoning claims and I do not think it constitutes any reasoning whatsoever, nor even a good illusion.
The most significant observation: suppose an LLM can solve problem A and problem B with chain-of-thought prompting. If the LLM truly understood step-by-step thinking, it should also be able to solve "first do A, then do B" with chain-of-thought prompting, since that's still step-by-step. But this is not the case! I suspect the linguistic trick fails because it can't handle conjunctions.
Concretely: GPT-4 tends to guess on counting problems unless you specify chain-of-thought prompting, in which case it'll put the problem in 1-1 correspondence with the whole numbers and get the right answer. But if you ask it to count *two* things step-by-step, it goes right back to inaccurate guessing. The weakness and unreliability of chain-of-thought prompting goes against the rosy anthropomorphic interpretation.
Another way of looking at it: if you ask GPT-4 to explain a simple fact "step-by-step," it will throw in a bunch of extraneous "steps" that aren't germane to the fact. The shape of step-by-step reasoning is what GPT is going for, but it doesn't understand why the tactic works for some problems and fails for others.
I am verbose and not an expert so you might just want to jump to the screenshots...
"Reasoning" is the application of knowledge to a problem statement that results in making certain information explicit, namely, an answer that was always there, it just wasn't written down yet.
If the knowledge permits derivation of the result in one step of pattern matching---equivalent to constraint satisfaction---then we can expect many architectures to work. If instead, multiple steps are required, then the problem becomes one of search. Search requires keeping track of intermediate results, along with info to navigate the search space.
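The one-step-versus-multi-step distinction can be made concrete with a toy Python sketch (the rewrite rules and function names here are my own invention, purely for illustration): a single pattern-matching step is just a lookup, while a multi-step derivation becomes a search that must carry intermediate results plus the bookkeeping needed to navigate the search space.

```python
from collections import deque

# Hypothetical rewrite rules: any adjacent pair matching a key
# may be replaced by its value.
RULES = {"AB": "B", "BB": "A"}

# One step of pattern matching is just a lookup:
print(RULES.get("AB"))  # -> B

def rewrites(s):
    """All strings reachable from s by applying one rule at one position."""
    out = []
    for i in range(len(s) - 1):
        pair = s[i:i + 2]
        if pair in RULES:
            out.append(s[:i] + RULES[pair] + s[i + 2:])
    return out

def derive(start, goal):
    """Breadth-first search: the queue and `seen` set are exactly the
    'intermediate results plus navigation info' a reasoner must hold."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for nxt in rewrites(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None

print(derive("AABB", "A"))  # -> ['AABB', 'ABB', 'BB', 'A']
```

The point of the sketch: once more than one step is required, the solver must hold a frontier of partial derivations somewhere, which is precisely the question the next paragraph asks about LLMs.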
Where might an LLM hold this information?
In a transformer, in order to parse and emit natural language, the early and late layers probably must attend primarily to lexical and syntactic matters. Presumably the middle layers are the ones that can afford to represent semantics, including various forms of abstraction. That said, the activation vectors must carry information at different time scales as the residual stream gets modified: local linguistic patterns must be carried from input to output, but through superposition, activations in the middle layers also carry longer-range pressures and constraints associated with the categories and manifolds underlying the structure of the problem domain and the problem statement.
Does a transformer LLM have enough room to hold intermediate steps of a complex reasoning task?
The amount of activation-vector capacity is the depth of the network times the number of tokens in play. Chain-of-thought prompting expands capacity by allocating tokens in the context to *procedural steps* in the reasoning process. Instead of having to internally shoehorn placeholders for its location in the search tree into a short context consisting of the problem statement and a compact output, the model now has the luxury of seeing where it is in the search space right there in writing (output tokens representing steps along the chain of thought). Crucially, intermediate results are made explicit in the context, which allows a large and complex reasoning process to be subdivided into smaller parts, each of which can be solved with a smaller, more tractable pattern-matching step.
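The capacity argument can be sketched as code (all names here are hypothetical, not from the comment): a direct prompt forces the model to hold all intermediate state internally, while a chain-of-thought prompt writes each intermediate result back into the context as ordinary tokens that later forward passes can attend to.

```python
# Minimal illustration of how chain-of-thought prompting externalizes
# search state: completed steps become plain tokens in the context.

def direct_prompt(problem: str) -> str:
    # The model must track all intermediate state internally.
    return f"{problem}\nAnswer:"

def cot_prompt(problem: str, steps: list[str]) -> str:
    # Each completed step is appended to the context, where later
    # forward passes can read it back instead of re-deriving it.
    lines = [problem, "Let's think step by step."]
    lines += [f"Step {i + 1}: {s}" for i, s in enumerate(steps)]
    return "\n".join(lines)

problem = "How many vowels are in 'transformer'?"
steps = ["t-no", "r-no", "a-yes (1)", "n-no", "s-no", "f-no",
         "o-yes (2)", "r-no", "m-no", "e-yes (3)", "r-no"]
print(cot_prompt(problem, steps))
```

Each "Step i" line is exactly the kind of explicit intermediate result the paragraph describes: the next pattern-matching step only needs to extend the written record, not reconstruct the whole derivation internally.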
Transformers are not the only LLM architecture. Other architectures, especially ones that hold a great deal of internal state that is not closely constrained to text tokens, might behave quite differently.
Thanks for this insightful post.
"Take a deep breath and work step-by-step!" is now more effective in some tasks than "Let’s think step by step". https://arxiv.org/pdf/2309.03409.pdf
It seems the teams at DeepMind were right all along: the same patterns of "content effects" affect both humans and LLMs.
One thing we should look for to detect abstract reasoning is cases where abstract reasoning leads to errors. The classic examples are things like "Birds can fly, a penguin is a kind of bird, therefore a penguin flies." Abstract reasoning works by discarding most (or all) of the context and applying an abstract rule to draw a conclusion. This is dangerous, of course, and can lead to faulty conclusions. I suspect that human inference includes a third step, checking the conclusion for plausibility by referring back to what is known about the context. I guess we would need to create novel contexts where the AI system lacks (pre-training) experience in order to detect these kinds of errors.
Thanks for a very insightful and detailed discussion on the cognitive ability and in some sense a step towards understanding what creativity means in the context of LLM/GPT.
I am not sure what your thoughts are on this (or whether you know of any related discussions), but I think reasoning and creativity are very important and practical questions. As we deal with the legality of patents and copyrights, the central questions also come to concern the intellectual capabilities of the AI. So these are no longer academic or merely philosophical questions, but issues that will shape AI legislation and its impact.
Did you catch the paper "The Larger They Are, the Harder They Fail: Language Models Do Not Recognize Identifier Swaps in Python"?
That would have made a perfect inclusion in this article
The discourse around the reasoning abilities of LLMs is indeed multi-faceted. While the emergence of CoT prompting has unveiled certain latent capabilities in these models, the depth of true reasoning versus sophisticated pattern recognition remains an open question. The studies you mentioned indeed hint at a more complex interplay between memorization and reasoning, showcasing not just the strides made in AI development but also the intricate pathway that lies ahead in achieving genuine artificial general intelligence.
Exciting times ahead!
I truly value this informative article, thank you! We featured it in today's newsletter. :)
Really appreciated this clear discussion, thank you!
As with most of AI discussions these days, we get so caught up in the glitzy outputs we don’t stop to define our terms or think robustly about what is required to test. Thank you for helping cut through the noise!
That's a wonderful summary of the SoTA, Melanie. I have written about the supposed "emergent" features here, providing a perspective from complexity science.
I’m interested in the question of an LLM “memorizing” something. When I prompt ChatGPT with “To be or not to be”–nothing more–it returns the entire soliloquy along with a bit of commentary. Given that it was trained to predict the next word, what must have been the case for it to be able to return that entire soliloquy, word for word?
Shakespeare’s “Hamlet” is a well-known play and likely appeared many times in the training corpus. That soliloquy is also well-known and probably appeared many times independently of the play itself, along with commentary. Given that GPT encountered that particular string many times, it makes sense that it should have “memorized” it, whatever that means.
Now, what happens when you prompt it with a phrase from the soliloquy? I opened a new session and prompted it with “The insolence of office”. It returned pretty much the entire soliloquy. “The slings and arrows” (another new session) returned the first five lines of the soliloquy (it begins the third line). Then I prompted it with “and sweat under a,” (new session), which is from the middle of a line a bit past the soliloquy’s midpoint. ChatGPT didn’t recognize it, but then did so when I told it that it was from Hamlet’s famous soliloquy. I think this is worth exploring further – prompting with phrases from various locations – but I haven’t done so yet.
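That fragment-position experiment is easy to put into a harness. Here is a minimal Python sketch (the function names and the stub completer are my own, for illustration): it slices fragments from several relative positions in a text and checks whether a completer continues the source word-for-word. Swapping the stub for a call to a real chat model would run the actual experiment.

```python
# Probe whether a completer continues a (possibly memorized) text when
# prompted with fragments taken from different relative positions.
# `complete` is any callable prompt -> continuation; a stub is used here
# so the harness can be checked before pointing it at a real model.

def probe_fragments(text, complete, frag_len=5,
                    positions=(0.0, 0.25, 0.5, 0.75)):
    words = text.split()
    results = {}
    for p in positions:
        start = int(p * (len(words) - frag_len))
        fragment = " ".join(words[start:start + frag_len])
        expected_next = (words[start + frag_len]
                         if start + frag_len < len(words) else "")
        continuation = complete(fragment)
        # Success = the first continued word matches the source text.
        results[p] = continuation.split()[:1] == [expected_next]
    return results

# Stub completer that has "memorized" the text perfectly:
soliloquy = ("To be or not to be that is the question "
             "Whether tis nobler in the mind to suffer "
             "The slings and arrows of outrageous fortune")

def perfect_recall(prompt):
    idx = soliloquy.find(prompt)
    return soliloquy[idx + len(prompt):].strip() if idx >= 0 else ""

print(probe_fragments(soliloquy, perfect_recall))
```

A real model would presumably show the pattern described above: success for fragments near the famous opening, degrading for fragments buried mid-line, which is exactly what the per-position result dictionary would reveal.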
Then we have the rather different situation of things that are (likely to have been) in the training corpus but do not show up when prompted for. Years ago I heard Dizzy Gillespie play at the Left Bank Jazz Society in Baltimore. I blogged about it twice, both times well before the cut-off date for training. I have no way of knowing whether or not those blog posts were actually in the training corpus, but I gave ChatGPT the following prompt: “Dizzy Gillespie plays for the Left Bank Jazz Society in Baltimore’s Famous Ballroom.” It didn’t recognize the event and gave a confused reply. I then named the two blogs where I’d posted about the concert. Again, nothing.
Since I don’t know whether or not those blog posts were actually in the training corpus, I don’t know quite what to think. But my default belief at this time is that there’s a bunch of stuff in the corpus that never shows up in response to prompting, because those events didn’t appear very often in the corpus.
Between those two cases we have something like the Johnstown flood of 1889, a historical event of some moderate importance, but certainly not as prominent as, say, the bombing of Pearl Harbor. I prompted ChatGPT with “Johnstown flood, 1889,” and got a reasonable response. (Having grown up in Johnstown, I know something about that flood.) I issued the same prompt at a different session and again got a reasonable response, but one that was different from the first.
I’ve written this up in a blog post: https://new-savanna.blogspot.com/2023/09/what-must-be-case-that-chatgpt-would.html
Best overview on the subject of reasoning to date! Great read
Great post. Always good to read critical overviews of large neural networks and the hype surrounding their cognitive capabilities. LLMs are very good at confabulation as would be expected from their training process (because they essentially compress a large corpus of text onto a smooth manifold) so they're great at generating some text on some topic with a specific style or content that combines 2 or more topics like "Adelic Quantum Group Representations" (https://news.ycombinator.com/item?id=37368561) but no one should expect smooth manifold approximations of text (and other modalities) to have any logical or abstract reasoning capabilities.
What an enlightening post! In the same vein, look at a recent talk by Evelina Fedorenko from MIT (https://bit.ly/40TDF9I) based on direct brain observations (EEG, fMRI, etc.). She shows that language and thought only weakly overlap, even in the human brain.
I know some LLMs that can reason a lot better than some humans I know