54 Comments
Sep 11, 2023 · Liked by Melanie Mitchell

Thanks for this very clear exposition. Talking as if you are reasoning is different from reasoning. To claim that a system known to work by predicting the next word is reasoning is an extraordinary claim, and it should require extraordinary evidence; yet, as you point out, there is little critical evaluation. The public, governments, and some computer scientists are being bamboozled into thinking that these models do things they are incapable of doing, without considering the possibility that they are just following the statistical language patterns they have learned. This should be just basic science, but evidently it is not. Your critical thinking is essential in the medium and long run. Thanks!

Sep 11, 2023 · Liked by Melanie Mitchell

"Reasoning" is the application of knowledge to a problem statement that results in making certain information explicit, namely, an answer that was always there, it just wasn't written down yet.

If the knowledge permits derivation of the result in one step of pattern matching---equivalent to constraint satisfaction---then we can expect many architectures to work. If instead, multiple steps are required, then the problem becomes one of search. Search requires keeping track of intermediate results, along with info to navigate the search space.

Where might an LLM hold this information?

In a transformer, in order to parse and emit natural language, the early and late layers probably must attend primarily to lexical and syntactic matters. Presumably the middle layers are the ones that can afford to represent semantics, including various forms of abstraction. That said, the activation vectors must share information across different time scales as the residual stream gets modified: local linguistic patterns must be carried from input to output, but through superposition, activations in the middle layers also carry longer-range pressures and constraints associated with the categories and manifolds underlying the structure of the problem domain and the problem statement.

Does a transformer LLM have enough room to hold intermediate steps of a complex reasoning task?

The amount of activation vector capacity is the depth of the network times the number of tokens in play. Chain-of-Thought expands capacity by allocating tokens in the context window to *procedural steps* in the reasoning process. Instead of the model having to internally shoehorn placeholders for its location in the search tree into a short context consisting of the problem statement and a compact output, it now has the luxury of seeing where it is in the search space right there in writing (output tokens representing steps along the chain of thought). And crucially, intermediate results are made explicit in the context, which allows subdivision of a large and complex reasoning process into smaller parts, each of which can be solved with a smaller, more tractable pattern-matching step.
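
To make that arithmetic concrete, here is a rough sketch; the layer count, model width, and token counts below are invented for illustration, and only the multiplication matters:

```python
# Rough sketch of the capacity argument: every token position carries its
# own stack of activation vectors, so chain-of-thought output adds room
# for intermediate state that later steps can attend to.

def activation_capacity(n_layers: int, n_tokens: int, d_model: int) -> int:
    """Total activation values 'in play' across the residual stream."""
    return n_layers * n_tokens * d_model

N_LAYERS, D_MODEL = 32, 4096      # hypothetical transformer dimensions

problem_tokens = 120              # problem statement
answer_tokens = 5                 # bare final answer
cot_tokens = 200                  # written-out intermediate steps

direct = activation_capacity(N_LAYERS, problem_tokens + answer_tokens, D_MODEL)
with_cot = activation_capacity(
    N_LAYERS, problem_tokens + cot_tokens + answer_tokens, D_MODEL
)

print(f"direct answer:    {direct:,} activation values")
print(f"chain of thought: {with_cot:,} activation values "
      f"({with_cot / direct:.1f}x the scratch space)")
```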

Transformers are not the only LLM architecture. Other architectures, especially ones that hold a great deal of internal state that is not closely constrained to text tokens, might behave quite differently.

Sep 11, 2023 · Liked by Melanie Mitchell

One thing we should look for to detect abstract reasoning is cases where abstract reasoning leads to errors. The classic examples are things like "Birds can fly, a penguin is a kind of bird, therefore a penguin flies." Abstract reasoning works by discarding most (or all) of the context and applying an abstract rule to draw a conclusion. This is dangerous, of course, and can lead to faulty conclusions. I suspect that human inference includes a third step, checking the conclusion for plausibility by referring back to what is known about the context. I guess we would need to create novel contexts where the AI system lacks (pre-training) experience in order to detect these kinds of errors.
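
To make the three steps concrete, here is a toy sketch; the rule table and exception set are of course invented for illustration:

```python
# Toy illustration of the failure mode: a default rule applied with the
# context discarded, followed by the hypothesized third step of checking
# the conclusion against what is known about the specific case.

DEFAULT_RULES = {"bird": "can fly"}            # abstract rule: birds fly
KNOWN_EXCEPTIONS = {("penguin", "can fly")}    # context-specific knowledge

def naive_inference(individual: str, category: str) -> str:
    """Steps 1-2: discard context, apply the category's abstract rule."""
    return f"A {individual} {DEFAULT_RULES[category]}."

def checked_inference(individual: str, category: str) -> str:
    """Step 3: sanity-check the conclusion against known exceptions."""
    conclusion = DEFAULT_RULES[category]
    if (individual, conclusion) in KNOWN_EXCEPTIONS:
        return f"A {individual} is a {category}, but does not actually satisfy '{conclusion}'."
    return f"A {individual} {conclusion}."

print(naive_inference("penguin", "bird"))    # "A penguin can fly."  <- faulty
print(checked_inference("penguin", "bird"))  # flags the exception
```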

Thanks for this insightful post.

"Take a deep breath and work step-by-step!" is now more effective in some tasks than "Let’s think step by step". https://arxiv.org/pdf/2309.03409.pdf

It seems the teams at DeepMind were right all along: the same patterns of "content effects" affect both humans and LLMs.
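
For anyone who wants to check this on their own tasks, a minimal comparison might look like the sketch below; `ask_model` is a placeholder for whatever LLM client you have available, and the exact-substring scoring is deliberately crude:

```python
# Minimal sketch for comparing two instruction prefixes on the same task set.

PREFIXES = [
    "Let's think step by step.",
    "Take a deep breath and work step-by-step!",
]

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM call here")

def accuracy(prefix: str, problems: list[tuple[str, str]]) -> float:
    """Fraction of problems whose expected answer appears in the reply."""
    hits = 0
    for question, expected in problems:
        reply = ask_model(f"{prefix}\n\n{question}")
        hits += expected.lower() in reply.lower()
    return hits / len(problems)

# Example usage (with your own problems and model):
# problems = [("If I have 3 apples and eat 1, how many are left?", "2")]
# for prefix in PREFIXES:
#     print(prefix, accuracy(prefix, problems))
```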

Sep 13, 2023 · Liked by Melanie Mitchell

Thanks for a very insightful and detailed discussion of cognitive ability in LLMs/GPT, and, in some sense, a step towards understanding what creativity means in that context.

I am not sure what your thoughts are on this (or whether you know of any related discussions): reasoning and creativity raise very important and practical questions. As we deal with the legality of patents and copyrights, the central questions increasingly revolve around what the intellectual capabilities of the AI actually are. So these are no longer academic or purely philosophical questions, but issues that will shape AI legislation and its impacts.

Did you catch the paper "The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python"?

That would have made a perfect inclusion in this article:

https://arxiv.org/abs/2305.15507

Really appreciated this clear discussion, thank you!

As with most AI discussions these days, we get so caught up in the glitzy outputs that we don't stop to define our terms or think rigorously about what is required to test them. Thank you for helping cut through the noise!

Sep 14, 2023 · Liked by Melanie Mitchell

The discourse around the reasoning abilities of LLMs is indeed multi-faceted. While the emergence of CoT prompting has unveiled certain latent capabilities in these models, the depth of true reasoning versus sophisticated pattern recognition remains an open question. The studies you mentioned hint at a more complex interplay between memorization and reasoning, showcasing not just the strides made in AI development but also the intricate path that lies ahead in achieving genuine artificial general intelligence.

Exciting times ahead!

I truly value this informative article, thank you! I have featured it in today's newsletter. :)

That's a wonderful summary of the SoTA, Melanie. I have written about the supposed "emergent" features here:

https://manlius.substack.com/p/navigating-the-transformative-potential

providing a perspective from complexity science.

Sep 11, 2023 · Edited Sep 12, 2023 · Liked by Melanie Mitchell

I’m interested in the question of an LLM “memorizing” something. When I prompt ChatGPT with “To be or not to be”–nothing more–it returns the entire soliloquy along with a bit of commentary. Given that it was trained to predict the next word, what must have been the case for it to be able to return that entire soliloquy, word for word?

Shakespeare’s “Hamlet” is a well-known play and likely appeared many times in the training corpus. That soliloquy is also well-known and probably appeared many times independently of the play itself, along with commentary. Given that GPT encountered that particular string many times, it makes sense that it should have “memorized” it, whatever that means.

Now, what happens when you prompt it with a phrase from the soliloquy? I opened a new session and prompted it with "The insolence of office". It returned pretty much the entire soliloquy. "The slings and arrows" (another new session) returned the first five lines of the soliloquy (it begins the third line). Then I prompted it with "and sweat under a" (new session), which is from the middle of a line a bit past the middle. ChatGPT didn't recognize it, but then did so when I told it that it was from Hamlet's famous soliloquy. I think this is worth exploring further – prompting with phrases from various locations – but I haven't done so yet.
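
If I do get around to it, the probe could be automated with something like the sketch below, assuming the OpenAI Python client; the model name and the particular fragments are just illustrative:

```python
# Sketch of the systematic probe: prompt with fragments taken from different
# positions in the soliloquy and see whether the model continues or
# recognizes the text. Each API call is a fresh, stateless "session".

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FRAGMENTS = [
    "To be or not to be",        # the famous opening
    "The slings and arrows",     # start of an early line
    "The insolence of office",   # opening of a line deep in the soliloquy
    "and sweat under a",         # mid-line fragment, rarely quoted alone
]

for fragment in FRAGMENTS:
    response = client.chat.completions.create(
        model="gpt-4o-mini",     # hypothetical choice of model
        messages=[{"role": "user", "content": fragment}],
    )
    reply = response.choices[0].message.content or ""
    print(f"PROMPT: {fragment!r}")
    print(f"REPLY:  {reply[:200]}\n")
```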

Then we have the rather different situation of things which are (likely to have been) in the training corpus, but do not show up when prompted for. Years ago I heard Dizzy Gillespie play at the Left Bank Jazz Society in Baltimore. I blogged about it twice, both times well before the cut-off date for training. I have no way of knowing whether or not those blog posts were actually in the training corpus, but I gave ChatGPT the following prompt: "Dizzy Gillespie plays for the Left Bank Jazz Society in Baltimore's Famous Ballroom." It didn't recognize the event and gave a confused reply. I then named the two blogs where I'd posted about the concert. Again, nothing.

Since I don't know whether or not those blog posts were actually in the training corpus, I don't know quite what to think. But my default belief at this time is that there's a bunch of stuff in the corpus that never shows up in response to prompting, because the events appeared only rarely in the corpus.

Between those two cases we have something like the Johnstown flood of 1889, a historical event of some moderate importance, but certainly not as prominent as, say, the bombing of Pearl Harbor. I prompted ChatGPT with “Johnstown flood, 1889,” and got a reasonable response. (Having grown up in Johnstown, I know something about that flood.) I issued the same prompt at a different session and again got a reasonable response, but one that was different from the first.

I’ve written this up in a blog post: https://new-savanna.blogspot.com/2023/09/what-must-be-case-that-chatgpt-would.html

Best overview on the subject of reasoning to date! Great read.

Sep 11, 2023 · Liked by Melanie Mitchell

What an enlightening post! In the same vein, look at a recent talk by Evelina Fedorenko from MIT (https://bit.ly/40TDF9I) based on direct brain observations (EEG, fMRI, etc.). She shows that language and thought overlap only weakly, even in the human brain.

I know some LLMs that can reason a lot better than some humans I know

Hello Melanie,

First of all, I would like to thank you for having periodically rekindled my interest in maths and computer science - each one of your three technical books (Genetic Algorithms, Analogy making and Complexity) made me reëvaluate entire fields, and your SFI lectures are my go-to suggestion for whoever shows an interest in modeling biological or physical systems.

Because of this, I spent the year preceding last January hoping that our opinions would somehow converge - trust me when I say that, from the start, my default stance had been: "OK, I got it wrong. Let's see how."

Unfortunately, that search proved fruitless. I have seen incredible behaviours emerging from LLMs and today, looking through your examples, I perceive something that in pretty much anyone else I would be drawn to chalk up to confirmation bias. Please bear with me - if you object to any of my claims and observations, there's an offer at the end.

1. Whose human level?

As you note at the end, most of the examples could be chalked up to perfectly humdrum phenomena, common to humans as well as machines (more salient and common connections are easier to reason with; arbitrarily changing some syntax rules in a programming language leaves less short-term memory for solving the problem).

On the other hand, you underline how, unlike machines, "humans are (at least in some cases) capable of abstract, content-independent reasoning". I found this puzzling: you have yourself highlighted that the models' performance was surely better than chance - better, I would wager, than the median human's results. That must count as "at least in some cases".

("if given enough time", by the way, is the operative phrase here - and could help explain how come that, while the "reasoning steps" seem tacked on, their presence still leads to a higher chance of correct responses).

2. Whatever happened to analogies?

It seems to me that current systems excel at them, and just two years ago analogies based on deep, complex structures - "to discover insightful analogies, and to do so in a psychologically realistic way" - were still an acceptable benchmark, and one that seemed likely unattainable. What changed? Or, if you're still of the same opinion, what kind of demonstration would change your mind?

Speaking of which, my offer is as follows.

If you could provide me with a problem that could be solved in an analogical* fashion - or a set thereof, or even pointers towards a class of such problems - which, if solved by one current SOTA LLM, would change your mind on the matter, I would be more than happy to provide a rigorous system able to produce a replicable solution.

Mostly, I think LLMs are an amazing tool for expanding one's space of possibility, and if there's anything I could do to convince you to give them a fair chance, that would be a surefire way for me to secure a ticket to consequentialist heaven.

Thank you again so much for all you've done!

Lumps

* i.e., not in an algebraic/formal way: something that a brilliant, sensitive, and intuitive humanities student with nothing but the rudiments of college math and computer science could solve

"leading to the hypothesis that LLMs do not perform robust abstract reasoning to solve problems, but instead solve problems (at least in part) by identifying patterns in their training data that match, or are similar to, or are otherwise related to the text of the prompts they are given." This would be obvious. LLMs learn on training data, which is all the data they have access to. They can't form original ideas. When asked a question, they search their known corpus for a match. The order in which they search, find results, then use those results depends in large part on the human instructions. LLMs do not "think" in any way. They are a new form of structured data that users need to know how to search properly in order to get usable results.
