What should we believe about the reasoning abilities of today’s large language models? As the headlines above illustrate, there’s a debate raging over whether these enormous pre-trained neural networks have achieved humanlike reasoning abilities, or whether their skills are in fact “a mirage.”
Sep 11, 2023 · edited Sep 12, 2023 · Liked by Melanie Mitchell
I’m interested in the question of an LLM “memorizing” something. When I prompt ChatGPT with “To be or not to be”–nothing more–it returns the entire soliloquy along with a bit of commentary. Given that it was trained to predict the next word, what must have been the case for it to be able to return that entire soliloquy, word for word?
Shakespeare’s “Hamlet” is a well-known play and likely appeared many times in the training corpus. That soliloquy is also well-known and probably appeared many times independently of the play itself, along with commentary. Given that GPT encountered that particular string many times, it makes sense that it should have “memorized” it, whatever that means.
Now, what happens when you prompt it with a phrase from the soliloquy? I opened a new session and prompted it with “The insolence of office”. It returned pretty much the entire soliloquy. “The slings and arrows” (another new session) returned the first five lines of the soliloquy (it begins the third line). Then I prompted it with “and sweat under a” (new session), which is from the middle of a line a bit past the middle. ChatGPT didn’t recognize it, but then did so when I told it that it was from Hamlet’s famous soliloquy. I think this is worth further exploring – prompting with phrases from various locations – but haven’t done so yet.
Then we have the rather different situation of things which are (likely to have been) in the training corpus, but do not show up when prompted for. Years ago I heard Dizzy Gillespie play at the Left Bank Jazz Society in Baltimore. I blogged about it twice, both times well before the cut-off date for training. I have no way of knowing whether or not those blog posts were actually in the training corpus, but I gave ChatGPT the following prompt: “Dizzy Gillespie plays for the Left Bank Jazz Society in Baltimore’s Famous Ballroom.” It didn’t recognize the event and gave a confused reply. I then named the two blogs where I’d posted about the concert. Again, nothing.
Since I don’t know whether or not those blog posts were actually in the training corpus, I don’t know quite what to think. But my default belief at this time is that there’s a bunch of stuff in the corpus that never shows up in response to prompting because the events didn’t appear very often in the corpus.
Between those two cases we have something like the Johnstown flood of 1889, a historical event of some moderate importance, but certainly not as prominent as, say, the bombing of Pearl Harbor. I prompted ChatGPT with “Johnstown flood, 1889,” and got a reasonable response. (Having grown up in Johnstown, I know something about that flood.) I issued the same prompt at a different session and again got a reasonable response, but one that was different from the first.
I’ve now prompted ChatGPT with thirteen (13) snippets (I won’t call them phrases because, technically, many of them are not phrases, just strings of words), four (4) from line beginnings, and nine (9) from somewhere in the interior of a line. It correctly located all of the line-initial snippets, though it responded to them in various ways. Of the nine interior snippets, it identified only two (2) correctly. In one of those cases, the snippet in question, “what dreams may come,” has a use outside the play, which ChatGPT points out.
It responded in various ways to the snippets it was unable to identify, in some cases offering fairly elaborate interpretive commentary. In the two cases where it correctly located the snippet it also quoted enough of the soliloquy to establish context.
And then there’s the peculiar case of this prompt: “make cowards of us all.” It is from one of the best-known lines in the play, one often quoted on its own: “Thus conscience does make cowards of us all.” I expected the Chatster to identify it. But it did not. So I decided to help it a bit.
I opened a new session and prompted it with: “does make cowards of us all.” The addition of that one word, “does” was all the Chatster needed. It quoted most of the soliloquy in response.
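For anyone who wants to repeat this kind of probe more systematically, here is a minimal sketch of how the two kinds of snippet could be cut from a text. The function names are made up for illustration, the three lines are the familiar opening of the soliloquy, and nothing here calls ChatGPT; it only builds the prompt strings.

```python
# Opening lines of the soliloquy, used only as sample text.
LINES = [
    "To be, or not to be, that is the question:",
    "Whether 'tis nobler in the mind to suffer",
    "The slings and arrows of outrageous fortune,",
]

def line_initial_snippet(lines, line_idx, k):
    """Return the first k words of a line (a snippet at a line beginning)."""
    return " ".join(lines[line_idx].split()[:k])

def interior_snippet(lines, line_idx, start, k):
    """Return k words starting at word position `start` (1 or later) of a line."""
    if start < 1:
        raise ValueError("start must be >= 1 so the snippet is not line-initial")
    return " ".join(lines[line_idx].split()[start:start + k])

initial = line_initial_snippet(LINES, 2, 4)
interior = interior_snippet(LINES, 1, 2, 4)
print(initial)
print(interior)
```

Each string would then be pasted into a fresh session, keeping a tally of which ones are located, split by snippet type.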
On the whole, I find this satisfying. For what it’s worth, the fact that ChatGPT would be able to identify snippets from the beginning of a line, but not snippets from the interior, accords well with my intuitions about human psychology. I am an experienced musician – yes, a different medium, but one where serial order is important – and line beginnings are privileged loci. If, during practice or rehearsal, you are going to go over something again, perhaps several times, you’re likely to start at the beginning of a line, not the interior. The same is true when playing a tune “from memory.” You can’t start at any point in the sequence of notes. You have to start at an “access point.” If you know the tune well, it may have several access points for you, generally at a structural boundary. If not, you may only be able to access the tune from the beginning.
We know that in humans memory is not a passive process, like making a tape recording. It is an active process. It has a structure. That seems to be the case for ChatGPT as well. What mechanisms in the model allow it to do this?
I agree, it's in the statistics, starting with ngrams where N = 2 to [just what?].
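To make the n-gram point concrete, here is a toy sketch (a plain count-based model, which is an assumption for illustration and not how a transformer works). With N = 2 the word "be" is ambiguous and the continuation loops; with N = 4 there is enough context to reproduce the line word for word. Ties are broken in favor of the continuation seen first.

```python
from collections import Counter, defaultdict

def train(tokens, n):
    """Count which token follows each (n-1)-token context."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table

def continue_text(table, prompt, n, max_new=7):
    """Greedily append the most frequent next token for the current context."""
    out = list(prompt)
    for _ in range(max_new):
        context = tuple(out[-(n - 1):])
        if context not in table:
            break
        out.append(table[context].most_common(1)[0][0])
    return out

line = "to be or not to be that is the question".split()
prompt = ["to", "be", "or"]

bigram = continue_text(train(line, 2), prompt, 2)
fourgram = continue_text(train(line, 4), prompt, 4)
print(" ".join(bigram))
print(" ".join(fourgram))
```

How large N has to be for a whole soliloquy is exactly the open "[just what?]".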
"I have caught it red-handed writing hundreds lines of my own code..." What has it even "memorized" your code?
But I agree with your larger point, people need to learn to think in terms of character strings in addition to meaning.
I've just been experimenting with prompts from Lincoln's Gettysburg Address. When I give it prompts that are from the beginnings of lines, it always identifies them as coming from that text. When I give it prompts of arbitrary strings that run across syntactic boundaries it never identifies the text. Then, however, when I tell it: It's from a well-known speech, it is able to identify the speech. I just ran up a post about this: https://new-savanna.blogspot.com/2023/09/entry-points-into-memory-stream.html
What an enlightening post! In the same vein look at a recent talk by Evangelina Fedorenko from MIT (https://bit.ly/40TDF9I) based on direct brain observations (EEG, fMRI, etc.). She shows that language and thought overlap only weakly even in the human brain.
First of all, I would like to thank you for having periodically rekindled my interest in maths and computer science - each one of your three technical books (Genetic Algorithms, Analogy making and Complexity) made me reëvaluate entire fields, and your SFI lectures are my go-to suggestion for whoever shows an interest in modeling biological or physical systems.
Because of this, I have spent the year preceding last January hoping that our opinions would somehow converge - trust me when I say that, from the start, my default stance had been: "ok, i got it wrong. let's see how".
Unfortunately, that search proved fruitless. I have seen incredible behaviours emerging from LLMs and today, looking through your examples, I perceive something that in pretty much anyone else I would be drawn to chalk up to confirmation bias. Please bear with me - if you object to any of my claims and observations, there's an offer at the end.
1. Whose human level?
As you note at the end, most of the examples could be chalked up to perfectly humdrum phenomena, common to humans as well as machines (more salient and common connections are easier to reason with; arbitrarily changing some syntax rules in a programming language leaves less short-term memory to solve the problem).
On the other hand, you underline how, unlike machines, "humans are (at least in some cases) capable of abstract, content-independent reasoning". I found this puzzling: you have yourself highlighted how the performance of the models was surely better than chance - I would wager, better than the median human's results. That must count as "at least in some cases".
("if given enough time", by the way, is the operative phrase here - and could help explain how come that, while the "reasoning steps" seem tacked on, their presence still leads to a higher chance of correct responses).
2. Whatever happened to analogies?
It seems to me that the current systems excel at them. Just two years ago, analogies based on deep, complex structures (“to discover insightful analogies, and to do so in a psychologically realistic way”) were still an acceptable benchmark, and one that was likely unattainable. What changed? Or, if you're still of the same opinion, what kind of demonstration would change your mind?
Speaking of which, my offer is as follows.
If you could provide me with a problem that could be solved in an analogical* fashion - or a set thereof, or even pointers towards a class of such problems - which, if solved by one current SOTA LLM, would change your mind on the matter, I would be more than happy to provide a rigorous system able to produce a replicable solution.
Mostly, I think LLMs are an amazing tool for expanding one's space of possibility, and if there's anything I could do to convince you to give them a fair chance, that would be a surefire way for me to secure a ticket to consequentialist heaven.
Thank you again so much for all you've done!
Lumps
* i.e., in a way that is not algebraic/formal: something that a brilliant, sensitive and intuitive humanities student with nothing but rudiments of college math and computer science could solve
Thank you! The multimodal I/O required might be a stumbling block; would you be convinced by research rendering these tasks as Bongard problem-like challenges - ie, the goal would be for the system to identify the transformations that two sets of objects went through?
```I have seen incredible behaviours emerging from LLMs and today, looking through your examples, I perceive something that in pretty much anyone else I would be drawn to chalk up to confirmation bias. ```
Projection much?
I have seen countless and obvious failures emerging from state-of-the-art LLMs. Obvious self-contradictions within the scope of a single answer. Answers that entirely depend on the wording, rather than meaning of the question. Failures to understand basic sentences. Complete "forgetting" of some aspects of the previous prompt after just one extra interaction. If all this stuff doesn't indicate to you that there is a serious problem with assuming LLMs "reason", then you operate within a frame of reference where that's an unfalsifiable assumption.
"Reasoning" of those systems is not something a rational person should take for granted and demand evidence to the contrary. At best, it's a hypothesis in need of careful testing.
"leading to the hypothesis that LLMs do not perform robust abstract reasoning to solve problems, but instead solve problems (at least in part) by identifying patterns in their training data that match, or are similar to, or are otherwise related to the text of the prompts they are given." This would be obvious. LLMs learn on training data, which is all the data they have access to. They can't form original ideas. When asked a question, they search their known corpus for a match. The order in which they search, find results, then use those results depends in large part on the human instructions. LLMs do not "think" in any way. They are a new form of structured data that users need to know how to search properly in order to get usable results.
Thanks for this very clear exposition. Talking as if you are reasoning is different from reasoning. To claim that a system that is known to work by predicting the next word is reasoning is an extraordinary claim, but, as you point out, there is little critical evaluation. It should require extraordinary evidence. The public, the governments, and some computer scientists are being bamboozled into thinking that these models do things that they are incapable of doing without considering the possibility that they are just following the statistical language patterns that they have learned. This should be just basic science, but evidently not. Your critical thought is essential in the medium and long run. Thanks!
Thank you!
What is your model of cognition? Are you positive it doesn't involve predicting the next action based on previous states and external input, deep down - except the inputs and the objects of your actions are more highly dimensional than text?
Most models of cognition encompass both rule-based and what you might call automatic prediction-based behavior. In such models, the system first learns rules that it can execute slowly but successfully. With practice, these rules are converted into fast, automatic predictive procedures. Many people have noted that LLMs are trained only to produce the fast, automatic predictive procedure. This is why it would be very surprising if they have also learned the general rules, since no part of the architecture appears to support rule-based reasoning.
What is the minimal demonstration that would change your mind?
See my separate comment on the main post. There is no single minimal demonstration, as I would like to see many pieces of convergent evidence. I'm very interested in the mechanistic interpretation work, which seeks to uncover the internal mechanisms of these models. If there are internal mechanisms that can effectively behave like binding a constant to a variable in an abstract rule, that would be very exciting. In the human brain, the pre-frontal cortex recognizes when the human is in a novel situation where only rule-based reasoning can be applied. Finding a similar mechanism in an LLM would also be extremely interesting.
then it might be easier than i thought! i'd like to hear your take on this paper:
https://www.lesswrong.com/posts/nmxzr2zsjNtjaHh7x/actually-othello-gpt-has-a-linear-emergent-world
That paper addresses a different question, which is whether LLMs construct mental models of situations. While mental models are important, they do not correspond to abstract reasoning but instead support efficient "concrete" reasoning. The evidence in that paper is suggestive, but not conclusive.
From the Othello-GPT paper: “The main takeaway is that Othello-GPT does far better than chance in predicting legal moves when trained on both datasets.”
It still struggles with legal moves.
It may have a model for Othello, but an approximate, or wrong, model.
This comment is a step in the right direction of discussions that we, the community interested in artificial intelligence, ought to have. Cognition does "involve predicting the next action." The better question is: "Is predicting the next word sufficient to produce other cognitive processes?" A claim that an LLM uses reasoning, or thinking, or is sentient is a claim that it has certain cognitive properties. A language model can say that it is sentient, for example, but is that evidence that it is? Two subquestions: Is language sufficient to implement cognitive processes (such as reasoning)? Is the behavior sufficient to demonstrate these processes? Mitchell cites some articles that show that small changes in the language of a question yield different patterns of results. Some words demonstrate apparent presence of a property, but closely related words do not. There are lots of these examples. What I was mentioning was that even if a model behaves as if it had a cognitive process, that alone does not allow one to conclude that it does have that property. Affirming the consequent is a logical fallacy. Consider a situation "If X then Y," "If a model has reasoning, then it will solve reasoning problems." Observe "Y," observe the model solving reasoning problems. We would be wrong to conclude that this observation demonstrates that the model has reasoning; there could be another cause, such as being able to repeat the right words. If Lincoln was killed by a robot, then he is dead. Lincoln is dead. It would be erroneous, however, to conclude that Lincoln was killed by a robot. He was killed by a human, but he is still dead. Affirming the consequent is the dominant means by which people try to demonstrate that models have some cognitive capacity, but we also have to consider alternative potential causes. Actors may say lines as if they were mathematical geniuses without being, in fact, geniuses. Sounding like a genius is not the same as being a genius.
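For readers who want to see why the "If X then Y; Y; therefore X" pattern fails, a brute-force truth-table check makes it explicit. This is a small illustrative sketch; the helper names are invented here.

```python
from itertools import product

def implies(p, q):
    """Material implication: 'if p then q'."""
    return (not p) or q

def is_valid(premises, conclusion):
    """Valid means: no truth assignment makes all premises true and the conclusion false."""
    for x, y in product([False, True], repeat=2):
        if all(premise(x, y) for premise in premises) and not conclusion(x, y):
            return False
    return True

# Modus ponens: if X then Y; X; therefore Y.
modus_ponens = is_valid(
    [lambda x, y: implies(x, y), lambda x, y: x],
    lambda x, y: y,
)

# Affirming the consequent: if X then Y; Y; therefore X.
affirming_consequent = is_valid(
    [lambda x, y: implies(x, y), lambda x, y: y],
    lambda x, y: x,
)

print(modus_ponens, affirming_consequent)
```

The counterexample the check finds is X false, Y true: the model solves the problems (Y) without having the property (X), which is the Lincoln case.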
"Reasoning" is the application of knowledge to a problem statement that results in making certain information explicit, namely, an answer that was always there, it just wasn't written down yet.
If the knowledge permits derivation of the result in one step of pattern matching---equivalent to constraint satisfaction---then we can expect many architectures to work. If instead, multiple steps are required, then the problem becomes one of search. Search requires keeping track of intermediate results, along with info to navigate the search space.
Where might an LLM hold this information?
In a transformer, in order to parse and emit natural language, the early and late layers probably must attend primarily to lexical and syntactic matters. Presumably the middle layers are the ones that can afford to represent semantics, including various forms of abstraction. However, the activation vectors must share different time scales of information as the residual stream gets modified: local linguistic patterns must be carried from input to output, but through superposition, activations in the middle layers also carry longer range pressures and constraints associated with categories and manifolds underlying the structure of the problem domain and the problem statement.
Does a transformer LLM have enough room to hold intermediate steps of a complex reasoning task?
The amount of activation vector capacity is the depth of the network times the number of tokens in play. Chain-of-Thought expands capacity by allocating tokens in the context window to *procedural steps* in the reasoning process. Instead of the model having to internally shoehorn placeholders for its location in the search tree into a short context consisting of the problem statement and a compact output, it now has the luxury of seeing where it is in the search space there in writing (output tokens representing steps along the chain of thought). And crucially, intermediate results are made explicit in the context window, which allows subdivision of a large and complex reasoning process into smaller parts, each of which can be solved with a smaller, more tractable pattern matching step.
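As a back-of-the-envelope illustration of the "depth times tokens" estimate, here is a tiny sketch. All the numbers are made up, and the count deliberately ignores the width of each activation vector.

```python
def activation_slots(num_layers, num_tokens):
    """Rough capacity: one activation vector per layer per token position."""
    return num_layers * num_tokens

# Hypothetical sizes, chosen only for illustration.
layers = 48
prompt_tokens = 60
answer_tokens = 5
chain_of_thought_tokens = 200

direct = activation_slots(layers, prompt_tokens + answer_tokens)
with_cot = activation_slots(
    layers, prompt_tokens + chain_of_thought_tokens + answer_tokens
)

print(direct, with_cot, with_cot / direct)
```

Under these assumed sizes the written-out steps roughly quadruple the number of activation vectors available while the answer is being produced.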
Transformers are not the only LLM architecture. Other architectures, especially ones that hold a great deal of internal state that is not closely constrained to text tokens, might behave quite differently.
One thing we should look for to detect abstract reasoning is cases where abstract reasoning leads to errors. The classic examples are things like "Birds can fly, a penguin is a kind of bird, therefore a penguin flies." Abstract reasoning works by discarding most (or all) of the context and applying an abstract rule to draw a conclusion. This is dangerous, of course, and can lead to faulty conclusions. I suspect that human inference includes a third step, checking the conclusion for plausibility by referring back to what is known about the context. I guess we would need to create novel contexts where the AI system lacks (pre-training) experience in order to detect these kinds of errors.
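A toy sketch of the two-stage picture described above: apply the abstract rule blindly, then check the conclusion against what is known about the specific case. The tiny knowledge base and function names are invented for illustration.

```python
# Invented toy knowledge base.
IS_A = {"penguin": "bird", "sparrow": "bird"}
RULES = {"bird": "can fly"}
KNOWN_FACTS = {("penguin", "can fly"): False}

def apply_rule(entity):
    """Abstract step: ignore context and apply the category-level rule."""
    return RULES[IS_A[entity]]

def apply_rule_with_check(entity):
    """Abstract step followed by a plausibility check against known facts."""
    conclusion = apply_rule(entity)
    if KNOWN_FACTS.get((entity, conclusion)) is False:
        return None  # conclusion contradicted by what is known about this case
    return conclusion

print(apply_rule("penguin"))
print(apply_rule_with_check("penguin"))
print(apply_rule_with_check("sparrow"))
```

The telltale error is the output of the unchecked version: a system that confidently says the penguin flies is showing the signature of rule application rather than recall.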
Thanks for this insightful post.
"Take a deep breath and work step-by-step!" is now more effective in some tasks than "Let’s think step by step". https://arxiv.org/pdf/2309.03409.pdf
Seems the teams at DeepMind were right all along. The same patterns of -content effects- affect both humans and LLMs.
Thanks for a very insightful and detailed discussion on the cognitive ability and in some sense a step towards understanding what creativity means in the context of LLM/GPT.
I am not sure what your thoughts are on this (or whether you know of any discussions related to it), but on reasoning and creativity, I think these are very important and practical questions. As we deal with the legality of patents and copyrights, the central questions increasingly revolve around the intellectual capabilities of the AI. So these are no longer academic or merely philosophical questions but issues that will shape AI legislation and its impact.
Did you catch the paper "The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python"?
That would have made a perfect inclusion in this article.
https://arxiv.org/abs/2305.15507
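For anyone unfamiliar with the idea, here is a small sketch of what an identifier swap looks like, written in the spirit of the paper's title rather than copied from it. The function names are made up. The code is perfectly legal Python, but a reader (or model) relying on the usual meaning of the names will predict the wrong behavior.

```python
import builtins

def length_after_swap(items):
    # Local names deliberately swapped: `sorted` now refers to built-in len,
    # and `len` now refers to built-in sorted.
    len, sorted = builtins.sorted, builtins.len
    return sorted(items)  # actually returns the number of items

def ordered_after_swap(items):
    # Same swap; `len` now returns a sorted copy of the list.
    len, sorted = builtins.sorted, builtins.len
    return len(items)

print(length_after_swap([3, 1, 2]))
print(ordered_after_swap([3, 1, 2]))
```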
Really appreciated this clear discussion, thank you!
As with most AI discussions these days, we get so caught up in the glitzy outputs we don’t stop to define our terms or think robustly about what is required to test. Thank you for helping cut through the noise!
The discourse around the reasoning abilities of LLMs is indeed multi-faceted. While the emergence of CoT prompting has unveiled certain latent capabilities in these models, the depth of true reasoning versus sophisticated pattern recognition remains an open question. The studies you mentioned indeed hint at a more complex interplay between memorization and reasoning, showcasing not just the strides made in AI development but also the intricate pathway that lies ahead in achieving genuine artificial general intelligence.
Exciting times ahead!
I truly value this informative article, thank you! We have featured your article in today's newsletter. :)
That's a wonderful summary of the SoTA, Melanie. I have written about the supposed "emergent" features here:
https://manlius.substack.com/p/navigating-the-transformative-potential
providing a perspective from complexity science.
I’ve written this up in a blog post: https://new-savanna.blogspot.com/2023/09/what-must-be-case-that-chatgpt-would.html
I've continued with these experiments.
I’ve now prompted ChatGPT with thirteen (13) snippets (I won’t call them phrases because, technically, many of them are not phrases, just strings of words), four (4) from line beginnings, and nine (9) from somewhere in the interior of a line. It correctly located all of the line-initial snippets, though it responded to them in various ways. Of the interior snippets, it identified only two (2) correctly. In one of those cases, the snippet in question, “what dreams may come,” has a use outside the play, which ChatGPT points out.
It responded in various ways to the snippets it was unable to identify, in some cases offering fairly elaborate interpretive commentary. In the two cases where it correctly located the snippet it also quoted enough of the soliloquy to establish context.
And then there’s the peculiar case of this prompt: “make cowards of us all.” It is from one of the best-known lines in the play, one often quoted on its own: “Thus conscience does make cowards of us all.” I expected the Chatster to identify it. But it did not. So I decided to help it a bit.
I opened a new session and prompted it with: “does make cowards of us all.” The addition of that one word, “does” was all the Chatster needed. It quoted most of the soliloquy in response.
On the whole, I find this satisfying. For what it’s worth, the fact that ChatGPT would be able to identify snippets from the beginning of a line, but not snippets from the interior, accords well with my intuitions about human psychology. I am an experienced musician – yes, a different medium, but one where serial order is important – and line beginnings are privileged loci. If, during practice or rehearsal, you are going to go over something again, perhaps several times, you’re likely to start at the beginning of a line, not the interior. The same is true when playing a tune “from memory.” You can’t start at any point in the sequence of notes. You have to start at an “access point.” If you know the tune well, it may have several access points for you, generally at a structural boundary. If not, you may only be able to access the tune from the beginning.
We know that in humans memory is not a passive process, like making a tape recording. It is an active process. It has a structure. That seems to be the case for ChatGPT as well. What mechanisms in the model allow it to do this?
This post gives a complete record of the sessions: https://new-savanna.blogspot.com/2023/09/to-be-or-not-snippets-from-soliloquy.html
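For anyone who wants to repeat this kind of probing, here is a minimal sketch of how the snippets could be labeled before prompting. It only checks whether a snippet starts a line or sits inside one; it does not call ChatGPT. The function names and the four-line excerpt are illustrative assumptions, not the setup used in the sessions above.

```python
import re

# Opening lines of the soliloquy (public-domain text), used as the reference.
SOLILOQUY_LINES = [
    "To be, or not to be, that is the question:",
    "Whether 'tis nobler in the mind to suffer",
    "The slings and arrows of outrageous fortune,",
    "Or to take arms against a sea of troubles",
]

def normalize(text):
    """Lowercase and keep only words, so punctuation does not affect matching."""
    text = text.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    return re.findall(r"[a-z']+", text)

def classify_snippet(snippet, lines):
    """Return 'line-initial', 'interior', or 'not found' for a snippet.

    Hypothetical helper: a snippet is matched only within a single line,
    and the first matching line decides the label.
    """
    words = normalize(snippet)
    if not words:
        return "not found"
    n = len(words)
    for line in lines:
        line_words = normalize(line)
        for start in range(len(line_words) - n + 1):
            if line_words[start:start + n] == words:
                return "line-initial" if start == 0 else "interior"
    return "not found"
```

With labels like these, one could tabulate how often the model locates line-initial versus interior snippets, rather than relying on impressions from individual sessions.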
I agree, it's in the statistics, starting with ngrams where N = 2 to [just what?].
"I have caught it red-handed writing hundreds lines of my own code..." What has it even "memorized" your code?
But I agree with your larger point, people need to learn to think in terms of character strings in addition to meaning.
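To make the N = 2 case concrete, here is a toy bigram sketch: it counts which word follows which in a string and predicts the most frequent follower. This is only an illustration of n-gram statistics over character strings, with hypothetical function names; it is not a description of how ChatGPT works, since an LLM conditions on far longer contexts.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, how often each other word follows it."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never followed."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

counts = train_bigrams("to be or not to be that is the question")
print(predict_next(counts, "to"))
```

A model this small can only continue a memorized string one word at a time from the previous word alone, which is why it says nothing yet about where the upper end of N lies.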
I've just been experimenting with prompts from Lincoln's Gettysburg Address. When I give it prompts that are from the beginnings of lines, it always identifies them as coming from that text. When I give it prompts of arbitrary strings that run across syntactic boundaries it never identifies the text. Then, however, when I tell it that it's from a well-known speech, it is able to identify the speech. I just put up a post about this: https://new-savanna.blogspot.com/2023/09/entry-points-into-memory-stream.html
Best overview on the subject of reasoning to date! Great read
What an enlightening post! In the same vein, look at a recent talk by Evelina Fedorenko from MIT (https://bit.ly/40TDF9I) based on direct brain observations (EEG, fMRI, etc.). She shows that language and thought overlap only weakly even in the human brain.
I know some LLMs that can reason a lot better than some humans I know
Hello Melanie,
First of all, I would like to thank you for having periodically rekindled my interest in maths and computer science - each one of your three technical books (Genetic Algorithms, Analogy making and Complexity) made me reëvaluate entire fields, and your SFI lectures are my go-to suggestion for whoever shows an interest in modeling biological or physical systems.
Because of this, I have spent the year preceding last January hoping that our opinions would somehow converge - trust me when I say that, from the start, my default stance had been: "ok, i got it wrong. let's see how".
Unfortunately, that search proved fruitless. I have seen incredible behaviours emerging from LLMs and today, looking through your examples, I perceive something that in pretty much anyone else I would be drawn to chalk up to confirmation bias. Please bear with me - if you object to any of my claims and observations, there's an offer at the end.
1. Whose human level?
As you note at the end, most of the examples could be chalked up to perfectly humdrum phenomena, common to humans as well as machine (more salient and common connections are easier to reason with; arbitrarily changing some syntax rules in a programming language leaves less short-term memory to solve the problem).
On the other hand, you underline how, unlike machines, "humans are (at least in some cases) capable of abstract, content-independent reasoning". I found this puzzling: you have yourself highlighted how the performance of the models was surely better than chance - I would wager, better than the median human's results. That must count as "at least in some cases".
("if given enough time", by the way, is the operative phrase here - and could help explain why, while the "reasoning steps" seem tacked on, their presence still leads to a higher chance of correct responses).
2. Whatever happened to analogies?
It seems to me that the current systems excel at them, and just two years ago analogies based on deep, complex structures (“to discover insightful analogies, and to do so in a psychologically realistic way”) were still an acceptable benchmark, and one that was likely unattainable. What changed? Or, if you're still of the same opinion, what kind of demonstration would change your mind?
Speaking of which, my offer is as follows.
If you could provide me with a problem that could be solved in an analogical* fashion - or a set thereof, or even pointers towards a class of such problems - which, if solved by one current SOTA LLM, would change your mind on the matter, I would be more than happy to provide a rigorous system able to produce a replicable solution.
Mostly, I think LLMs are an amazing tool for expanding one's space of possibility, and if there's anything I could do to convince you to give them a fair chance, that would be a surefire way for me to secure a ticket to consequentialist heaven.
Thank you again so much for all you've done!
Lumps
* i.e., in a non-algebraic, non-formal way: something that a brilliant, sensitive and intuitive humanities student with nothing but rudiments of college math and computer science could solve
Thanks for your comments! You might find my earlier post on the ARC challenge interesting as an analogy domain that current AI systems still have trouble with. See https://aiguide.substack.com/p/why-the-abstraction-and-reasoning
Thank you! The multimodal I/O required might be a stumbling block; would you be convinced by research rendering these tasks as Bongard problem-like challenges - ie, the goal would be for the system to identify the transformations that two sets of objects went through?
"I have seen incredible behaviours emerging from LLMs and today, looking through your examples, I perceive something that in pretty much anyone else I would be drawn to chalk up to confirmation bias."
Projection much?
I have seen countless and obvious failures emerging from state-of-the-art LLMs. Obvious self-contradictions within the scope of a single answer. Answers that entirely depend on the wording, rather than meaning of the question. Failures to understand basic sentences. Complete "forgetting" of some aspects of the previous prompt after just one extra interaction. If all this stuff doesn't indicate to you that there is a serious problem with assuming LLMs "reason", then you operate within a frame of reference where that's an unfalsifiable assumption.
"Reasoning" of those systems is not something a rational person should take for granted and demand evidence to the contrary. At best, it's a hypothesis in need of careful testing.
you might want to read my comment til the end.
"leading to the hypothesis that LLMs do not perform robust abstract reasoning to solve problems, but instead solve problems (at least in part) by identifying patterns in their training data that match, or are similar to, or are otherwise related to the text of the prompts they are given." This would be obvious. LLMs learn on training data, which is all the data they have access to. They can't form original ideas. When asked a question, they search their known corpus for a match. The order in which they search, find results, then use those results depends in large part on the human instructions. LLMs do not "think" in any way. They are a new form of structured data that users need to know how to search properly in order to get usable results.