13 Comments
May 22 · Liked by Melanie Mitchell

A couple of preliminary thoughts, for what they’re worth.

Hofstadter's string completion examples constitute a very rarefied case of reasoning by analogy, with the task reduced to completion of a syntactic or iconographic pattern as per a functionally defined mapping between patterns. In some sense, that was the point, but this compels the question of how much even locally scalable performance on such bare-essence examples is really telling us. As deployed in vivo, as opposed to in vitro, metaphors and analogies are not arbitrary inventions of the moment. They usually serve the purpose of furthering learning and understanding on the basis of historical convention and use precedent (for example, if one knows the account of the Trojan Horse, one can, on being told that the firewall of a LAN is analogous to the walls of Troy in a Trojan horse malicious code exploit, infer that the aim of the exploit is to transition malware through the firewall inside an innocuous wrapper). One wonders if the ability of LLMs to learn by way of this kind of metaphorical induction could be tested.

Here, I think we run up against a very fundamental problem, at least in the case of the GPTs. As regards the example, consider that the concept of a 'Trojan horse' computer virus or malicious exploit is very likely to be at least as well represented in the original training data as the Greek myth. The only way to be 100% sure that what appeared to be learning really was learning, and not reliance on innate knowledge, would be to test with a model whose original training set had been ablated of descriptions of the exploit type. Given that fine-grained experimental control over pretraining is out of scope thanks to OpenAI's reluctance to allow outsiders access or insight into this phase of the process, even very extensive performance testing of generalization and sample efficiency would seem to have only-slightly-better-than-anecdotal value. Is apparently novel output being synthetically induced, or was it in the knowledge base already? The failure to replicate Webb's original result may not be irrelevant here: as long as the GPTs are moving targets and their pretraining and infrastructure remain opaque, doing bona fide science with reproducible results is likely to prove frustrating.

The fact that the latest versions of GPT can deploy indexical code in support of problem solving is in fact quite impressive; and, modulo your point that humans seem to rely on a much vaguer sense of 'successorship', the possibility that our cortices include the equivalent of neural network transformers for doing some sort of scratchpad list processing is not a hypothesis I'd be prepared to exclude out of hand.

However, I have a couple of questions. First, did the LLM deploy the code spontaneously ('zero-shot', as it were), or did it have to be explicitly told, via some form of prompt engineering or fine-tuning, that this had to be done to solve the problem? And in that case, how explicit did the instruction need to be, and how many example cases did it use? Second, did any of the permuted-alphabet testing extend to cases involving principles of idempotence and composition? For example, consider the question 'A B C D is to A B C __ as what is to ZYXW VUTS RQPO ___?' (I would take the answer to be 'NMLK'). To be sure, it's to be expected that human analogical invention would also be outstripped at some point, but it might be interesting to determine whether LLM performance degrades along the same curve, as opposed to being better or worse. One has the sense that the contextualized evaluation of prior output is the locus of persistent problems for the current generation of neural network transformer architectures, and that the expressivity of requisite formalisms may have something to do with this.
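
For concreteness, here is a minimal sketch, in Python, of the kind of 'indexical' strategy at issue: letters are mapped to positions in a stated (possibly permuted) alphabet, the rule is inferred in index space, and the result is mapped back to letters. The function names and the ten-symbol permuted alphabet are my own illustration, not the code GPT-4 actually produced.

```python
# Illustrative sketch only -- not the code GPT-4 generated in the experiments.
def infer_last_letter_offset(source, target, alphabet):
    """Infer the offset applied to the final letter (e.g. 'successor' = +1)."""
    pos = {ch: i for i, ch in enumerate(alphabet)}
    return pos[target[-1]] - pos[source[-1]]

def apply_last_letter_offset(probe, offset, alphabet):
    """Apply the inferred offset to the final letter of the probe sequence."""
    pos = {ch: i for i, ch in enumerate(alphabet)}
    new_last = alphabet[(pos[probe[-1]] + offset) % len(alphabet)]
    return probe[:-1] + [new_last]

# A hypothetical permuted alphabet (only ten symbols, for brevity).
permuted = ["x", "l", "n", "z", "q", "b", "g", "v", "i", "w"]

# "x l n -> x l z" replaces the last letter with its successor *in the permuted order*.
offset = infer_last_letter_offset(["x", "l", "n"], ["x", "l", "z"], permuted)
print(apply_last_letter_offset(["b", "g", "v"], offset, permuted))  # ['b', 'g', 'i']
```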

author

Thanks for this thoughtful comment!

May 21 · Liked by Melanie Mitchell

Interesting essay. My understanding, probably wrong or confused, is that LLMs, unlike humans, cannot suspend their accumulated experience/training data to abstract the analogy. In the functions/typographical symbols example, there isn't a history of those symbols being used together in that sequence…kinda like if you used food emojis as your sequence [🍕🍪🍢🥃🍱🌯🍯]. The GPT-4 code script used by the rebutting authors to index the "m for e" sequence back to the original alphabet is thus merely a bolt-on to redirect back to the training data…is that correct? If so, it presents a weird irony to me: normally it's humans who embody knowledge and LLMs that are disembodied, in that humans can intuit the hallucinations and errors LLMs make because LLMs aren't human; here, however, it's humans who show the ability to abstract (to suspend belief) while the LLMs are "embodied" in that they're locked into their training data.
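
To make the 'bolt-on' worry concrete, here is a tiny illustration of my own, reusing the emoji alphabet above as an arbitrary symbol set: once symbols are reduced to indices, the 'successor' rule is the same whether the alphabet is a–z, a permuted a–z, or food emoji, so the indexing step simply maps an unfamiliar surface form back onto familiar ordinal positions.

```python
# Illustration only: a "successor" rule over an arbitrary symbol alphabet.
emoji_alphabet = ["🍕", "🍪", "🍢", "🥃", "🍱", "🌯", "🍯"]

def successor(symbol, alphabet):
    """Return the next symbol in the stated alphabet (wrapping at the end)."""
    i = alphabet.index(symbol)
    return alphabet[(i + 1) % len(alphabet)]

# "🍕 🍪 🍢 -> ?" under the rule "replace the last symbol with its successor".
source = ["🍕", "🍪", "🍢"]
completed = source[:-1] + [successor(source[-1], emoji_alphabet)]
print(" ".join(completed))  # 🍕 🍪 🥃
```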

May 21 · Liked by Melanie Mitchell

I find this very interesting. I've done a fair amount of informal work on ChatGPT's ability to deal with analogical reasoning. Here's a long post presenting my most recent thinking on that work: Intelligence, A.I. and analogy: Jaws & Girard, kumquats & MiGs, double-entry bookkeeping & supply and demand, https://new-savanna.blogspot.com/2024/05/intelligence-ai-and-analogy-jaws-girard.html

The first part of the post deals with ChatGPT's ability to interpret a film (Jaws). That requires setting up an analogy between events in the film and stylized situations as characterized by some interpretive framework. ChatGPT has some capacity to do this; I've tested it on other examples as well. The third and last section concerns the history of economics. Following remarks by an economist, Tyler Cowen, I hypothesized that perhaps economists were first able to conceptualize the phenomenon of supply and demand by analogizing it to double-entry bookkeeping. I asked ChatGPT to explicate that analogy, which it did quite well. Of course, it is one thing to explicate an analogy you've been given, and something else to come up with a (useful) analogy in the first place.

In between those two discussions I explain what happened when I asked ChatGPT to explain the following analogies:

A kitten and a bicycle.

A bicycle and a food blender.

A kitten and a clock.

A Jack-in-the-box and the moon.

A clock and a sea anemone.

A garbage heap and a submarine.

A submarine and a gourmet Chinese banquet.

A helicopter and a Beethoven piano sonata.

A novel by Tolstoy and a MiG fighter jet.

A three-dimensional matrix and the Christian Trinity.

It came up with a "coherent" account of every one, despite the fact that they're nonsensical. How do we distinguish between a reasonable analogy and a nonsensical one?

May 25 · edited May 25 · Liked by Melanie Mitchell

From Webb's response ( https://arxiv.org/abs/2404.13070 ):

"Most notably, such problems require that letters be converted into the corresponding indices in the permuted alphabet, a process that depends on the ability to precisely count the items in a list"

This response makes no sense. The only thing that changed between counter-factual tasks and original tasks was the alphabet. If the counter-factual task required precise letter-counting, then so did the original. In other words, this does not in any way explain the observed degradation in performance.

"It is well known that language models have difficulty with counting[...]"

This is a deceptive response. Yes, LLMs have difficulty with counting. They also have serious issues with tasks that don't require any complex counting beyond "one-many" or "first-other". This type of counting is something even (some) insects are capable of and failure at such tasks indicates a much more fundamental limitation.
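
To make the point concrete, here is a small Python illustration (my own, not from either paper): converting a letter to its index is the same lookup whether the alphabet is the standard a–z or a permuted one, so whatever "counting" burden exists applies equally to the original and counterfactual tasks.

```python
# Illustration only: index lookup is identical for standard and permuted alphabets.
import string

standard = list(string.ascii_lowercase)
permuted = standard[13:] + standard[:13]  # one example permutation (rot13 ordering)

def letter_to_index(letter, alphabet):
    """Counting-style lookup: walk the alphabet until the letter is found."""
    for i, ch in enumerate(alphabet):
        if ch == letter:
            return i
    raise ValueError(f"{letter!r} not in alphabet")

print(letter_to_index("e", standard))  # 4
print(letter_to_index("e", permuted))  # 17 -- same procedure, different alphabet
```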

May 31 · Liked by Melanie Mitchell

Your approach to LLMs and AI is so refreshing. It's just as important to understand the limitations of these tools as it is to understand their capabilities, and your approach to doing this work, and to educating people about these constraints, is so beautifully generalizable/applicable. I appreciate the work you're doing!

May 22 · Liked by Melanie Mitchell

Great article. Unless I'm reading it incorrectly, they didn't test the 'symbol alphabet' in their response paper. Considering GPT performed worst on those in your research, that's interesting.

Also, their paper had some interesting caveats they should have explored more, given their premises:

- "GPT-4 sometimes refused to provide an answer, insisting that there was too much uncertainty

about the pattern underlying the analogy problem. In these cases, GPT-4’s response was re-generated until it provided an answer."

- "A notable exception was for problems involving the extend-sequence transformation and

an interval size of 2, where GPT-4 + code execution answered 9/10 problems incorrectly. On these

problems, GPT-4 did not invoke code execution to identify the interval between the letters in the

source sequences, and mistakenly identified an interval size of 1, then using the wrong interval size

to complete the analogy."
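
For readers unfamiliar with the problem type, here is my reconstruction (an assumption about the setup, not the authors' actual generator) of an extend-sequence problem with an interval size of 2, showing why misreading the interval as 1 gives a wrong completion.

```python
# Illustration only: an "extend-sequence" problem with interval size 2.
import string

alphabet = list(string.ascii_lowercase)

def extend_sequence(seq, interval, alphabet):
    """Append the next element, continuing the sequence's fixed interval."""
    idx = alphabet.index(seq[-1])
    return seq + [alphabet[idx + interval]]

source = ["a", "c", "e", "g"]                 # interval size is 2
print(extend_sequence(source, 2, alphabet))   # ['a', 'c', 'e', 'g', 'i']  (correct)
print(extend_sequence(source, 1, alphabet))   # ['a', 'c', 'e', 'g', 'h']  (the interval-1 mistake)
```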

May 21 · Liked by Melanie Mitchell

"

Our paper concluded with this: “These results imply that GPT models are still lacking the kind of abstract reasoning needed for human-like fluid intelligence.”

"

Emphasis on "human-like." I agree with that.

Testing some of these tasks on GPT-4o produced responses that tried to consider every possible answer. It's worth acknowledging that these language models tend to be 'conservative' about committing to a single answer when a task lacks one clearly correct response. Even with prompts urging a straightforward answer, they may pick one of the available options more or less at random. Perhaps asking for a 'human-like' response to the task could improve accuracy.

Great discussion!

May 21 · Liked by Melanie Mitchell

Love it, as always, love the harking back to "copycat" and especially love not having to wade through the actual literature on LLMs, although I might now be inspired to read your paper and especially Webb's response.


Great article, Melanie. Your research highlights the limitations of LLMs in abstract reasoning compared to human intelligence. What do you believe are the most promising approaches or technologies to bridge this gap in the future, and how far do you think we are from achieving truly human-like abstract reasoning in AI?

May 21 · edited May 21

It seems like large models do develop "concepts," but they are based on the "predict what's next in the sequence" first principle that governs their entire world view. "Understanding," in my opinion, emanates from the agent's first principles, and understanding cannot progress beyond those first principles. Ours bottoms out in the first cause of the universe. After that we have no understanding.


Great analysis! Looking forward to seeing this evolve as LLMs continue to advance in the coming years.

May 25 · edited May 25

I think it is possible to create analogical puzzles that do not rely on sequences. However, such puzzles would be much harder to generate programmatically at scale. Set identity would be the obvious alternative: the "example" could be a set of things transformed into another set, and the "challenge" would be a third set that needs to be transformed in an analogous way.

I'm not sure this can be completely freed from general knowledge.

Example:

---

User:

Let’s solve a puzzle.

If we start with [ cow, horse, hay, grain ] we get [ cow, horse ].

If we start with [ pigeon, mouse, cat ] we get [ cat ].

If we start with [ banana, monkey, bat, grape ] what will we get?

Copilot:

This seems to be a pattern recognition puzzle. Let’s try to understand the pattern:

In the first example, [ cow, horse, hay, grain ] results in [ cow, horse ]. The cow and horse are both animals.

In the second example, [ pigeon, mouse, cat ] results in [ cat ]. The cat is an animal.

So, the pattern seems to be that the result is the last animal in the list.

Applying this pattern to the third example, [ banana, monkey, bat, grape ], the last animal in the list is the bat. So, the answer should be [ bat ].
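
Here is a rough sketch of how such set-identity puzzles might be generated programmatically. The category lists and the "keep only the animals" rule are hypothetical placeholders, and note that the selection rule still leans on general knowledge, which is exactly the caveat raised above.

```python
# Illustration only: generating set-identity analogy puzzles at small scale.
import random

ANIMALS = ["cow", "horse", "pigeon", "mouse", "cat", "monkey", "bat"]
FOODS = ["hay", "grain", "banana", "grape", "rice", "bread"]

def make_puzzle(keep, pool_a, pool_b, size=4, rng=random):
    """Build one (input_set, answer_set) pair; the answer keeps only the
    items for which the `keep` predicate holds."""
    items = rng.sample(pool_a, size // 2) + rng.sample(pool_b, size - size // 2)
    rng.shuffle(items)
    return items, [x for x in items if keep(x)]

is_animal = lambda x: x in ANIMALS
examples = [make_puzzle(is_animal, ANIMALS, FOODS) for _ in range(2)]   # shown to the model
challenge, hidden_answer = make_puzzle(is_animal, ANIMALS, FOODS)       # model must complete
print(examples)
print(challenge, hidden_answer)
```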
