Discussion about this post

Benjamin P Rode:

A couple of preliminary thoughts, for what they’re worth.

Hofstadter’s string-completion examples constitute a very rarefied case of reasoning by analogy, with the task reduced to completion of a syntactic or iconographic pattern as per a functionally defined mapping between patterns. In some sense, that was the point, but it raises the question of how much even locally scalable performance on such bare-essence examples is really telling us. As deployed in vivo, as opposed to in vitro, metaphors and analogies are not arbitrary inventions of the moment. They usually serve the purpose of furthering learning and understanding on the basis of historical convention and use precedent (for example, if one knows the account of the Trojan Horse, one can, on being told that the firewall of a LAN is analogous to the walls of Troy in a Trojan-horse malicious code exploit, infer that the aim of the exploit is to move malware through the firewall inside an innocuous wrapper). One wonders whether the ability of LLMs to learn by way of this kind of metaphorical induction could be tested.

Here, I think we run up against a very fundamental problem, at least in the case of the GPTs. As regards the example, consider that the concept of a ‘Trojan horse’ computer virus or malicious exploit is very likely to be at least as well represented in the original training data as the Greek myth. The only way to be 100% sure that what appeared to be learning really was learning, and not reliance on innate knowledge, would be to test with a model whose original training set had been ablated of descriptions of the exploit type. Given that fine-grained experimental control over pretraining is out of scope thanks to OpenAI’s reluctance to allow outsiders access or insight into this phase of the process, even very extensive performance testing of generalization and sample efficiency would seem to have only-slightly-better-than-anecdotal value. Is apparently novel output being synthetically induced, or was it in the knowledge base already? The failure to replicate Webb’s original result may not be irrelevant here: as long as the GPTs are moving targets and their pretraining and infrastructure remain opaque, doing bona fide science with reproducible results is likely to prove frustrating.

The fact that the latest versions of GPT can deploy indexical code in support of problem solving is in fact quite impressive; and, modulo your point that humans seem to rely on a much vaguer sense of ‘successorship’, the possibility that our cortices include the equivalent of neural network transformers for doing some sort of scratchpad list processing is not a hypothesis I’d be prepared to exclude out of hand. However, I have a couple of questions. First, did the LLM deploy the code spontaneously (‘zero-shot’, as it were), or did it have to be explicitly told, via some form of prompt engineering or fine-tuning, that this had to be done to solve the problem? And in that case, how explicit did the instruction need to be, and how many example cases did it use? Second, did any of the permuted-alphabet testing extend to cases involving principles of idempotence and composition? For example, consider the question ‘A B C D is to A B C __ as what is to ZYXW VUTS RQPO ___?’ (I would take the answer to be ‘NMLK’). To be sure, it’s to be expected that human analogical invention would also be outstripped at some point, but it might be interesting to determine whether LLM performance degrades along the same curve, as opposed to being better or worse. One has the sense that the contextualized evaluation of prior output is the locus of persistent problems for the current generation of neural network transformer architectures, and that the expressivity of requisite formalisms may have something to do with this.
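For concreteness, here is a minimal sketch (my own gloss, not anything from the work under discussion) of how such permuted-alphabet continuation probes might be generated and scored mechanically. The ‘next block of four in the permuted alphabet’ rule is just one defensible reading of the example above, and the next_block helper is hypothetical:

import string

def next_block(alphabet: str, given_blocks: list, size: int = 4) -> str:
    # Return the next `size`-letter block of `alphabet`, assuming the given
    # blocks are its initial consecutive blocks.
    start = len(given_blocks) * size
    return alphabet[start:start + size]

reversed_alphabet = string.ascii_uppercase[::-1]   # 'ZYXWVUTSRQPONMLKJIHGFEDCBA'
given = ["ZYXW", "VUTS", "RQPO"]

print(next_block(reversed_alphabet, given))        # -> 'NMLK'

# Applying the helper twice yields 'JIHG', the sort of composed case that the
# question about idempotence and composition points toward.
print(next_block(reversed_alphabet, given + [next_block(reversed_alphabet, given)]))

Comparing a model’s answers against rule-derived continuations of this kind, across many permutations and block sizes, would at least make the degradation curve measurable.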

Bill Benzon:

I find this very interesting. I've done a fair amount of informal work on ChatGPT's ability to deal with analogical reasoning. Here's a long post presenting my most recent thinking on that work: Intelligence, A.I. and analogy: Jaws & Girard, kumquats & MiGs, double-entry bookkeeping & supply and demand, https://new-savanna.blogspot.com/2024/05/intelligence-ai-and-analogy-jaws-girard.html

The first part of the post deals with ChatGPT's ability to interpret a film (Jaws). That requires setting up an analogy between events in the film and stylized situations as characterized by some interpretive framework. ChatGPT has some capacity to do this; I've tested it on other examples as well. The third and last section concerns the history of economics. Following remarks by the economist Tyler Cowen, I hypothesized that perhaps economists were first able to conceptualize the phenomenon of supply and demand by analogizing it to double-entry bookkeeping. I asked ChatGPT to explicate that analogy, which it did quite well. Of course, it is one thing to explicate an analogy you've been given, and something else to come up with a (useful) analogy in the first place.

In between those two discussions I explain what happened when I asked ChatGPT to explain the following analogies:

A kitten and a bicycle.

A bicycle and a food blender.

A kitten and a clock.

A Jack-in-the-box and the moon.

A clock and a sea anemone.

A garbage heap and a submarine.

A submarine and a gourmet Chinese banquet.

A helicopter and a Beethoven piano sonata.

A novel by Tolstoy and a MiG fighter jet.

A three-dimensional matrix and the Christian Trinity.

It came up with a "coherent" account of every one, despite the fact that they're nonsensical. How do we distinguish between a reasonable analogy and a nonsensical one?
