9 Comments

Hi Melanie, thank you for this analysis. I am only now getting into the prior literature on analogical reasoning and generalization, although I have used Gentner and Forbus's SME and SAGE systems within a larger cognitive reasoning framework.

I have a few questions about the experiment setup and evaluation:

1. A lot of the patterns don't provide enough information to induce a rule. A single example is not sufficient to determine which rule is the right one to answer a question (as you noted as well). Are there questions in which multiple examples constrain the hypothesis space so that there is no ambiguity? Can humans recognize that there are several applicable rules? Do you think that recognizing multiple hypotheses and asking for more information is a uniquely human skill?
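
To make the ambiguity point concrete, here is a toy sketch (my own construction, not from the post): with a small set of candidate rules, a single example can leave several hypotheses consistent, and a second example can prune them. The three rules are illustrative stand-ins, not Copycat's actual rule repertoire.

```python
import string

ALPHA = string.ascii_lowercase

def succ_last(s):
    # Replace the last letter with its alphabetic successor.
    return s[:-1] + ALPHA[(ALPHA.index(s[-1]) + 1) % 26]

def succ_all(s):
    # Replace every letter with its alphabetic successor.
    return "".join(ALPHA[(ALPHA.index(c) + 1) % 26] for c in s)

def reverse(s):
    # Reverse the string.
    return s[::-1]

RULES = {"succ_last": succ_last, "succ_all": succ_all, "reverse": reverse}

def consistent_rules(examples):
    # Keep only the rules that reproduce every (source, target) pair.
    return [name for name, rule in RULES.items()
            if all(rule(src) == tgt for src, tgt in examples)]

# One example can be ambiguous; a second example shrinks the hypothesis space.
print(consistent_rules([("a", "b")]))                # ['succ_last', 'succ_all']
print(consistent_rules([("a", "b"), ("mn", "mo")]))  # ['succ_last']
```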

2. The 'better than humans' claim is typically made by AI scientists who are looking at accuracy metrics. I was surprised to see similar claims from psychologists. I can see that several humans were much better than ChatGPT (especially in the story-analogy case); the average is lower, but the spread is really big. Is it standard to compare averages of human behavior?
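
To illustrate the averaging point with made-up numbers (entirely hypothetical, not data from the post): a model can beat the human mean while a sizable fraction of individual humans still beat the model.

```python
import statistics

# Hypothetical scores, chosen only to show a wide spread.
human_scores = [0.95, 0.90, 0.85, 0.60, 0.40, 0.35, 0.30]
model_score = 0.70

print("human mean:", round(statistics.mean(human_scores), 2))  # 0.62, below the model
print("humans above model:",
      sum(s > model_score for s in human_scores), "of", len(human_scores))  # 3 of 7
```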


Hi Melanie. Did you try prompts that encourage chain-of-thought reasoning or self-reflection on the initial answer? It seems like performance on some of these tasks might benefit from that.
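
Something along these lines is what I had in mind (a sketch of my own; the model name and client usage are assumptions based on the current openai Python package, not anything the post used):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROBLEM = "If the string mno changes to mnp, what does ijk change to?"

def ask(messages):
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    return resp.choices[0].message.content

# Plain prompt, no reasoning cue.
plain = ask([{"role": "user", "content": PROBLEM}])

# Chain-of-thought cue, then a self-reflection follow-up on the same answer.
history = [{"role": "user",
            "content": PROBLEM + " Think step by step before answering."}]
cot = ask(history)
history += [{"role": "assistant", "content": cot},
            {"role": "user",
             "content": "Re-examine your reasoning. Is the rule you applied "
                        "consistent with the given example? Revise if needed."}]
reflected = ask(history)

print(plain, cot, reflected, sep="\n---\n")
```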


I tried the letter-string analogies and got a wrong answer and a wrong explanation:

Here is a possible solution to the riddle:

mno -> mnp, ijk -> ilk

The pattern here is to change the last letter of the first word to the next letter in the alphabet, and to change the second letter of the second word to the next letter in the alphabet.
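
For contrast, here is a minimal sketch (mine, not from the post) of the rule most people induce from mno -> mnp, namely "replace the last letter with its alphabetic successor", which gives ijk -> ijl rather than ChatGPT's ilk:

```python
import string

def successor_of_last(s: str) -> str:
    # Replace the last letter with the next letter of the alphabet.
    alpha = string.ascii_lowercase
    return s[:-1] + alpha[(alpha.index(s[-1]) + 1) % 26]

assert successor_of_last("mno") == "mnp"  # reproduces the given example
print(successor_of_last("ijk"))           # ijl
```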


I read your book in French to better understand what AI can and cannot do; it was very interesting.

I am glad to read this detailed blog post. The Twitter posts were too short.


Really enjoyed this essay. We have so many articles discussing the technical details of autoregressive language models, as well as different metrics to evaluate training performance. However, there are still so many challenges around robust and common-sense benchmarks -- they are just hard to formalize. Anyway, I just wanted to say that this was a very refreshing read!
