Hi Melanie, thank you for this analysis. I am only now getting into the prior literature on analogical reasoning and generalization, although I have used Gentner and Forbus's SME and SAGE systems within a larger cognitive reasoning framework.
I have a few questions about the experiment setup and evaluation:
1. A lot of the patterns don't provide enough information to induce a rule. A single example is not sufficient to determine which rule is the right one to answer a question (you noted this as well). Are there questions with multiple examples that constrain the hypothesis space enough to remove the ambiguity? Can humans recognize that several rules are applicable? Do you think that recognizing multiple hypotheses and asking for more information is a uniquely human skill?
2. The 'better than humans' claim is typically made by AI scientists who are looking at accuracy metrics. I was surprised to see similar claims from psychologists. I can see that several humans did much better than ChatGPT (especially in the story-analogy case); the human average is lower, but the spread is really big. Is it standard to compare averages of human behavior?
Good questions. I think to answer #1 more probing is needed than was done in the Webb et al. paper. For #2, I think you're quite right -- it's questionable how to interpret averages of human behavior.
"However, I did try one of my favorites: abc —>abd, xyz —> ? GPT-3 returned the strange answer xye."
I can't help wondering if GPT-3 was applying the “basic successor” three-letter-string pattern to the phonetic spelling of 'z' as 'zee' to arrive at 'e'.
Hi Melanie. Did you try prompts that encourage chain of thought reasoning or self-reflection on the initial answer? It seems like performance on some of these tasks might benefit from that.
No, and I don't think Webb et al. tried this either. Feel free to try & report back. The problem, of course, is that these specific problems may have made it into the training set of these systems.
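For anyone who wants to try: here is a minimal sketch of such a prompt, assuming the openai Python client. The model name and the exact wording of the prompt are just placeholders, not anything from Webb et al.

# Sketch of a chain-of-thought style prompt for a letter-string analogy.
# Assumes the openai Python package (>= 1.0) and an API key in the
# OPENAI_API_KEY environment variable; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

prompt = (
    "If abc changes to abd, what does ijk change to? "
    "Think step by step: first state the rule you infer from abc -> abd, "
    "then apply it to ijk, and only then give your final answer."
)

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)

Separating "state the rule" from "apply the rule" is roughly what the self-reflection idea above amounts to in prompt form.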
I tried the letter-string analogies and got both a wrong answer and a wrong explanation:
Here is a possible solution to the riddle:
mno -> mnp, ijk -> ilk
The pattern here is to change the last letter of the first word to the next letter in the alphabet, and to change the second letter of the second word to the next letter in the alphabet.
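For comparison, the rule these puzzles are usually taken to intend is "replace the last letter with its alphabetic successor," which would give ijk -> ijl rather than ilk. A minimal sketch of that rule (the function name is mine, just for illustration):

# "Basic successor" rule: replace the last letter of the string with the
# next letter in the alphabet. No wrap-around is defined for 'z', which is
# exactly what makes the xyz example above interesting.
def successor_rule(s: str) -> str:
    return s[:-1] + chr(ord(s[-1]) + 1)

print(successor_rule("mno"))  # mnp
print(successor_rule("ijk"))  # ijl -- not ilk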
I read your book in French to better understand what AI can and cannot do; it was very interesting.
I was glad to read this detailed blog post. The Twitter posts were too short.
Thank you!
Really enjoyed this essay. We have so many articles discussing the technical details of autoregressive language models, as well as different metrics for evaluating training performance. However, there are still so many challenges around robust, common-sense benchmarks -- they are just hard to formalize. Anyway, I just wanted to say that this was a very refreshing read!
Thank you!