A Fun Puzzle

Here’s a fun puzzle for you. I’ll give you six words in an alien language: saa, guu, ree, fii, hoo, and muo. Figure 1 gives a diagram showing how either single words or combinations of words result in combinations of colored circles. Given the example sentences, what combination of colored circles should result from the query sentence?
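To get a concrete feel for the kind of grammar such a puzzle rests on, here is a minimal sketch of an interpreter for a toy grammar of the same flavor. The actual word meanings in Figure 1 are not recoverable from the text above, so the color assignments and function-word semantics below are invented purely for illustration.

```python
# Toy interpreter for an invented grammar of the same flavor as the puzzle.
# The word-to-color mapping and function-word meanings are assumptions made
# for illustration; they are NOT the mapping shown in Figure 1.

PRIMITIVES = {"saa": "RED", "guu": "BLUE", "ree": "GREEN"}  # invented

def interpret(sentence: str) -> list[str]:
    """Map a sentence to a sequence of colored circles.

    Invented function words:
      fii -- repeat the sequence built so far
      hoo -- reverse the sequence built so far
      muo -- a separator that adds no circles of its own
    """
    circles: list[str] = []
    for token in sentence.split():
        if token in PRIMITIVES:
            circles.append(PRIMITIVES[token])
        elif token == "fii":
            circles = circles * 2
        elif token == "hoo":
            circles = circles[::-1]
        elif token == "muo":
            pass
        else:
            raise ValueError(f"unknown word: {token}")
    return circles

if __name__ == "__main__":
    print(interpret("saa guu fii hoo"))  # ['BLUE', 'RED', 'BLUE', 'RED']
```

Solving the puzzle amounts to inferring a small set of rules like these from the example sentences alone, then applying them to the query.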
As a layperson I really enjoy your articles and always find them so interesting! Thank you for breaking things down to a level that makes sense. As an aside to the reviewed paper, I found this one
https://arxiv.org/abs/2309.03886 interesting and of a similar flavor, if not theme, in that it leverages LLMs to figure out a black-box function. Was wondering if it might be on the reading group's short list?
Thanks for the kind words and for the paper link!
Nice discussion of the paper! I had a similar take. One additional issue with the approach is that they don't really handle productivity.
Thanks for the brilliant summary. I had the paper on my to-read list and have now marked it as done. There are lots of works training models to solve meta-problems using in-context generalization, such as TabPFN, the work on linear regression with transformers, or even work on causal discovery using a similar approach.
Can we obtain systematic generalization to all distributions we care about by widening the problem-generating distribution, or does that just push the problem one step back? And where do agency, embodiment, and causality fit into this way of solving generalization?
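For readers unfamiliar with that line of work, here is a minimal, hypothetical sketch of the data side of such meta-training: each training episode is a fresh task drawn from a problem-generating distribution (here, random linear regression), and the model must predict the query label from the in-context examples alone. The dimensions, noise level, and episode format are illustrative assumptions, not the setup of TabPFN or any specific paper.

```python
# Sketch of episode generation for meta-training on a problem-generating
# distribution (in the spirit of TabPFN / in-context linear regression).
# All sizes and the noise level are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sample_episode(n_context=10, dim=3, noise=0.1):
    """Draw one random linear-regression task and an in-context episode.

    Returns context pairs (X, y) plus one held-out query point; a meta-trained
    model must predict the query's y from the context alone, with no gradient
    updates at test time.
    """
    w = rng.normal(size=dim)                   # the task: a random weight vector
    X = rng.normal(size=(n_context + 1, dim))  # context points plus one query
    y = X @ w + noise * rng.normal(size=n_context + 1)
    return (X[:-1], y[:-1]), (X[-1], y[-1])

# A meta-learner sees millions of such episodes, so what it learns is the
# distribution over tasks rather than any single task.
(context_X, context_y), (query_x, query_y) = sample_episode()
print(context_X.shape, query_x.shape)  # (10, 3) (3,)
```

The commenter's question is whether widening this generating distribution ever yields generalization beyond it, or only moves the boundary.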
I deeply appreciate your excellent analysis. In many ways it parallels an analysis by Lachter and Bever (https://www.researchgate.net/profile/Thomas-Bever/publication/19806078_The_relation_between_linguistic_structure_and_associative_theories_of_language_learning-A_constructive_critique_of_some_connectionist_learning_models/links/5bf9c4f292851ced67d5f474/The-relation-between-linguistic-structure-and-associative-theories-of-language-learning-A-constructive-critique-of-some-connectionist-learning-models.pdf).
As you point out, the choice to train on 80%-correct training examples suggests that something unexpected was going on to produce this result. Lachter and Bever point out that the choice of representation is also not neutral. They were concerned with a connectionist assertion that the model would learn the past-tense forms of verbs, and they pointed out that the choice of representations for the language controlled the results.
I can conceive of the possibility that there really is something to this meta-learning potential, but I would remain skeptical until we can convincingly rule out that the way the problem was translated into an embedding and the way the model was structured were themselves sufficient to explain the results.
Maybe what was needed wasn't exactly 'symbolic machinery'; perhaps a 'heuristic' was encoded in the training set or emerged from the training. That 80% result makes me think a heuristic might be involved. Blame Yannic for making me think really, really hard with this useful 'prank' of his.
https://youtu.be/rUf3ysohR6Q?si=oMHAzEgMzvEELzQP
Very exciting! Fundamental theories of thinking and upper ontologies merging to make sense of things for us seems to be 2024’s theme.
"But given the explicit training [in human-like error modes], I didn’t understand why this would [be] surprising, and I didn’t see what insights such results provide."
You're putting that more diplomatically than I would :-)
In the examples given, the correct statement is not “this is the solution”, but “this is one of the possible solutions”.
Indeed, you're right. The solution I gave assumes a particular underlying grammar, which is also assumed in the Lake & Baroni paper. But in principle there could be an infinite number of possible solutions.
The problem domain of the puzzles seems to be limited, so there is a good chance that the model saw the test examples during training. If the authors did not explicitly control for this, then it's a big weak spot in their conclusions. Just supplying random examples from the meta-grammar is not enough, because with a limited problem domain a large number of training examples will cover most of the domain even if the examples are drawn at random.
There is a simple, unlimited problem domain that can be used for testing compositionality, systematicity, and productivity: arithmetic problems. As far as I know, LLMs still fail on even simple arithmetic tasks when large numbers are involved (there are many possible combinations of digits, which cannot all be memorized by the LLM during training). It puzzles me why LLM researchers keep avoiding this problem. I think it's a huge elephant in the LLM room.
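A minimal sketch of how such arithmetic probes could be generated follows; the digit count, prompt wording, and exact-match grading rule are arbitrary illustrative choices, not a published benchmark.

```python
# Generate large-number addition probes of the kind the comment suggests:
# numbers long enough that the answers cannot plausibly have been memorized.
# The digit count, prompt wording, and grading rule are illustrative choices.
import random

random.seed(0)

def make_addition_probe(n_digits=15):
    """Return a (prompt, expected_answer) pair for one large addition."""
    a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    prompt = f"What is {a} + {b}? Answer with the number only."
    return prompt, str(a + b)

def grade(model_output: str, expected: str) -> bool:
    """Score a raw model reply by exact match on its digits."""
    digits = "".join(ch for ch in model_output if ch.isdigit())
    return digits == expected

for prompt, answer in (make_addition_probe() for _ in range(3)):
    print(prompt, "->", answer)
```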
Thanks for your comment. The authors did check to make sure the test examples were not in the training data.
For the record, my IQ is around 140. The test is problematic because it fails to provide enough clear examples to effectively test pattern recognition skills. It's more about asking participants how they arrived at their conclusions and what logic makes sense to them, which deviates from its original purpose and leads to various interpretations.
As a small natural language model, I totally failed the test and didn't even get a clue what the hell it is about. I owe an apology to the entire human race.
Melanie, apologies if you've written about this and I missed it, but have you published a position and plan for the situation with Substack platforming and monetizing nazis?
Thank you!
Really fascinating paper. I'm trying to find where they compared GPT-4 and got the 58% accuracy result. What I'm curious about is whether the same few-shot prompting approach was used. Also, it would be interesting to see the results of fine-tuning GPT-4 on the meta-grammar examples to see whether that improves the output.
It's in the supplementary information: https://cims.nyu.edu/~brenden/papers/LakeBaroniNatureSI.pdf
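For what a few-shot prompting setup for this kind of puzzle could look like, here is a hypothetical sketch of assembling study examples into a single prompt. The example sentences, circle names, and formatting are invented (reusing the toy grammar sketched near the top); this is not the prompt used for the paper's GPT-4 comparison, whose details are in the linked supplement.

```python
# Hypothetical few-shot prompt construction for an alien-word puzzle.
# The study examples, query, and formatting are invented for illustration
# and are not the prompt used in the paper's GPT-4 comparison.

STUDY_EXAMPLES = [              # (sentence, circles) pairs -- invented
    ("saa", "RED"),
    ("guu", "BLUE"),
    ("saa fii", "RED RED"),
    ("saa guu hoo", "BLUE RED"),
]
QUERY = "guu fii hoo saa"       # invented query sentence

def build_prompt(examples, query):
    """Concatenate study examples and the query into one few-shot prompt."""
    lines = ["Translate each sentence into a sequence of colored circles."]
    for sentence, circles in examples:
        lines.append(f"Sentence: {sentence}\nCircles: {circles}")
    lines.append(f"Sentence: {query}\nCircles:")
    return "\n\n".join(lines)

print(build_prompt(STUDY_EXAMPLES, QUERY))
```

The resulting string would be sent to the model as a single completion-style prompt; fine-tuning on meta-grammar episodes, as the comment suggests, would instead bake this study/query format into the training data.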
While the puzzles and methods are nice, I think the real headline is that GPT-4 gets 58%! This is pretty good "symbolic behaviour without symbols". It's much closer to "human-like" than the proposed method, because it doesn't require any special training on 100k examples. And OK, 58% is less than 80%, but this gap will probably disappear with the next large general-purpose model.
Enjoyed your public lecture on the future of AI - would it be possible to get a copy of your slides?
Yes, please email me.
I would email you if I could find your address … mine is martin.antony.walker@gmail.com