23 Comments

As a layperson I really enjoy your articles and always find them so interesting! Thank you for breaking things down to a level that makes sense. As an aside to the reviewed paper, I found this one, https://arxiv.org/abs/2309.03886, interesting and of a similar flavor, if not theme, in that it leverages LLMs to figure out a black-box function. Was wondering if it might be on the reading group's short list?

Thanks for the kind words and for the paper link!

Nice discussion of the paper! I had a similar take. One additional issue with the approach is that they don't really handle productivity.

Thanks for the brilliant summary. I had the paper on my to-read list and have now marked it as done. There are lots of works that train models to solve meta-problems using in-context generalization, such as TabPFN, the work on linear regression with transformers, or even works on causal discovery that use a similar approach.

Can we obtain systematic generalization to all the distributions we care about by widening the problem-generating distribution, or does that just push the problem one step back? And where do agency, embodiment, and causality fit into this way of solving generalization?

I deeply appreciate your excellent analysis. In many ways it parallels an analysis by Lachter and Bever (https://www.researchgate.net/profile/Thomas-Bever/publication/19806078_The_relation_between_linguistic_structure_and_associative_theories_of_language_learning-A_constructive_critique_of_some_connectionist_learning_models/links/5bf9c4f292851ced67d5f474/The-relation-between-linguistic-structure-and-associative-theories-of-language-learning-A-constructive-critique-of-some-connectionist-learning-models.pdf).

As you point out, the choice to train on 80%-correct training examples suggests that something unexpected was going on to produce this result. Lachter and Bever point out that the choice of representation is also not neutral. They were concerned with a connectionist assertion that the model would learn the past-tense forms of verbs, and they pointed out that the choice of representations for the language controlled the results.

I can conceive of the possibility that there really is something to this meta-learning potential, but I would remain skeptical until we can convincingly rule out that the way the problem was translated into an embedding, together with the way the model was structured, is enough by itself to explain the results.

Maybe it wasn't exactly 'symbolic machinery' that was needed; maybe a 'heuristic' was encoded in the training set or emerged from the training. That 80% achievement makes me think a heuristic might be involved. Blame it on Yannic for making me think really, really hard with this useful 'prank' of his.

https://youtu.be/rUf3ysohR6Q?si=oMHAzEgMzvEELzQP

Very exciting! Fundamental theories of thinking and upper ontologies merging to make sense of the world for us seems to be 2024’s theme.

"But given the explicit training [in human-like error modes], I didn’t understand why this would [be] surprising, and I didn’t see what insights such results provide."

You're putting that more diplomatically than I would :-)

In the examples given, the correct statement is not “this is the solution”, but “this is one of the possible solutions”.

Indeed, you're right. The solution I gave assumes a particular underlying grammar, which is also assumed in the Lake & Baroni paper. But in principle there could be an infinite number of possible solutions.

The problem domain of the puzzles seems to be limited, so there is a big chance that the model saw the test examples during training. If the authors did not explicitly control for this, then it's a big weak spot in their conclusions. Just supplying random examples from the meta-grammar is not enough, because with a limited problem domain a large number of training examples will cover most of it even if the examples are supplied randomly.

There is a simple, unlimited problem domain that can be used for testing compositionality, systematicity, and productivity: arithmetic problems. As far as I know, LLMs so far fail on even simple arithmetic tasks when large numbers are involved (there are too many possible combinations of digits for the LLM to memorize them all during training). It puzzles me why LLM researchers keep avoiding this problem. I think it's a huge elephant in the LLM room.
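
To make the suggestion concrete, here is a minimal sketch of such a probe (my own illustration, not from the paper; the query_llm callable is a hypothetical stand-in for whatever model API one uses): generate addition problems with enough digits that the answers are unlikely to have appeared verbatim in training, and watch how accuracy changes as the number of digits grows.

import random

def make_problem(n_digits):
    # Two random n-digit operands; the answer is computed exactly.
    a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    return f"What is {a} + {b}? Answer with the number only.", a + b

def accuracy(query_llm, n_digits, n_trials=100):
    # query_llm: hypothetical function taking a prompt string and returning the model's reply.
    correct = 0
    for _ in range(n_trials):
        prompt, answer = make_problem(n_digits)
        reply = query_llm(prompt)
        try:
            correct += int(reply.strip().replace(",", "")) == answer
        except ValueError:
            pass  # an unparsable reply counts as wrong
    return correct / n_trials

# Example usage (with some model wrapper `my_model`):
# for d in (2, 5, 10, 20):
#     print(d, accuracy(my_model, d))

If accuracy falls off sharply as the digit count grows, that is exactly the failure of productivity being pointed at here.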

Thanks for your comment. The authors did check to make sure the test examples were not in the training data.

For the record, my IQ is around 140. The test is problematic because it fails to provide enough clear examples to effectively test pattern recognition skills. It's more about asking participants how they arrived at their conclusions and what logic makes sense to them, which deviates from its original purpose and leads to various interpretations.

As a small natural language model, I totally failed the test and didn't even get a clue about what the hell it is about. I owe apologies to the entire human race.

Melanie, apologies if you've written about this and I missed it, but have you published a position and plan regarding Substack platforming and monetizing Nazis?

Thank you!

Really fascinating paper. I'm trying to find where they compared GPT-4 and got 58% accuracy. What I'm curious about is whether the same few-shot prompting approach was used or not. Also, it would be interesting to see the results of fine-tuning GPT-4 on the meta-grammar examples, to see whether that improves the output.

It's in the supplementary information: https://cims.nyu.edu/~brenden/papers/LakeBaroniNatureSI.pdf

While the puzzles and methods are nice, I think the real headline is that GPT-4 gets 58%! This is pretty good "symbolic behaviour without symbols". It's much closer to "human-like" than the proposed method, because it doesn't require any special training on 100k examples. And OK, 58% is less than 80%, but this gap will probably disappear with the next large general-purpose model.

Enjoyed your public lecture on the future of AI - would it be possible to get a copy of your slides?

Yes, please email me.

I would email you if I could find your address … mine is martin.antony.walker@gmail.com
