Discussion about this post

Ben Dickson

For me, the most important part is this: "when people focus on a specific target (e.g., top score on the ARC private evaluation set), they sometimes lose sight of what the benchmark is actually trying to measure (e.g., capacities for few-shot abstraction based on core knowledge concepts), and the methods developed to hit the target might miss the original motivation altogether."

The benchmark has "Abstract" in the title for a good reason. It is not supposed to be beaten through brute-force methods and sampling multiple answers, hoping one of them will work.

Nathan Lambert

Pretty easy to avoid these pitfalls in my future evaluation projects.

1. Add $1 million in prize money

2. Name it AGI

Hah. Nice piece!
