Discussion about this post

Felipe A. Zubia

Dr. Mitchell, great article and as usual, your accomplishments here are understated.

Your perturbation results suggest that analogy in current systems is conditionally evoked rather than structurally grounded, which helps explain why accuracy alone systematically overstates competence. That distinction feels like the missing invariant in much of benchmark discourse, and your analysis pins down empirically where that boundary actually sits.

Narrow benchmarks themselves aren’t the problem; systems improve along the dimensions that are incentivized, so the behavior we observe ultimately reflects which benchmarks are rewarded.

In my own work, I’ve seen a similar boundary in what I think of as the Alex-the-parrot regime: fluent symbolic behavior without an abstraction layer that survives representational re-encoding. Coherence improves markedly when systems are designed around architectural constraints that distinguish surface competence from structurally grounded capacity.

Paul Topping

Just FYI, your talk is available to me via the link you gave even though I am NOT registered at NeurIPS 2025. Thanks.

