

Researchers developed a new test for AI that requires planning and predicting changes in a simulated world. When tested, the top LLMs were all beaten by average humans.

We found that humans outperform the models, and scaling compute improves performance only in some environments but not others.

We evaluate 517 human participants and three state-of-the-art reasoning models (Anthropic Claude, OpenAI o3, and Google Gemini 2.5 Pro) on AutumnBench. We analyze their interaction trajectories and performance on challenge tasks, revealing substantial headroom for the reasoning models.

“revealing substantial headroom” is a polite way of saying they failed miserably.

[–] 1 pt

Now you have me interested in "average humans"

[–] 1 pt

If you’re wondering how they chose the study participants, this is what they said:

We recruited 517 English-speaking participants via Prolific. To ensure the quality of our sample, we included only individuals who were not color blind and who successfully passed both attention and comprehension checks (Muszyński, 2023).

It looks like an online service where people sign up to participate in studies for money. I don’t know how Prolific vets its participants, but whoever was in the study group, they were at least intelligent enough to understand English.