Researchers developed a new test for AI that requires planning and predicting changes in a simulated world. When tested, the top LLMs were all beaten by average humans.
> We found that humans outperform the models, and scaling compute improves performance only in some environments but not others.
…
> We evaluate 517 human participants and three state-of-the-art reasoning models (Anthropic Claude, OpenAI o3, and Google Gemini 2.5 Pro) on AutumnBench. We analyze their interaction trajectories and performance on challenge tasks, revealing substantial headroom for the reasoning models.
“revealing substantial headroom” is a polite way of saying they failed miserably.