Researchers developed a new test for AI that requires planning and predicting changes in a simulated world. When tested, the top LLMs were all beaten by average humans.
> We found that humans outperform the models, and scaling compute improves performance only in some environments but not others.
…
> We evaluate 517 human participants and three state-of-the-art reasoning models (Anthropic Claude, OpenAI o3, and Google Gemini 2.5 Pro) on AutumnBench. We analyze their interaction trajectories and performance on challenge tasks, revealing substantial headroom for the reasoning models.
“revealing substantial headroom” is a polite way of saying they failed miserably.