A global initiative called “Humanity’s Last Exam” is seeking challenging, expert-level questions to measure the progress of artificial intelligence, as today’s popular benchmarks have become trivial for the most capable models.
On Monday, a coalition of tech experts made a global appeal for the most difficult questions to present to artificial intelligence systems, which have increasingly breezed through well-known benchmark tests.
The initiative, referred to as “Humanity’s Last Exam,” aims to pinpoint the arrival of true expert-level AI. Organizers from a non-profit called the Center for AI Safety (CAIS) and the startup Scale AI hope to keep the project relevant as capabilities continue to evolve in the years to come.
This call for questions follows the recent announcement by OpenAI, the maker of ChatGPT, of a new model called OpenAI o1, which reportedly “destroyed the most popular reasoning benchmarks,” according to Dan Hendrycks, executive director of CAIS and an advisor to Elon Musk’s xAI startup.
Hendrycks co-authored two papers in 2021 proposing tests for AI systems that have since gained wide traction: one assessed undergraduate-level knowledge in subjects such as U.S. history, while the other evaluated models’ reasoning abilities using competition-level math. The undergraduate exam has been downloaded from the AI platform Hugging Face more often than any other dataset of its kind.
At the time those papers were released, AI models were providing nearly random answers. “They’ve now been trounced,” Hendrycks told Reuters.
For example, the Claude models developed by AI lab Anthropic improved dramatically, rising from a score of approximately 77% on the undergraduate-level test in 2023 to nearly 89% the following year, according to a prominent capabilities ranking.
Even so, AI has yet to match human intelligence across the board. According to Stanford University’s AI Index report published in April, AI still struggles on less widely used tests involving planning and visual pattern-recognition puzzles. OpenAI o1, for instance, scored around 21% on one version of the pattern-recognition ARC-AGI test, the ARC organizers reported on Friday.
Some AI researchers argue that such results show planning and abstract reasoning to be better measures of intelligence. Hendrycks, however, said the visual aspects of ARC make it less suitable for assessing language models. “Humanity’s Last Exam” will require abstract reasoning regardless, he said.
Industry experts have noted that answers to widely used benchmarks may have ended up in the data used to train AI systems. To prevent models from simply regurgitating memorized answers on “Humanity’s Last Exam,” Hendrycks said some of its questions will remain confidential.
The exam is intended to include at least 1,000 crowd-sourced questions by November 1 that are difficult for non-experts to answer. Submissions will undergo peer review, with winning entries earning co-authorship and prizes of up to $5,000 sponsored by Scale AI.
“There is an urgent need for more challenging tests for expert-level models to keep track of the rapid evolution of AI,” stated Alexandr Wang, the CEO of Scale AI.
One key restriction: the organizers will not accept any questions about weapons, which some believe would be too dangerous for AI to study.