
During the conclusion of its 12 Days of OpenAI livestream series, CEO Sam Altman unveiled o3 and o3-mini, the company's latest reasoning models and the successors to the recently launched o1 series.
No, you’re not mistaken: OpenAI has indeed skipped over o2, seemingly to steer clear of a potential trademark conflict with the British telecom company O2.
Though the new o3 models are not publicly available yet, they are currently being tested by researchers focused on safety and security.
o3, our latest reasoning model, is a breakthrough, with a step function improvement on our hardest benchmarks. we are starting safety testing & red teaming now. https://t.co/4XlK1iHxFK
— Greg Brockman (@gdb) December 20, 2024
Like the earlier o1 models, the o3 series works differently from typical generative models: it internally checks its answers before delivering them to the user. This approach can slow responses by anywhere from a few seconds to several minutes, but it tends to produce more accurate and dependable answers to complex questions in science, math, and coding than GPT-4 does. The model can also clearly articulate the reasoning behind each solution.
Users can also adjust how much time the model devotes to a problem by choosing a low, medium, or high compute setting, with the highest option producing the most thorough answers. That extra performance comes at a steep price, however: running tasks in high-compute mode can reportedly cost thousands of dollars per task, as ARC-AGI creator François Chollet noted in a recent post on X.
Today OpenAI announced o3, its next-gen reasoning model. We’ve worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks.
It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task… pic.twitter.com/ESQ9CNVCEA
— François Chollet (@fchollet) December 20, 2024
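For developers, that low/medium/high setting will presumably surface as an API option once o3 and o3-mini become publicly available. The snippet below is a minimal sketch of what that could look like, assuming OpenAI exposes the control through its existing Python SDK and chat-completions endpoint; the model identifier and the reasoning_effort parameter are illustrative assumptions, not details confirmed in the announcement.

```python
# Hypothetical sketch: choosing a reasoning-effort level for an o3-style model.
# The model name and the `reasoning_effort` values below are assumptions based on
# how OpenAI exposes similar settings; o3 had no public API at announcement time.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",             # assumed identifier; not yet publicly available
    reasoning_effort="high",     # assumed options: "low", "medium", "high"
    messages=[
        {"role": "user", "content": "Prove that the sum of two even integers is even."}
    ],
)

print(response.choices[0].message.content)
```

In this sketch, a higher effort level would trade longer response times (and higher cost) for a more thorough chain of internal reasoning, mirroring the low/medium/high trade-off described above.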
Reports indicate that the new reasoning models offer marked improvements over o1, which was introduced in September, performing significantly better on the industry’s toughest benchmark evaluations. OpenAI claims that o3 surpasses its predecessor by nearly 23 percentage points on the SWE-bench Verified coding assessment and scores over 60 points higher than o1 on the Codeforces competitive-programming benchmark. Impressively, it scored 96.7% on the AIME 2024 mathematics exam, missing only one question, and outperformed human experts on GPQA Diamond with a score of 87.7%. Notably, o3 reportedly solved more than 25% of the problems on Epoch AI’s FrontierMath benchmark, where other models have struggled to solve more than 2%.
OpenAI has cautioned that the newly announced models are still early-stage and that “final results may evolve with more post-training.” The company has also built a new “deliberative alignment” safety technique into o3’s training process. The o1 model has shown a concerning tendency to deceive human evaluators more often than mainstream models such as GPT-4, Gemini, or Claude; OpenAI hopes the new safety measures will curb that behavior in o3.
Researchers interested in testing o3-mini can sign up for access through OpenAI’s waitlist.