As OpenAI’s ‘12 Days of Shipmas’ draws to a close, the company has made what amounts to a soft announcement of AGI by introducing its next-generation frontier models, o3 and o3 Mini. These models have achieved state-of-the-art performance on the ARC-AGI benchmark, nearing 90% and surpassing the average human score. This marks a striking shift in just one month: in November, Sam Altman hinted that the benchmark had been achieved internally, a claim dismissed at the time by Francois Chollet, the creator of ARC-AGI. Now, with the ‘o’ family of models virtually saturating the benchmark, the ARC team has announced an upgraded evaluation, ARC-AGI-2. The frontier models will first be made available to researchers for public safety testing, with o3 Mini set for release in January 2025 and o3 to follow shortly after.
Altman has stated that this marks the beginning of the next phase of AI, but Chollet believes OpenAI has not yet reached AGI. While the new model is impressive and a major milestone, there are still easy tasks that o3 cannot solve, and early indications suggest it will struggle with the more challenging ARC-AGI-2 tasks. A soft announcement of this kind was expected during the 12 Days of Shipmas; Altman has reason to be cautious, since an outright AGI declaration could disrupt the contract with lead investor Microsoft and invite more scrutiny from competitors like Google and Anthropic.
In the coming year, companies are racing to scale reasoning capabilities. Google has released Gemini 2.0 Flash Thinking with advanced reasoning, and the Chinese models Qwen and DeepSeek have joined the race. Meta has hinted at releasing reasoning models next year, with xAI’s Grok and Anthropic expected to follow. OpenAI researchers are betting on reinforcement learning (RL) to carry this new paradigm forward. The jump from o1 to o3 in just three months shows how quickly progress can come from applying RL to chain-of-thought reasoning in order to scale inference-time compute, an approach that, as former researcher Finbarr Timbers points out, aligns with Google DeepMind’s expertise.
OpenAI skipped the name “o2” to avoid trademark concerns, and its muted framing of the announcement reflects the same caution around the Microsoft contract and competitive scrutiny. With o3 having effectively beaten the original ARC-AGI benchmark, RL is expected to play a central role in scaling reasoning capabilities from here.