Topic: "cost-analysis"
new Gemini 3 Deep Think, Anthropic $30B @ $380B, GPT-5.3-Codex Spark, MiniMax M2.5
gemini-3-deep-think-v2 arc-agi-2 google-deepmind google geminiapp arcprize benchmarking reasoning test-time-adaptation fluid-intelligence scientific-computing engineering-workflows 3d-modeling cost-analysis demishassabis sundarpichai fchollet jeffdean oriolvinyalsml tulseedoshi
Google DeepMind is rolling out the upgraded Gemini 3 Deep Think V2 reasoning mode to Google AI Ultra subscribers and opening early access to the Vertex AI / Gemini API for select users. Reported benchmark results include 84.6% on ARC-AGI-2, 48.4% on Humanity’s Last Exam (HLE) without tools, and a Codeforces Elo of 3455, along with Olympiad-level performance in physics and chemistry. The mode emphasizes practical scientific and engineering applications such as error detection in math papers, physical system modeling, semiconductor optimization, and a sketch-to-CAD/STL pipeline for 3D printing. ARC benchmark creator François Chollet highlights the benchmark's role in advancing test-time adaptation and fluid intelligence, and projects human-AI parity around 2030. The rollout is framed as a productized, compute-heavy test-time mode rather than a lab demo, and it includes cost disclosures for ARC tasks.
not much happened today
gemini-2.5-flash gemini-2.0-flash mistral-medium-3 llama-4-maverick claude-3.7-sonnet qwen3 pangu-ultra-moe deepseek-r1 o4-mini x-reasoner google-deepmind mistral-ai alibaba huawei openai microsoft deepseek model-performance reasoning cost-analysis reinforcement-learning chain-of-thought multilinguality code-search model-training vision model-integration giffmana artificialanlys teortaxestex akhaliq john__allard
Gemini 2.5 Flash shows a 12-point increase in the Artificial Analysis Intelligence Index but costs 150x more than Gemini 2.0 Flash, because its output tokens are 9x more expensive and it uses 17x more tokens when reasoning. Mistral Medium 3 competes with Llama 4 Maverick, Gemini 2.0 Flash, and Claude 3.7 Sonnet, offering better coding and math reasoning at a significantly lower price. Alibaba's Qwen3 family supports reasoning and multilingual tasks across 119 languages and includes a Web Dev tool for app building. Huawei's Pangu Ultra MoE matches DeepSeek R1 performance on Ascend NPUs, with new compute and upcoming V4 training. OpenAI's o4-mini now supports Reinforcement Fine-Tuning (RFT) using chain-of-thought reasoning. Microsoft's X-REASONER enables generalizable reasoning across modalities through post-training on general-domain text. Deep research integration with GitHub repos in ChatGPT enhances codebase search and reporting. The AI Engineer World's Fair offers an Early Bird discount for upcoming tickets.
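The 150x figure for Gemini 2.5 Flash can be sanity-checked by multiplying the two quoted factors. The sketch below is only a rough illustration: it assumes the comparison is dominated by output-token spend, it uses the rounded 9x and 17x multipliers from the summary, and the names `relative_cost`, `OUTPUT_PRICE_RATIO`, and `TOKEN_USAGE_RATIO` are hypothetical, not taken from Artificial Analysis.

```python
# Rounded multipliers quoted in the summary
# (Gemini 2.5 Flash relative to Gemini 2.0 Flash).
OUTPUT_PRICE_RATIO = 9   # output tokens cost about 9x more per token
TOKEN_USAGE_RATIO = 17   # about 17x more tokens are generated when reasoning


def relative_cost(price_ratio: float, usage_ratio: float) -> float:
    """Relative output spend: per-token price ratio times token-count ratio.

    Assumption: input-token cost is ignored, so this is only an estimate.
    """
    return price_ratio * usage_ratio


combined = relative_cost(OUTPUT_PRICE_RATIO, TOKEN_USAGE_RATIO)
print(f"Combined cost multiplier: about {combined:.0f}x")
```

The product is 153, which is consistent with the roughly 150x quoted above once the inputs are treated as rounded values.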