Person: "tulseedoshi"
new Gemini 3 Deep Think, Anthropic $30B @ $380B, GPT-5.3-Codex Spark, MiniMax M2.5
gemini-3-deep-think-v2 arc-agi-2 google-deepmind google geminiapp arcprize benchmarking reasoning test-time-adaptation fluid-intelligence scientific-computing engineering-workflows 3d-modeling cost-analysis demishassabis sundarpichai fchollet jeffdean oriolvinyalsml tulseedoshi
Google DeepMind is rolling out the upgraded Gemini 3 Deep Think V2 reasoning mode to Google AI Ultra subscribers and opening early access through the Vertex AI / Gemini API to select users. Reported benchmark results include 84.6% on ARC-AGI-2, 48.4% on Humanity’s Last Exam (HLE) without tools, and a Codeforces Elo of 3455, alongside Olympiad-level performance in physics and chemistry. The mode emphasizes practical scientific and engineering applications such as error detection in math papers, physical system modeling, semiconductor optimization, and a sketch-to-CAD/STL pipeline for 3D printing. ARC benchmark creator François Chollet highlights the benchmark's role in advancing test-time adaptation and fluid intelligence, and projects human-AI parity around 2030. The rollout is framed as a productized, compute-heavy test-time mode rather than a lab demo, and includes cost disclosures for ARC tasks.
Qwen-Image: SOTA text rendering + 4o-imagegen-level Editing Open Weights MMDiT
qwen-image mmdit gemini-2.5 o3-pro seedprover glm-4.5 xbai-o4 hunyuan alibaba google-deepmind openai bytedance kaggle tencent bilingual-text-rendering image-generation image-editing synthetic-data reasoning math-theorem-proving benchmarking instruction-following model-efficiency open-weight-models model-transparency competitive-evaluation swyx demishassabis tulseedoshi mparakhin teortaxestex cgeorgiaw dorialexander steph_palazzolo corbtt synthwavedd epochairesearch
Alibaba surprised observers with the release of Qwen-Image, a 20B MMDiT model that excels at bilingual text rendering and graphic poster creation, with open weights and demos available. Google DeepMind launched Gemini 2.5 Deep Think to Ultra subscribers, showing significant reasoning improvements and benchmark gains (+11.2% AIME, +13.2% HLE, +13.4% LiveCodeBench) that rival OpenAI's o3 Pro. ByteDance's SeedProver achieved state-of-the-art math theorem proving results, surpassing DeepMind's AlphaGeometry2. OpenAI is developing a "universal verifier" aimed at transferring the gains seen in math and coding. Competitive reasoning benchmarks and game arenas from Google and Kaggle highlight a meta-shift in reasoning model efficiency, comparable to the original Transformer leap. Other open-weight models gaining momentum include GLM-4.5, XBai o4, and Tencent Hunyuan, with a focus on efficient training. "Qwen is all you need."