Topic: "autoregressive-models"
not much happened today
cosmos-3 nemotron-3-ultra minimax-m3 nvidia runway novita vercel cloudflare openclaude flowith omnimodal-models mixture-of-experts autoregressive-models diffusion-models structured-prompts fine-tuning open-weight-models multimodality agent-models benchmarking model-serving context-windows token-efficiency kimmonismus clementdelangue artificialanalysis scaling01 ctnzr caspar_br eliebakouch pbdtokenrouter rauchg gitlawb notjazii lostinlatencyx zhihufrontier
NVIDIA led open-source AI model releases with Cosmos 3, an omnimodal world model that unifies language, image, video, audio, and action using a Mixture-of-Transformers design, and Nemotron 3 Ultra, a 550B-parameter open-weight model noted for high serving speed and strong evaluation performance. The Cosmos Coalition was launched to foster an open ecosystem for physical AI world models. Meanwhile, MiniMax M3 debuted as a multimodal agent/coding model with a 1M-token context window and strong benchmark scores, gaining rapid ecosystem support from vendors like Novita and Vercel AI Gateway. However, MiniMax M3 also showed inefficiencies, including high token consumption and verbose self-check loops. These developments highlight advances in open physical AI, multimodality, and agent models, with significant community and infrastructure engagement.
DeepSeek-OCR finds vision models can decode 10x more efficiently with ~97% accuracy of text-only, ~200k pages/day per A100 (~33M pages/day on 20 nodes)
deepseek-ocr deepseek3b-moe-a570m veo-3.1 deepseek-ai google-deepmind krea ocr vision multimodality model-compression long-context model-architecture video-generation autoregressive-models model-efficiency precision-editing karpathy teortaxestex reach_vb _akhaliq eliebakouch vikhyatk demishassabis
As ICCV 2025 begins, DeepSeek releases DeepSeek-OCR, a novel 3B MoE vision-language model that compresses long text as visual context with high accuracy and efficiency, challenging traditional tokenization approaches. The model achieves ~97% decoding precision at <10× compression and processes up to ~33M pages/day on 20 A100-40G nodes, outperforming baselines such as GOT-OCR2.0. Discussions highlight the potential for unlimited context windows and tokenization-free inputs, with contributions from @karpathy, @teortaxesTex, and others. In video generation, Google DeepMind's Veo 3.1 leads community benchmarks with advanced precision editing and scene blending, while Krea open-sources a 14B autoregressive video model enabling real-time long-form generation at ~11 FPS on a single B200 GPU.
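As a rough sanity check, the cluster-level figure (~33M pages/day on 20 nodes) can be related to the per-A100 figure in the headline. The sketch below is only back-of-the-envelope arithmetic: the value of 8 GPUs per node is an assumption not stated in the summary, and the function names are hypothetical helpers, not part of any DeepSeek tooling. It also illustrates what a given compression ratio means in terms of vision tokens per text token.

```python
import math

# Figures quoted in the summary above.
PAGES_PER_DAY_CLUSTER = 33_000_000  # ~33M pages/day
NODES = 20                          # 20 A100-40G nodes

# ASSUMPTION: GPUs per node is not given in the summary; 8 is a common
# configuration and is used here purely for illustration.
GPUS_PER_NODE = 8


def pages_per_gpu_per_day(cluster_pages: int, nodes: int, gpus_per_node: int) -> float:
    """Average pages processed per GPU per day across the cluster."""
    return cluster_pages / (nodes * gpus_per_node)


def vision_tokens_needed(text_tokens: int, compression_ratio: float) -> int:
    """Vision tokens needed to carry `text_tokens` at a given compression ratio (rounded up)."""
    return math.ceil(text_tokens / compression_ratio)


if __name__ == "__main__":
    per_gpu = pages_per_gpu_per_day(PAGES_PER_DAY_CLUSTER, NODES, GPUS_PER_NODE)
    print(f"{per_gpu:,.0f} pages/day per GPU")
    print(vision_tokens_needed(1000, 10), "vision tokens for 1000 text tokens at 10x")
```

Under the 8-GPUs-per-node assumption, the per-GPU rate works out to roughly 200k pages/day, which matches the order of magnitude of the per-A100 figure in the headline.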
Gemini 2.5 Pro + 4o Native Image Gen
gemini-2.5-pro gpt-4o google-deepmind openai lmarena_ai autoregressive-models multimodality reasoning coding instruction-following model-release leaderboards noam-shazeer allan-jabri gabe-goh
Gemini 2.5 Pro from Google DeepMind has become the new top-ranked model on the LMArena leaderboard, surpassing Grok 3 by 40 points, with contributions from Noam Shazeer integrating Flash Thinking techniques. It excels in reasoning, coding, STEM, multimodal tasks, and instruction following, and is available as a free, rate-limited experimental model via Google AI Studio and the Gemini App. Meanwhile, OpenAI released GPT-4o native image generation, an autoregressive image generation model, with detailed insights shared by Allan Jabri and credits to Gabe Goh.