All tags
Topic: "model-releases"
Grok 4: xAI succeeds in going from 0 to new SOTA LLM in 2 years
grok-4 grok-4-heavy claude-4-opus xai perplexity-ai langchain cursor cline model-releases benchmarking long-context model-pricing model-integration voice performance scaling gpu-optimization elonmusk aravsrinivas igor_babuschkin yuchenj_uw
xAI launched Grok 4 and Grok 4 Heavy, large language models rumored to have 2.4 trillion parameters and trained with 100x more compute than Grok 2 on 100k H100 GPUs. Grok 4 achieved new state-of-the-art results on benchmarks like ARC-AGI-2 (15.9%), HLE (50.7%), and Vending-Bench, outperforming models such as Claude 4 Opus. The model supports a 256K context window and is priced at $3.00/M input tokens and $15.00/M output tokens. It is integrated into platforms like Cursor, Cline, LangChain, and Perplexity Pro/Max. The launch was accompanied by a controversial voice mode and sparked industry discussion about xAI's rapid development pace, with endorsements from figures like Elon Musk and Arav Srinivas.
SmolLM3: the SOTA 3B reasoning open source LLM
smollm3-3b olmo-3 grok-4 claude-4 claude-4.1 gemini-nano hunyuan-a13b gemini-2.5 gemma-3n qwen2.5-vl-3b huggingface allenai openai anthropic google-deepmind mistral-ai tencent gemini alibaba open-source small-language-models model-releases model-performance benchmarking multimodality context-windows precision-fp8 api batch-processing model-scaling model-architecture licensing ocr elonmusk mervenoyann skirano amandaaskell clementdelangue loubnabenallal1 awnihannun swyx artificialanlys officiallogank osanseviero cognitivecompai aravsrinivas
HuggingFace released SmolLM3-3B, a fully open-source small reasoning model with open pretraining code and data, marking a high point in open source models until Olmo 3 arrives. Grok 4 was launched with mixed reactions, while concerns about Claude 4 nerfs and an imminent Claude 4.1 surfaced. Gemini Nano is now shipping in Chrome 137+, enabling local LLM access for 3.7 billion users. Tencent introduced Hunyuan-A13B, an 80B parameter model with a 256K context window running on a single H200 GPU. The Gemini API added a batch mode with 50% discounts on 2.5 models. MatFormer Lab launched tools for custom-sized Gemma 3n models. Open source OCR models like Nanonets-OCR-s and ChatDOC/OCRFlux-3B derived from Qwen2.5-VL-3B were highlighted, with licensing discussions involving Alibaba.
not much happened today
o3-mini o1-mini llama hunyuan-a13b ernie-4.5 ernie-4.5-21b-a3b qwen3-30b-a3b gemini-2.5-pro meta-ai-fair openai tencent microsoft baidu gemini superintelligence ai-talent job-market open-source-models multimodality mixture-of-experts quantization fp8-training model-benchmarking model-performance model-releases api model-optimization alexandr_wang shengjia_zhao jhyuxm ren_hongyu shuchaobi saranormous teortaxesTex mckbrando yuchenj_uw francoisfleuret quanquangu reach_vb philschmid
Meta has poached top AI talent from OpenAI, including Alexandr Wang joining as Chief AI Officer to work towards superintelligence, signaling a strong push for the next Llama model. The AI job market shows polarization with high demand and compensation for top-tier talent, while credentials like strong GitHub projects gain importance. The WizardLM team moved from Microsoft to Tencent to develop open-source models like Hunyuan-A13B, highlighting shifts in China's AI industry. Rumors suggest OpenAI will release a new open-source model in July, potentially surpassing existing ChatGPT models. Baidu open-sourced multiple variants of its ERNIE 4.5 model series, featuring advanced techniques like 2-bit quantization, MoE router orthogonalization loss, and FP8 training, with models ranging from 0.3B to 424B parameters. Gemini 2.5 Pro returned to the free tier of the Gemini API, enabling developers to explore its features.
OpenAI releases Deep Research API (o3/o4-mini)
o3-deep-research o4-mini-deep-research gemma-3n flux-1-kontext-dev gpt-4o alphagenome openai google black-forest-labs deepmind sakana-ai higgsfield-ai huggingface ollama multimodality model-releases agentic-ai reinforcement-learning instruction-following model-architecture model-optimization image-generation biological-ai multi-agent-systems model-integration demishassabis hardmaru osanseviero clementdelangue
OpenAI has launched the Deep Research API featuring powerful models o3-deep-research and o4-mini-deep-research with native support for MCP, Search, and Code Interpreter, enabling advanced agent capabilities including multi-agent setups. Google released Gemma 3n, a multimodal model optimized for edge devices with only 3GB RAM, achieving a top score of 1300 on LMSys Arena, featuring the new MatFormer architecture and broad ecosystem integration. Black Forest Labs introduced FLUX.1 Kontext [dev], a 12B parameter rectified flow transformer for instruction-based image editing, comparable to GPT-4o. DeepMind unveiled AlphaGenome, an AI model capable of reading 1 million DNA bases for gene function prediction, marking a breakthrough in AI biology. Sakana AI presented Reinforcement-Learned Teachers (RLTs) to enhance LLM reasoning, achieving 86.1% on MiniF2F with efficient compute. Higgsfield AI released Higgsfield Soul, a high-aesthetic photo model with 50+ presets for fashion-grade realism. Additionally, Google launched the Gemini CLI, an open-source AI agent for terminal use with free Gemini 2.5 Pro requests.
Gemini 2.5 Pro/Flash GA, 2.5 Flash-Lite in Preview
gemini-2.5 gemini-2.5-flash-lite gemini-2.5-flash gemini-2.5-pro gemini-2.5-ultra kimi-dev-72b nanonets-ocr-s ii-medical-8b-1706 jan-nano deepseek-r1 minimax-m1 google moonshot-ai deepseek cognitivecompai kling-ai mixture-of-experts multimodality long-horizon-planning benchmarking coding-performance long-context ocr video-generation model-releases tulsee_doshi oriolvinyalsml demishassabis officiallogank _philschmid swyx sainingxie scaling01 gneubig clementdelangue mervenoyann
Gemini 2.5 models are now generally available, including the new Gemini 2.5 Flash-Lite, Flash, Pro, and Ultra variants, featuring sparse Mixture-of-Experts (MoE) transformers with native multimodal support. A detailed 30-page tech report highlights impressive long-horizon planning demonstrated by Gemini Plays Pokemon. The LiveCodeBench-Pro benchmark reveals frontier LLMs struggle with hard coding problems, while Moonshot AI open-sourced Kimi-Dev-72B, achieving state-of-the-art results on SWE-bench Verified. Smaller specialized models like Nanonets-OCR-s, II-Medical-8B-1706, and Jan-nano show competitive performance, emphasizing that bigger models are not always better. DeepSeek-r1 ties for #1 in WebDev Arena, and MiniMax-M1 sets new standards in long-context reasoning. Kling AI demonstrated video generation capabilities.
Reasoning Price War 2: Mistral Magistral + o3's 80% price cut + o3-pro
o3 o3-pro gpt-4.1 claude-4-sonnet gemini-2.5-pro magistral-small magistral-medium mistral-small-3.1 openai anthropic google-deepmind mistral-ai perplexity-ai reasoning token-efficiency price-cut benchmarking open-source model-releases context-windows gpu-optimization swyx sama scaling01 polynoamial nrehiew_ kevinweil gdb flavioad stevenheidel aravsrinivas
OpenAI announced an 80% price cut for its o3 model, making it competitively priced with GPT-4.1 and rivaling Anthropic's Claude 4 Sonnet and Google's Gemini 2.5 Pro. Alongside, o3-pro was released as a more powerful and reliable variant, though early benchmarks showed mixed performance relative to cost. Mistral AI launched its Magistral reasoning models, including an open-source 24B parameter version optimized for efficient deployment on consumer GPUs. The price reduction and new model releases signal intensified competition in reasoning-focused large language models, with notable improvements in token efficiency and cost-effectiveness.
not much happened today
deepseek-r1-0528 pali-gemma-2 gemma-3 shieldgemma-2 txgemma gemma-3-qat gemma-3n-preview medgemma dolphingemma signgemma claude-4 opus-4 claude-sonnet-4 codestral-embed bagel qwen nemotron-cortexa gemini-2.5-pro deepseek-ai huggingface gemma claude bytedance qwen nemotron sakana-ai-labs benchmarking model-releases multimodality code-generation model-performance long-context reinforcement-learning model-optimization open-source yuchenj_uw _akhaliq clementdelangue osanseviero alexalbert__ guillaumelample theturingpost lmarena_ai epochairesearch scaling01 nrehiew_ ctnzr
DeepSeek R1 v2 model released with availability on Hugging Face and inference partners. The Gemma model family continues prolific development including PaliGemma 2, Gemma 3, and others. Claude 4 and its variants like Opus 4 and Claude Sonnet 4 show top benchmark performance, including new SOTA on ARC-AGI-2 and WebDev Arena. Codestral Embed introduces a 3072-dimensional code embedder. BAGEL, an open-source multimodal model by ByteDance, supports reading, reasoning, drawing, and editing with long mixed contexts. Benchmarking highlights include Nemotron-CORTEXA topping SWEBench and Gemini 2.5 Pro performing on VideoGameBench. Discussions on random rewards effectiveness focus on Qwen models. "Opus 4 NEW SOTA ON ARC-AGI-2. It's happening - I was right" and "Claude 4 launch has dev moving at a different pace" reflect excitement in the community.
Gemini's AlphaEvolve agent uses Gemini 2.0 to find new Math and cuts Gemini cost 1% — without RL
gemini gpt-4.1 gpt-4o-mini o3 o4-mini google-deepmind openai algorithm-discovery coding-agents matrix-multiplication optimization reinforcement-learning model-weights training-efficiency safety-evaluations instruction-following coding-tasks model-releases _philschmid scott_swingle alex_dimakis henry jason_wei kevinweil michpokrass scaling01 gdb
Deepmind's AlphaEvolve, a 2025 update to AlphaTensor and FunSearch, is a Gemini-powered coding agent for algorithm discovery that designs faster matrix multiplication algorithms, solves open math problems, and improves data center and AI training efficiency. It achieves a 23% faster kernel speedup in Gemini training and surpasses state-of-the-art on 20% of applied problems, including improvements on the Minimum Overlap Problem and Kissing number problem. Unlike Deep-RL, it optimizes code pieces rather than model weights. Meanwhile, OpenAI released GPT-4.1 in ChatGPT, specializing in coding and instruction following, with a faster alternative GPT-4.1 mini replacing GPT-4o mini for all users. OpenAI also launched the Safety Evaluations Hub and the OpenAI to Z Challenge using o3/o4 mini and GPT-4.1 models to discover archaeological sites. "Maybe midtrain + good search is all you need for AI for scientific innovation" - Jason Wei.
Granola launches team notes, while Notion launches meeting transcription
gpt-4.1 gpt-4o-mini gpt-4.1-mini claude-opus claude-sonnet claude-o3 qwen3 seed1.5-vl llama-4 am-thinking-v1 openai anthropic alibaba meta-ai-fair huggingface granola coding instruction-following benchmarking model-releases reasoning image-generation collaborative-software model-performance kevinweil scaling01 steph_palazzolo andersonbcdefg reach_vb yuchenj_uw qtnx_ _akhaliq risingsayak
GPT-4.1 is now available in ChatGPT for Plus, Pro, and Team users, focusing on coding and instruction following, with GPT 4.1 mini replacing GPT 4o mini. Anthropic is releasing new Claude models including Claude Opus and Claude Sonnet, though some criticism about hallucinations in Claude O3 was noted. Alibaba shared the Qwen3 Technical Report with strong benchmark results from Seed1.5-VL. Meta FAIR announced new models and datasets but faced criticism on Llama 4. AM-Thinking-v1 launched on Hugging Face as a 32B scale reasoning model. Granola raised $43M in Series B and launched Granola 2.0 with a Notion-like UI. The AI ecosystem shows rapid iteration and cloning of ideas, emphasizing execution and distribution.
ChatGPT responds to GlazeGate + LMArena responds to Cohere
qwen3-235b-a22b qwen3 qwen3-moe llama-4 openai cohere lm-arena deepmind x-ai meta-ai-fair alibaba vllm llamaindex model-releases model-benchmarking performance-evaluation open-source multilinguality model-integration fine-tuning model-optimization joannejang arankomatsuzaki karpathy sarahookr reach_vb
OpenAI faced backlash after a controversial ChatGPT update, leading to an official retraction admitting they "focused too much on short-term feedback." Researchers from Cohere published a paper criticizing LMArena for unfair practices favoring incumbents like OpenAI, DeepMind, X.ai, and Meta AI Fair. The Qwen3 family by Alibaba was released, featuring models up to 235B MoE, supporting 119 languages and trained on 36 trillion tokens, with integration into vLLM and support in tools like llama.cpp. Meta announced the second round of Llama Impact Grants to promote open-source AI innovation. Discussions on AI Twitter highlighted concerns about leaderboard overfitting and fairness in model benchmarking, with notable commentary from karpathy and others.
Cognition's DeepWiki, a free encyclopedia of all GitHub repos
o4-mini perception-encoder qwen-2.5-vl dia-1.6b grok-3 gemini-2.5-pro claude-3.7 gpt-4.1 cognition meta-ai-fair alibaba hugging-face openai perplexity-ai vllm vision text-to-speech reinforcement-learning ocr model-releases model-integration open-source frameworks chatbots model-selector silas-alberti mervenoyann reach_vb aravsrinivas vikparuchuri lioronai
Silas Alberti of Cognition announced DeepWiki, a free encyclopedia of all GitHub repos providing Wikipedia-like descriptions and Devin-backed chatbots for public repos. Meta released Perception Encoders (PE) with A2.0 license, outperforming InternVL3 and Qwen2.5VL on vision tasks. Alibaba launched the Qwen Chat App for iOS and Android. Hugging Face integrated the Dia 1.6B SoTA text-to-speech model via FAL. OpenAI expanded deep research usage with a lightweight version powered by o4-mini model, now available to free users. Perplexity AI updated their model selector with Grok 3 Beta, o4-mini, and support for models like gemini 2.5 pro, claude 3.7, and gpt-4.1. vLLM project introduced OpenRLHF framework for reinforcement learning with human feedback. Surya OCR alpha model supports 90+ languages and LaTeX. MegaParse open-source library was introduced for LLM-ready data formats.
gpt-image-1 - ChatGPT's imagegen model, confusingly NOT 4o, now available in API
gpt-image-1 o3 o4-mini gpt-4.1 eagle-2.5-8b gpt-4o qwen2.5-vl-72b openai nvidia hugging-face x-ai image-generation content-moderation benchmarking long-context multimodality model-performance supercomputing virology video-understanding model-releases kevinweil lmarena_ai _philschmid willdepue arankomatsuzaki epochairesearch danhendrycks reach_vb mervenoyann _akhaliq
OpenAI officially launched the gpt-image-1 API for image generation and editing, supporting features like alpha channel transparency and a "low" content moderation policy. OpenAI's models o3 and o4-mini are leading in benchmarks for style control, math, coding, and hard prompts, with o3 ranking #1 in several categories. A new benchmark called Vending-Bench reveals performance variance in LLMs on extended tasks. GPT-4.1 ranks in the top 5 for hard prompts and math. Nvidia's Eagle 2.5-8B matches GPT-4o and Qwen2.5-VL-72B in long-video understanding. AI supercomputer performance doubles every 9 months, with xAI's Colossus costing an estimated $7 billion and the US dominating 75% of global performance. The Virology Capabilities Test shows OpenAI's o3 outperforms 94% of expert virologists. Nvidia also released the Describe Anything Model (DAM), a multimodal LLM for detailed image and video captioning, now available on Hugging Face.
not much happened today; New email provider for AINews
gpt-4.1 gpt-4o gpt-4o-mini gemini-2.5-flash seaweed-7b claude embed-4 grok smol-ai resend openai google bytedance anthropic cohere x-ai email-deliverability model-releases reasoning video-generation multimodality embedding-models agentic-workflows document-processing function-calling tool-use ai-coding adcock_brett swyx jerryjliu0 alexalbert omarsar0
Smol AI is migrating its AI news email service to Resend to improve deliverability and enable new features like personalizable AI news and a "Hacker News of AI." Recent AI model updates include OpenAI's API-only GPT-4.1, Google Gemini 2.5 Flash reasoning model, ByteDance Seaweed 7B-param video AI, Anthropic Claude's values system, Cohere Embed 4 multimodal embedding model, and xAI Grok updates with Memory and Studio features. Discussions also cover agentic workflows for document automation and AI coding patterns.
Gemini 2.5 Flash completes the total domination of the Pareto Frontier
gemini-2.5-flash o3 o4-mini google openai anthropic tool-use multimodality benchmarking reasoning reinforcement-learning open-source model-releases chain-of-thought coding-agent sama kevinweil markchen90 alexandr_wang polynoamial scaling01 aidan_mclau cwolferesearch
Gemini 2.5 Flash is introduced with a new "thinking budget" feature offering more control compared to Anthropic and OpenAI models, marking a significant update in the Gemini series. OpenAI launched o3 and o4-mini models, emphasizing advanced tool use capabilities and multimodal understanding, with o3 dominating several leaderboards but receiving mixed benchmark reviews. The importance of tool use in AI research and development is highlighted, with OpenAI Codex CLI announced as a lightweight open-source coding agent. The news reflects ongoing trends in AI model releases, benchmarking, and tool integration.
>$41B raised today (OpenAI @ 300b, Cursor @ 9.5b, Etched @ 1.5b)
deepseek-v3-0324 gemini-2.5-pro claude-3.7-sonnet openai deepseek gemini cursor etched skypilot agent-evals open-models model-releases model-performance coding multimodality model-deployment cost-efficiency agent-evaluation privacy kevinweil sama lmarena_ai scaling01 iscienceluvr stevenheidel lepikhin dzhng raizamrtn karpathy
OpenAI is preparing to release a highly capable open language model, their first since GPT-2, with a focus on reasoning and community feedback, as shared by @kevinweil and @sama. DeepSeek V3 0324 has achieved the #5 spot on the Arena leaderboard, becoming the top open model with an MIT license and cost advantages. Gemini 2.5 Pro is noted for outperforming models like Claude 3.7 Sonnet in coding tasks, with upcoming pricing and improvements expected soon. New startups like Sophont are building open multimodal foundation models for healthcare. Significant fundraises include Cursor closing $625M at a $9.6B valuation and Etched raising $85M at $1.5B. Innovations in AI infrastructure include SkyPilot's cost-efficient cloud provisioning and the launch of AgentEvals, an open-source package for evaluating AI agents. Discussions on smartphone privacy highlight iPhone's stronger user defense compared to Android.
Cohere's Command A claims #3 open model spot (after DeepSeek and Gemma)
command-a mistral-ai-small-3.1 smoldocling qwen-2.5-vl cohere mistral-ai hugging-face context-windows multilinguality multimodality fine-tuning benchmarking ocr model-performance model-releases model-optimization aidangomez sophiamyang mervenoyann aidan_mclau reach_vb lateinteraction
Cohere's Command A model has solidified its position on the LMArena leaderboard, featuring an open-weight 111B parameter model with an unusually long 256K context window and competitive pricing. Mistral AI released the lightweight, multilingual, and multimodal Mistral AI Small 3.1 model, optimized for single RTX 4090 or Mac 32GB RAM setups, with strong performance on instruct and multimodal benchmarks. The new OCR model SmolDocling offers fast document reading with low VRAM usage, outperforming larger models like Qwen2.5VL. Discussions highlight the importance of system-level improvements over raw LLM advancements, and MCBench is recommended as a superior AI benchmark for evaluating model capabilities across code, aesthetics, and awareness.
The new OpenAI Agents Platform
reka-flash-3 o1-mini claude-3-7-sonnet llama-3-3-70b sonic-2 qwen-chat olympiccoder openai reka-ai hugging-face deepseek togethercompute alibaba ai-agents api model-releases fine-tuning reinforcement-learning model-training model-inference multimodality voice-synthesis gpu-clusters model-distillation performance-optimization open-source sama reach_vb
OpenAI introduced a comprehensive suite of new tools for AI agents, including the Responses API, Web Search Tool, Computer Use Tool, File Search Tool, and an open-source Agents SDK with integrated observability tools, marking a significant step towards the "Year of Agents." Meanwhile, Reka AI open-sourced Reka Flash 3, a 21B parameter reasoning model that outperforms o1-mini and powers their Nexus platform, with weights available on Hugging Face. The OlympicCoder series surpassed Claude 3.7 Sonnet and much larger models on competitive coding benchmarks. DeepSeek built a 32K GPU cluster capable of training V3-level models in under a week and is exploring AI distillation. Hugging Face announced Cerebras inference support, achieving over 2,000 tokens/s on Llama 3.3 70B, 70x faster than leading GPUs. Reka's Sonic-2 voice AI model delivers 40ms latency via the Together API. Alibaba's Qwen Chat enhanced its multimodal interface with video understanding up to 500MB, voice-to-text, guest mode, and expanded file uploads. Sama praised OpenAI's new API as "one of the most well-designed and useful APIs ever."
not much happened today
aya-vision-8b aya-vision-32b llama-3-2-90b-vision molmo-72b phi-4-mini phi-4-multimodal cogview4 wan-2-1 weights-and-biases coreweave cohereforai microsoft alibaba google llamaindex weaviate multilinguality vision multimodality image-generation video-generation model-releases benchmarking funding agentic-ai model-performance mervenoyann reach_vb jayalammar sarahookr aidangomez nickfrosst dair_ai akhaliq bobvanluijt jerryjliu0
Weights and Biases announced a $1.7 billion acquisition by CoreWeave ahead of CoreWeave's IPO. CohereForAI released the Aya Vision models (8B and 32B parameters) supporting 23 languages, outperforming larger models like Llama-3.2 90B Vision and Molmo 72B. Microsoft introduced Phi-4-Mini (3.8B parameters) and Phi-4-Multimodal models, excelling in math, coding, and multimodal benchmarks. CogView4, a 6B parameter text-to-image model with 2048x2048 resolution and Apache 2.0 license, was released. Alibaba launched Wan 2.1, an open-source video generation model with 720p output and 16 fps generation. Google announced new AI features for Pixel devices including Scam Detection and Gemini integrations. LlamaCloud reached General Availability and raised $19M Series A funding, serving over 100 Fortune 500 companies. Weaviate launched the Query Agent, the first of three Weaviate Agents.
GPT 4.5 — Chonky Orion ships!
gpt-4.5 phi-4-multimodal phi-4-mini command-r7b-arabic openai microsoft cohere creative-writing natural-language-processing multimodality math coding context-windows model-releases open-source arabic-language sama kevinweil aidan_mclau omarsar0 rasbt reach_vb
OpenAI released GPT-4.5 as a research preview, highlighting its deep world knowledge, improved understanding of user intent, and a 128,000 token context window. It is noted for excelling in writing, creative tasks, image understanding, and data extraction but is not a reasoning model. Microsoft unveiled Phi-4 Multimodal and Phi-4 Mini, open-source models integrating text, vision, and speech/audio, with strong performance in math and coding tasks. Cohere released Command R7B Arabic, an open-weights model optimized for Arabic language capabilities targeting enterprises in the MENA region. The community is exploring the impact of larger models on creative writing, intent understanding, and world knowledge, with GPT-4.5 expected to be a basis for GPT-5.
lots of small launches
gpt-4o claude-3.7-sonnet claude-3.7 claude-3.5-sonnet deepseek-r1 deepseek-v3 grok-3 openai anthropic amazon cloudflare perplexity-ai deepseek-ai togethercompute elevenlabs elicitorg inceptionailabs mistral-ai voice model-releases cuda gpu-optimization inference open-source api model-performance token-efficiency context-windows cuda jit-compilation lmarena_ai alexalbert__ aravsrinivas reach_vb
GPT-4o Advanced Voice Preview is now available for free ChatGPT users with enhanced daily limits for Plus and Pro users. Claude 3.7 Sonnet has achieved the top rank in WebDev Arena with improved token efficiency. DeepSeek-R1 with 671B parameters benefits from the Together Inference platform optimizing NVIDIA Blackwell GPU usage, alongside the open-source DeepGEMM CUDA library delivering up to 2.7x speedups on Hopper GPUs. Perplexity launched a new Voice Mode and a Deep Research API. The upcoming Grok 3 API will support a 1M token context window. Several companies including Elicit, Amazon, Anthropic, Cloudflare, FLORA, Elevenlabs, and Inception Labs announced new funding rounds, product launches, and model releases.
not much happened today
grok-3 deepseek-r1 siglip-2 o3-mini-high r1-1776 llamba-1b llamba-3b llamba-8b llama-3 alphamaze audiobox-aesthetics xai nvidia google-deepmind anthropic openai bytedance ollama meta-ai-fair benchmarking model-releases performance reasoning multimodality semantic-understanding ocr multilinguality model-distillation recurrent-neural-networks visual-reasoning audio-processing scaling01 iscienceluvr philschmid arankomatsuzaki reach_vb mervenoyann wightmanr lmarena_ai ollama akhaliq
Grok-3, a new family of LLMs from xAI using 200,000 Nvidia H100 GPUs for advanced reasoning, outperforms models from Google, Anthropic, and OpenAI on math, science, and coding benchmarks. DeepSeek-R1 from ByteDance Research achieves top accuracy on the challenging SuperGPQA dataset. SigLIP 2 from GoogleDeepMind improves semantic understanding and OCR with flexible resolutions and multilingual capabilities, available on HuggingFace. OpenAI's o3-mini-high ranks #1 in coding and math prompts. Perplexity's R1 1776, a post-trained version of DeepSeek R1, is available on Ollama. The Llamba family distills Llama-3.x into efficient recurrent models with higher throughput. AlphaMaze combines DeepSeek R1 with GRPO for visual reasoning on ARC-AGI puzzles. Audiobox Aesthetics from Meta AI offers unified quality assessment for audio. The community notes that Grok 3's compute increase yields only modest performance gains.
o3-mini launches, OpenAI on "wrong side of history"
o3-mini o1 gpt-4o mistral-small-3-24b deepseek-r1 openai mistral-ai deepseek togethercompute fireworksai_hq ai-gradio replicate reasoning safety cost-efficiency model-performance benchmarking api open-weight-models model-releases sam-altman
OpenAI released o3-mini, a new reasoning model available for free and paid users with a "high" reasoning effort option that outperforms the earlier o1 model on STEM tasks and safety benchmarks, costing 93% less per token. Sam Altman acknowledged a shift in open source strategy and credited DeepSeek R1 for influencing assumptions. MistralAI launched Mistral Small 3 (24B), an open-weight model with competitive performance and low API costs. DeepSeek R1 is supported by Text-generation-inference v3.1.0 and available via ai-gradio and replicate. The news highlights advancements in reasoning, cost-efficiency, and safety in AI models.
ModernBert: small new Retriever/Classifier workhorse, 8k context, 2T tokens,
modernbert gemini-2.0-flash-thinking o1 llama answerdotai lightonio hugging-face google-deepmind openai meta-ai-fair figure encoder-only-models long-context alternating-attention natural-language-understanding reasoning robotics-simulation physics-engine humanoid-robots model-performance model-releases jeremyphoward alec-radford philschmid drjimfan bindureddy
Answer.ai/LightOn released ModernBERT, an updated encoder-only model with 8k token context, trained on 2 trillion tokens including code, with 139M/395M parameters and state-of-the-art performance on retrieval, NLU, and code tasks. It features Alternating Attention layers mixing global and local attention. Gemini 2.0 Flash Thinking debuted as #1 in Chatbot Arena, and the O1 model scored top in reasoning benchmarks. Llama downloads surpassed 650 million, doubling in 3 months. OpenAI launched desktop app integrations with voice capabilities. Figure delivered its first humanoid robots commercially. Advances in robotics simulation and a new physics engine Genesis claiming 430,000x faster than real-time were highlighted.
Google wakes up: Gemini 2.0 et al
gemini-2.0-flash gemini-1.5-pro gemini-exp-1206 claude-3.5-sonnet opus google-deepmind openai apple multimodality agent-development multilinguality benchmarking model-releases demis-hassabis sundar-pichai paige-bailey bindureddy
Google DeepMind launched Gemini 2.0 Flash, a new multimodal model outperforming Gemini 1.5 Pro and o1-preview, featuring vision and voice APIs, multilingual capabilities, and native tool use. It powers new AI agents like Project Astra and Project Mariner, with Project Mariner achieving state-of-the-art 83.5% on the WebVoyager benchmark. OpenAI announced ChatGPT integration with Apple devices, enabling Siri access and visual intelligence features. Claude 3.5 Sonnet is noted as a distilled version of Opus. The AI community's response at NeurIPS 2024 has been overwhelmingly positive, signaling a strong comeback for Google in AI innovation. Key topics include multimodality, agent development, multilinguality, benchmarking, and model releases.
Qwen with Questions: 32B open weights reasoning model nears o1 in GPQA/AIME/Math500
deepseek-r1 qwq gpt-4o claude-3.5-sonnet qwen-2.5 llama-cpp deepseek sambanova hugging-face dair-ai model-releases benchmarking fine-tuning sequential-search inference model-deployment agentic-rag external-tools multi-modal-models justin-lin clementdelangue ggerganov vikparuchuri
DeepSeek r1 leads the race for "open o1" models but has yet to release weights, while Justin Lin released QwQ, a 32B open weight model that outperforms GPT-4o and Claude 3.5 Sonnet on benchmarks. QwQ appears to be a fine-tuned version of Qwen 2.5, emphasizing sequential search and reflection for complex problem-solving. SambaNova promotes its RDUs as superior to GPUs for inference tasks, highlighting the shift from training to inference in AI systems. On Twitter, Hugging Face announced CPU deployment for llama.cpp instances, Marker v1 was released as a faster and more accurate deployment tool, and Agentic RAG developments focus on integrating external tools and advanced LLM chains for improved response accuracy. The open-source AI community sees growing momentum with models like Flux gaining popularity, reflecting a shift towards multi-modal AI models including image, video, audio, and biology.
Gemini (Experimental-1114) retakes #1 LLM rank with 1344 Elo
claude-3-sonnet gpt-4 gemini-1.5 claude-3.5-sonnet anthropic openai langchain meta-ai-fair benchmarking prompt-engineering rag visuotactile-perception ai-governance theoretical-alignment ethical-alignment jailbreak-robustness model-releases alignment richardmcngo andrewyng philschmid
Anthropic released the 3.5 Sonnet benchmark for jailbreak robustness, emphasizing adaptive defenses. OpenAI enhanced GPT-4 with a new RAG technique for contiguous chunk retrieval. LangChain launched Promptim for prompt optimization. Meta AI introduced NeuralFeels with neural fields for visuotactile perception. RichardMCNgo resigned from OpenAI, highlighting concerns on AI governance and theoretical alignment. Discussions emphasized the importance of truthful public information and ethical alignment in AI deployment. The latest Gemini update marks a new #1 LLM amid alignment challenges. The AI community continues to focus on benchmarking, prompt-engineering, and alignment issues.
not much happened today
smollm2 llama-3-2 stable-diffusion-3.5 claude-3.5-sonnet gemini openai anthropic google meta-ai-fair suno-ai perplexity-ai on-device-ai model-performance robotics multimodality ai-regulation model-releases natural-language-processing prompt-engineering agentic-ai ai-application model-optimization sam-altman akhaliq arav-srinivas labenz loubnabenallal1 alexalbert fchollet stasbekman svpino rohanpaul_ai hamelhusain
ChatGPT Search was launched by Sam Altman, who called it his favorite feature since ChatGPT's original launch, doubling his usage. Comparisons were made between ChatGPT Search and Perplexity with improvements noted in Perplexity's web navigation. Google introduced a "Grounding" feature in the Gemini API & AI Studio enabling Gemini models to access real-time web information. Despite Gemini's leaderboard performance, developer adoption lags behind OpenAI and Anthropic. SmolLM2, a new small, powerful on-device language model, outperforms Meta's Llama 3.2 1B. A Claude desktop app was released for Mac and Windows. Meta AI announced robotics advancements including Meta Sparsh, Meta Digit 360, and Meta Digit Plexus. Stable Diffusion 3.5 Medium, a 2B parameter model with a permissive license, was released. Insights on AGI development suggest initial inferiority but rapid improvement. Anthropic advocates for early targeted AI regulation. Discussions on ML specialization predict training will concentrate among few companies, while inference becomes commoditized. New AI tools include Suno AI Personas for music creation, PromptQL for natural language querying over data, and Agent S for desktop task automation. Humor was shared about Python environment upgrades.
Did Nvidia's Nemotron 70B train on test?
nemotron-70b llama-3.1-70b llama-3.1 ministral-3b ministral-8b gpt-4o claude-3.5-sonnet claude-3.5 nvidia mistral-ai hugging-face zep benchmarking reinforcement-learning reward-models temporal-knowledge-graphs memory-layers context-windows model-releases open-source reach_vb philschmid swyx
NVIDIA's Nemotron-70B model has drawn scrutiny despite strong benchmark performances on Arena Hard, AlpacaEval, and MT-Bench, with some standard benchmarks like GPQA and MMLU Pro showing no improvement over the base Llama-3.1-70B. The new HelpSteer2-Preference dataset improves some benchmarks with minimal losses elsewhere. Meanwhile, Mistral released Ministral 3B and 8B models featuring 128k context length and outperforming Llama-3.1 and GPT-4o on various benchmarks under the Mistral Commercial License. NVIDIA's Nemotron 70B also surpasses GPT-4o and Claude-3.5-Sonnet on key benchmarks using RLHF (REINFORCE) training. Additionally, Zep introduced Graphiti, an open-source temporal knowledge graph memory layer for AI agents, built on Neo4j.
not much happened today
llama-3 llama-3-1 grok-2 claude-3.5-sonnet gpt-4-turbo nous-research nvidia salesforce goodfire-ai anthropic x-ai google-deepmind box langchain fine-tuning prompt-caching mechanistic-interpretability model-performance multimodality agent-frameworks software-engineering-agents api document-processing text-generation model-releases vision image-generation efficiency scientific-discovery fchollet demis-hassabis
GPT-5 delayed again amid a quiet news day. Nous Research released Hermes 3 finetune of Llama 3 base models, rivaling FAIR's instruct tunes but sparking debate over emergent existential crisis behavior with 6% roleplay data. Nvidia introduced Minitron finetune of Llama 3.1. Salesforce launched a DEI agent scoring 55% on SWE-Bench Lite. Goodfire AI secured $7M seed funding for mechanistic interpretability work. Anthropic rolled out prompt caching in their API, cutting input costs by up to 90% and latency by 80%, aiding coding assistants and large document processing. xAI released Grok-2, matching Claude 3.5 Sonnet and GPT-4 Turbo on LMSYS leaderboard with vision+text inputs and image generation integration. Claude 3.5 Sonnet reportedly outperforms GPT-4 in coding and reasoning. François Chollet defined intelligence as efficient operationalization of past info for future tasks. Salesforce's DEI framework surpasses individual agent performance. Google DeepMind's Demis Hassabis discussed AGI's role in scientific discovery and safe AI development. Dora AI plugin generates landing pages in under 60 seconds, boosting web team efficiency. Box AI API beta enables document chat, data extraction, and content summarization. LangChain updated Python & JavaScript integration docs.
Gemini Live
gemini-1.5-pro genie falcon-mamba gemini-1.5 llamaindex google anthropic tii supabase perplexity-ai llamaindex openai hugging-face multimodality benchmarking long-context retrieval-augmented-generation open-source model-releases model-integration model-performance software-engineering linear-algebra hugging-face-hub debugging omarsar0 osanseviero dbrxmosaicai alphasignalai perplexity_ai _jasonwei svpino
Google launched Gemini Live on Android for Gemini Advanced subscribers during the Pixel 9 event, featuring integrations with Google Workspace apps and other Google services. The rollout began on 8/12/2024, with iOS support planned. Anthropic released Genie, an AI software engineering system achieving a 57% improvement on SWE-Bench. TII introduced Falcon Mamba, a 7B attention-free open-access model scalable to long sequences. Benchmarking showed that longer context lengths do not always improve Retrieval-Augmented Generation. Supabase launched an AI-powered Postgres service dubbed the "ChatGPT of databases," fully open source. Perplexity AI partnered with Polymarket to integrate real-time probability predictions into search results. A tutorial demonstrated a multimodal recipe recommender using Qdrant, LlamaIndex, and Gemini. An OpenAI engineer shared success tips emphasizing debugging and hard work. The connection between matrices and graphs in linear algebra was highlighted for insights into nonnegative matrices and strongly connected components. Keras 3.5.0 was released with Hugging Face Hub integration for model saving and loading.
Nothing much happened today
chameleon-7b chameleon-30b xlam-1b gpt-3.5 phi-3-mini mistral-7b-v3 huggingface truth_terminal microsoft apple openai meta-ai-fair yi axolotl amd salesforce function-calling multimodality model-releases model-updates model-integration automaticity procedural-memory text-image-video-generation
HuggingFace released a browser-based timestamped Whisper using transformers.js. A Twitter bot by truth_terminal became the first "semiautonomous" bot to secure VC funding. Microsoft and Apple abruptly left the OpenAI board amid regulatory scrutiny. Meta is finalizing a major upgrade to Reddit comments addressing hallucination issues. The Yi model gained popularity on GitHub with 7.4K stars and 454 forks, with potential integration with Axolotl for pregeneration and preprocessing. AMD technologies enable household/small business AI appliances. Meta released Chameleon-7b and Chameleon-30b models on HuggingFace supporting unified text and image tokenization. Salesforce's xLAM-1b model outperforms GPT-3.5 in function calling despite its smaller size. Anole pioneered open-source multimodal text-image-video generation up to 720p 144fps. Phi-3 Mini expanded from 3.8B to 4.7B parameters with function calling, competing with Mistral-7b v3. "System 2 distillation" in humans relates to automaticity and procedural memory.
Not much happened today
gpt-4o gemini-1.5-pro gemini-1.5-flash imagen-3 veo reka-core qwen-1.5-110b openai google-deepmind anthropic rekailabs alibaba salesforce multimodality long-context model-releases reinforcement-learning model-benchmarking text-to-image video-generation ai-assistants ilya-sutskever jakub-pachocki mike-krieger sama
Ilya Sutskever steps down as Chief Scientist at OpenAI after nearly a decade, with Jakub Pachocki named as his successor. Google DeepMind announces Gemini 1.5 Pro and Gemini 1.5 Flash models featuring 2 million token context and improved multimodal capabilities, alongside demos of Project Astra AI assistant, Imagen 3 text-to-image model, and Veo generative video model. GPT-4o tops the VHELM leaderboard and outperforms competitors on LMSYS Chatbot Arena. Reka Core multimodal model with 128K context and Alibaba's Qwen1.5-110B open-source model are released. Salesforce shares an online RLHF recipe.