Model: "o3"
Execuhires Round 2: Scale-Meta, Lamini-AMD, and Instacart-OpenAI
o3-pro o3 o1-pro gpt-4o gpt-4.1 gpt-4.1-mini gpt-4.1-nano meta-ai-fair scale-ai lamini amd openai gemini google anthropic model-release benchmarking reasoning fine-tuning pricing model-performance direct-preference-optimization complex-problem-solving alexandr_wang sharon_zhou fidji_simo sama jack_rae markchen90 kevinweil gdb gregkamradt lechmazur wesrothmoney paul_cal imjaredz cto_junior johnowhitaker polynoamial scaling01
Meta hires Scale AI's Alexandr Wang to lead its new "Superintelligence" division following a $15 billion investment for a 49% stake in Scale. Lamini's Sharon Zhou joins AMD as VP of AI under Lisa Su, while Instacart's Fidji Simo becomes CEO of Apps at OpenAI under Sam Altman. Meta offers over $10 million/year compensation packages to top researchers, successfully recruiting Jack Rae from Gemini. OpenAI releases the o3-pro model to ChatGPT Pro users and the API, outperforming o3 and setting new records on benchmarks such as Extended NYT Connections and SnakeBench. Despite being slower than o1-pro, o3-pro excels in reasoning and complex problem-solving. OpenAI cuts o3 pricing by 80%, making it cheaper than GPT-4o and pressuring competitors like Google and Anthropic to lower prices. Users can now fine-tune the GPT-4.1 family using direct preference optimization (DPO) for subjective tasks.
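DPO fine-tuning trains on preference pairs rather than single gold completions. A minimal sketch of what a preference record and job payload might look like, following OpenAI's documented DPO fine-tuning format; the model snapshot name, file ID, prompt text, and beta value here are illustrative placeholders, not values from the announcement:

```python
import json

# One preference record: a prompt plus a preferred and a non-preferred completion.
record = {
    "input": {
        "messages": [{"role": "user", "content": "Write a tagline for a coffee shop."}]
    },
    "preferred_output": [
        {"role": "assistant", "content": "Slow mornings, strong coffee."}
    ],
    "non_preferred_output": [
        {"role": "assistant", "content": "We sell coffee here."}
    ],
}

# Job payload selecting the DPO method instead of the default supervised one.
job = {
    "model": "gpt-4.1-mini-2025-04-14",  # assumed snapshot name
    "training_file": "file-abc123",      # placeholder uploaded-file ID
    "method": {
        "type": "dpo",
        "dpo": {"hyperparameters": {"beta": 0.1}},  # beta weights the preference margin
    },
}

print(json.dumps(job["method"], indent=2))
```

With the Python SDK this would be submitted via `client.fine_tuning.jobs.create(**job)`.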
Reasoning Price War 2: Mistral Magistral + o3's 80% price cut + o3-pro
o3 o3-pro gpt-4.1 claude-4-sonnet gemini-2.5-pro magistral-small magistral-medium mistral-small-3.1 openai anthropic google-deepmind mistral-ai perplexity-ai reasoning token-efficiency price-cut benchmarking open-source model-releases context-windows gpu-optimization swyx sama scaling01 polynoamial nrehiew_ kevinweil gdb flavioad stevenheidel aravsrinivas
OpenAI announced an 80% price cut for its o3 model, making it competitively priced with GPT-4.1 and rivaling Anthropic's Claude 4 Sonnet and Google's Gemini 2.5 Pro. Alongside the price cut, o3-pro was released as a more powerful and reliable variant, though early benchmarks showed mixed performance relative to cost. Mistral AI launched its Magistral reasoning models, including an open-source 24B parameter version optimized for efficient deployment on consumer GPUs. The price reduction and new model releases signal intensified competition in reasoning-focused large language models, with notable improvements in token efficiency and cost-effectiveness.
not much happened today
deepseek-r1-0528 o3 gemini-2.5-pro claude-opus-4 deepseek_ai openai gemini meta-ai-fair anthropic x-ai ollama hugging-face alibaba bytedance xiaomi reasoning reinforcement-learning benchmarking quantization local-inference model-evaluation open-weights transparency post-training agentic-benchmarks long-context hallucination-detection teortaxestex wenfeng danielhanchen awnihannun reach_vb abacaj
DeepSeek R1-0528 release brings major improvements in reasoning, hallucination reduction, JSON output, and function calling, matching or surpassing closed models like OpenAI o3 and Gemini 2.5 Pro on benchmarks such as Artificial Analysis Intelligence Index, LiveBench, and GPQA Diamond. The model ranks #2 globally in open weights intelligence, surpassing Meta AI, Anthropic, and xAI. Open weights and technical transparency have fueled rapid adoption across platforms like Ollama and Hugging Face. Chinese AI labs including DeepSeek, Alibaba, ByteDance, and Xiaomi now match or surpass US labs in model releases and intelligence, driven by open weights strategies. Reinforcement learning post-training is critical for intelligence gains, mirroring trends seen at OpenAI. Optimized quantization techniques (1-bit, 4-bit) and local inference enable efficient experimentation on consumer hardware. New benchmarks like LisanBench test knowledge, planning, memory, and long-context reasoning, with OpenAI o3 and Claude Opus 4 leading. Discussions highlight concerns about benchmark contamination and overemphasis on RL-tuned gains.
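The appeal of 1-bit and 4-bit quantization for local inference comes down to simple arithmetic: weight memory scales linearly with bits per parameter. A back-of-envelope sketch (weights only; this ignores activations, KV cache, and the metadata overhead real quantized formats add):

```python
def weight_gb(params_b: float, bits: float) -> float:
    """Approximate weight-only memory in GB for a model with params_b
    billion parameters stored at the given bits per weight."""
    return params_b * 1e9 * bits / 8 / 1e9

# DeepSeek R1's ~671B total parameters at a few precisions:
for bits in (16, 4, 1.58):
    print(f"{bits:>5} bits -> {weight_gb(671, bits):,.1f} GB")
```

At 16-bit the weights alone need well over a terabyte; 4-bit brings that to roughly a third of a terabyte, which is what makes consumer-hardware experimentation plausible at all.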
Mistral's Agents API and the 2025 LLM OS
qwen claude-4 chatgpt o3 o4 mistral-ai langchain-ai openai meta-ai-fair agent-frameworks multi-agent-systems tool-use code-execution web-search model-context-protocol persistent-memory function-calling open-source no-code reinforcement-learning model-performance agent-orchestration omarsar0 simonw swyx scaling01
The LLM OS concept has evolved since 2023, with Mistral AI releasing a new Agents API that includes code execution, web search, persistent memory, and agent orchestration. LangChainAI introduced the Open Agent Platform (OAP), an open-source no-code platform for intelligent agents. OpenAI plans to develop ChatGPT into a super-assistant by H1 2025, competing with Meta. Discussions around Qwen models focus on reinforcement learning effects, while Claude 4 performance is also noted. The AI Engineer World's Fair is calling for volunteers.
not much happened today
chatgpt o3 o4 bagel-7b medgemma acereason-nemotron-14b codex gemini openai bytedance google nvidia sakana-ai-labs deep-learning-ai gemini agenticseek anthropic agentic-systems multimodality reasoning code-generation prompt-engineering privacy ethical-ai emergence synthetic-data speech-instruction-tuning low-resource-languages humor scaling01 mervenoyann sakananailabs _philschmid omarsar0 teortaxestex andrewlampinen sedielem cis_female
OpenAI plans to evolve ChatGPT into a super-assistant by 2025 with models like o3 and o4 enabling agentic tasks and supporting a billion users. Recent multimodal and reasoning model releases include ByteDance's BAGEL-7B, Google's MedGemma, and NVIDIA's ACEReason-Nemotron-14B. The Sudoku-Bench Leaderboard highlights ongoing challenges in AI creative reasoning. In software development, OpenAI's Codex aids code generation and debugging, while Gemini's Context URL tool enhances prompt context. AgenticSeek offers a local, privacy-focused alternative for autonomous agents. Ethical concerns are raised about AGI development priorities and Anthropic's alignment with human values. Technical discussions emphasize emergence in AI and training challenges, with humor addressing misconceptions about Gemini 3.0 and async programming in C. A novel synthetic speech training method enables instruction tuning of LLMs without real speech data, advancing low-resource language support.
Gemini's AlphaEvolve agent uses Gemini 2.0 to find new Math and cuts Gemini cost 1% — without RL
gemini gpt-4.1 gpt-4o-mini o3 o4-mini google-deepmind openai algorithm-discovery coding-agents matrix-multiplication optimization reinforcement-learning model-weights training-efficiency safety-evaluations instruction-following coding-tasks model-releases _philschmid scott_swingle alex_dimakis henry jason_wei kevinweil michpokrass scaling01 gdb
DeepMind's AlphaEvolve, a 2025 update to AlphaTensor and FunSearch, is a Gemini-powered coding agent for algorithm discovery that designs faster matrix multiplication algorithms, solves open math problems, and improves data center and AI training efficiency. It achieves a 23% kernel speedup in Gemini training and surpasses state-of-the-art on 20% of applied problems, including improvements on the Minimum Overlap Problem and Kissing number problem. Unlike Deep-RL, it optimizes code pieces rather than model weights. Meanwhile, OpenAI released GPT-4.1 in ChatGPT, specializing in coding and instruction following, with a faster alternative GPT-4.1 mini replacing GPT-4o mini for all users. OpenAI also launched the Safety Evaluations Hub and the OpenAI to Z Challenge using o3/o4 mini and GPT-4.1 models to discover archaeological sites. "Maybe midtrain + good search is all you need for AI for scientific innovation" - Jason Wei.
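The "optimize code pieces, not weights" idea can be caricatured in a few lines: keep a pool of candidate programs, score them, and ask a proposer (a Gemini model in the real system; a seeded random mutation in this toy) for variants of the best. A hedged sketch of the search loop only, not the actual AlphaEvolve algorithm:

```python
import random

random.seed(0)  # deterministic toy run

def evolve(candidates, score, mutate, generations=200, keep=4):
    """Toy evolutionary loop in the spirit of FunSearch/AlphaEvolve:
    evolve candidates (here just integers standing in for programs)
    against a fitness function, keeping the best few each round."""
    pool = list(candidates)
    for _ in range(generations):
        pool.sort(key=score, reverse=True)
        pool = pool[:keep] + [mutate(p) for p in pool[:keep]]
    return max(pool, key=score)

# Toy task: evolve the value 42 starting from 0.
best = evolve(
    candidates=[0],
    score=lambda x: -abs(42 - x),
    mutate=lambda x: x + random.randint(-5, 5),
)
print(best)
```

The real system replaces the integer with a program, the fitness function with compiled-and-benchmarked kernel performance, and the random mutation with LLM-proposed code edits.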
not much happened today
hunyuan-turbos qwen3-235b-a22b o3 gpt-4.1-nano grok-3 gemini-2.5-pro seed1.5-vl kling-2.0 tencent openai bytedance meta-ai-fair nvidia deepseek benchmarking model-performance moe reasoning vision video-understanding vision-language multimodality model-evaluation model-optimization lmarena_ai artificialanlys gdb _jasonwei iScienceLuvr _akhaliq _philschmid teortaxesTex mervenoyann reach_vb
Tencent's Hunyuan-Turbos has risen to #8 on the LMArena leaderboard, showing strong performance across major categories and significant improvement since February. The Qwen3 model family, especially the Qwen3 235B-A22B (Reasoning) model, is noted for its intelligence and efficient parameter usage. OpenAI introduced HealthBench, a new health evaluation benchmark developed with input from over 250 physicians, where models like o3, GPT-4.1 nano, and Grok 3 showed strong results. ByteDance released Seed1.5-VL, a vision-language model with a 532M-parameter vision encoder and a 20B active parameter MoE LLM, achieving state-of-the-art results on 38 public benchmarks. In vision-language, Kling 2.0 leads image-to-video generation, and Gemini 2.5 Pro excels in video understanding with advanced multimodal capabilities. Meta's Vision-Language-Action framework and updates on VLMs for 2025 were also highlighted.
not much happened today
qwen3-14b qwen3-32b qwen3-235b phi-4-reasoning o3-mini command-a gemini-2.5-pro o4-mini olm-o2-1b o3 alibaba together-ai scaling01 microsoft deepseek cohere google epoch-ai-research inception-labs openai allenai quantization fine-tuning reinforcement-learning benchmarking video-generation diffusion-models model-performance model-evaluation model-release text-generation cline _philschmid iscienceluvr alexalbert__ _lewtun teortaxestex sarahookr reach_vb
Qwen model family released quantized versions of Qwen3 models including 14B, 32B, and 235B parameters, with promising coding capabilities in Qwen3-235B. Microsoft launched Phi-4-reasoning, a 14B parameter model distilled from OpenAI's o3-mini, emphasizing supervised fine-tuning and reinforcement learning, outperforming larger models in some benchmarks. Cohere's Command A leads SQL performance on Bird Bench. Google introduced the TRAJAN eval for video generation temporal consistency and updated the Gemini OpenAI compatibility layer. Inception Labs launched a diffusion LLM API claiming 5x speed improvements over autoregressive models. Community rankings show OpenAI's o3 model debuting strongly in web app-building tasks. Other releases include AllenAI's OLMo2 1B and additional Phi 4 variants. "Qwen3-235B shows promise for coding" and "Phi-4-reasoning tech report emphasizes SFT gains" highlight key advancements.
not much happened today
gpt-image-1 o3 o4-mini gpt-4.1 dam openai google anthropic epoch ai research image-generation model-benchmarks vision-language-models music-ai ai-experiences ai-research supercomputers
AI news for April 23-24, 2025, covering new model releases, benchmarks, and research developments from companies like OpenAI, Google DeepMind, Anthropic, and Epoch AI Research.
gpt-image-1 - ChatGPT's imagegen model, confusingly NOT 4o, now available in API
gpt-image-1 o3 o4-mini gpt-4.1 eagle-2.5-8b gpt-4o qwen2.5-vl-72b openai nvidia hugging-face x-ai image-generation content-moderation benchmarking long-context multimodality model-performance supercomputing virology video-understanding model-releases kevinweil lmarena_ai _philschmid willdepue arankomatsuzaki epochairesearch danhendrycks reach_vb mervenoyann _akhaliq
OpenAI officially launched the gpt-image-1 API for image generation and editing, supporting features like alpha channel transparency and a "low" content moderation policy. OpenAI's models o3 and o4-mini are leading in benchmarks for style control, math, coding, and hard prompts, with o3 ranking #1 in several categories. A new benchmark called Vending-Bench reveals performance variance in LLMs on extended tasks. GPT-4.1 ranks in the top 5 for hard prompts and math. Nvidia's Eagle 2.5-8B matches GPT-4o and Qwen2.5-VL-72B in long-video understanding. AI supercomputer performance doubles every 9 months, with xAI's Colossus costing an estimated $7 billion and the US dominating 75% of global performance. The Virology Capabilities Test shows OpenAI's o3 outperforms 94% of expert virologists. Nvidia also released the Describe Anything Model (DAM), a multimodal LLM for detailed image and video captioning, now available on Hugging Face.
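The two options called out above map to two request parameters. A sketch of a gpt-image-1 request via the OpenAI Python SDK; the parameter names follow the Images API, while the prompt and size values are illustrative, and the call itself is left commented out since it needs an API key:

```python
params = {
    "model": "gpt-image-1",
    "prompt": "A flat line-art logo of a paper airplane",
    "size": "1024x1024",
    "background": "transparent",  # alpha-channel output
    "moderation": "low",          # the relaxed "low" moderation setting
}

# With the SDK, the request would be:
#   from openai import OpenAI
#   client = OpenAI()
#   result = client.images.generate(**params)  # returns base64 image data
print(sorted(params))
```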
Grok 3 & 3-mini now API Available
grok-3 grok-3-mini gemini-2.5-flash o3 o4-mini llama-4-maverick gemma-3-27b openai llamaindex google-deepmind epochairesearch goodfireai mechanize agent-development agent-communication cli-tools reinforcement-learning model-evaluation quantization-aware-training model-compression training-compute hybrid-reasoning model-benchmarking
Grok 3 API is now available, including a smaller version called Grok 3 mini, which offers competitive pricing and full reasoning traces. OpenAI released a practical guide for building AI agents, while LlamaIndex supports the Agent2Agent protocol for multi-agent communication. Codex CLI is gaining traction with new features and competition from Aider and Claude Code. Google DeepMind launched Gemini 2.5 Flash, a hybrid reasoning model topping the Chatbot Arena leaderboard. OpenAI's o3 and o4-mini models show emergent behaviors from large-scale reinforcement learning. Epoch AI Research updated its methodology, removing Llama 4 Maverick from its list of high-FLOP models after revising its training-compute estimate downward. GoodfireAI announced a $50M Series A for its Ember neural programming platform. Mechanize was founded to build virtual work environments and automation benchmarks. Google DeepMind's Quantization-Aware Training for Gemma 3 models reduces model size significantly, with open-source checkpoints available.
Gemini 2.5 Flash completes the total domination of the Pareto Frontier
gemini-2.5-flash o3 o4-mini google openai anthropic tool-use multimodality benchmarking reasoning reinforcement-learning open-source model-releases chain-of-thought coding-agent sama kevinweil markchen90 alexandr_wang polynoamial scaling01 aidan_mclau cwolferesearch
Gemini 2.5 Flash is introduced with a new "thinking budget" feature offering more control compared to Anthropic and OpenAI models, marking a significant update in the Gemini series. OpenAI launched o3 and o4-mini models, emphasizing advanced tool use capabilities and multimodal understanding, with o3 dominating several leaderboards but receiving mixed benchmark reviews. The importance of tool use in AI research and development is highlighted, with OpenAI Codex CLI announced as a lightweight open-source coding agent. The news reflects ongoing trends in AI model releases, benchmarking, and tool integration.
OpenAI o3, o4-mini, and Codex CLI
o3 o4-mini gemini-2.5-pro claude-3-sonnet chatgpt openai reinforcement-learning performance vision tool-use open-source coding-agents model-benchmarking multimodality scaling inference sama aidan_mclau markchen90 gdb aidan_clark_ kevinweil swyx polynoamial scaling01
OpenAI launched the o3 and o4-mini models, emphasizing improvements in reinforcement-learning scaling and overall efficiency, making o4-mini cheaper and better across prioritized metrics. These models showcase enhanced vision and tool use capabilities, though API access for these features is pending. The release includes Codex CLI, an open-source coding agent that integrates with these models to convert natural language into working code. Accessibility extends to ChatGPT Plus, Pro, and Team users, with o3 being notably more expensive than Gemini 2.5 Pro. Performance benchmarks highlight the intelligence gains from scaling inference, with comparisons against models like Sonnet and Gemini. The launch has been well received despite some less favorable evaluation results.
not much happened today
gpt-4.1 o3 o4-mini grok-3 grok-3-mini o1 tpuv7 gb200 openai x-ai google nvidia samsung memory model-release hardware-accelerators fp8 hbm inference ai-conferences agent-collaboration robotics model-comparison performance power-consumption sama
OpenAI teased a Memory update in ChatGPT with limited technical details. Evidence suggests upcoming releases of o3 and o4-mini models, alongside a press leak about GPT-4.1. xAI launched the Grok 3 and Grok 3 mini APIs, confirmed as o1-level models. Discussions compared Google's TPUv7 with Nvidia's GB200, highlighting TPUv7's specs like 4,614 TFLOP/s FP8 performance, 192 GB HBM, and 1.2 Tbps ICI bandwidth. TPUv7 may have pivoted from training to inference chip use. Key AI events include Google Cloud Next 2025 and Samsung's Gemini-powered Ballie robot. The community is invited to participate in the AI Engineer World's Fair 2025 and the 2025 State of AI Engineering survey.
not much happened today
o3 o4-mini gpt-5 sonnet-3.7 gemma-3 qwen-2.5-vl gemini-2.5-pro gemma-7b llama-3-1-405b openai deepseek anthropic google meta-ai-fair inference-scaling reward-modeling coding-models ocr model-preview rate-limiting model-pricing architectural-advantage benchmarking long-form-reasoning attention-mechanisms mixture-of-experts gpu-throughput sama akhaliq nearcyan fchollet reach_vb philschmid teortaxestex epochairesearch omarsar0
OpenAI announced that o3 and o4-mini models will be released soon, with GPT-5 expected in a few months, delayed for quality improvements and capacity planning. DeepSeek introduced Self-Principled Critique Tuning (SPCT) to enhance inference-time scalability for generalist reward models. Anthropic's Sonnet 3.7 remains a top coding model. Google's Gemma 3 is available on KerasHub, and Qwen 2.5 VL powers a new Apache 2.0 licensed OCR model. Gemini 2.5 Pro entered public preview with increased rate limits and pricing announced, becoming a preferred model for many tasks except image generation. Meta's architectural advantage and the FrontierMath benchmark challenge AI's long-form reasoning and worldview development. Research reveals LLMs focus attention on the first token as an "attention sink," preserving representation diversity, demonstrated in Gemma 7B and LLaMa 3.1 models. MegaScale-Infer offers efficient serving of large-scale Mixture-of-Experts models with up to 1.90x higher per-GPU throughput.
not much happened today
chatgpt-4o deepseek-r1 o3 o3-mini gemini-2-flash qwen-2.5 qwen-0.5b hugging-face openai perplexity-ai deepseek-ai gemini qwen metr_evals reasoning benchmarking model-performance prompt-engineering model-optimization model-deployment small-language-models mobile-ai ai-agents speed-optimization _akhaliq aravsrinivas lmarena_ai omarsar0 risingsayak
Smolagents library by Huggingface continues trending. ChatGPT-4o's latest version (chatgpt-4o-latest-20250129) released. DeepSeek R1 671B sets speed record at 198 t/s, fastest reasoning model, recommended with specific prompt settings. Perplexity Deep Research outperforms models like Gemini Thinking, o3-mini, and DeepSeek-R1 on Humanity's Last Exam benchmark with 21.1% score and 93.9% accuracy on SimpleQA. ChatGPT-4o ranks #1 on Arena leaderboard in multiple categories except math. OpenAI's o3 model powers Deep Research tool for ChatGPT Pro users. Gemini 2 Flash and Qwen 2.5 models support LLMGrading verifier. Qwen 2.5 models added to PocketPal app. MLX shows small LLMs like Qwen 0.5B generate tokens at high speed on M4 Max and iPhone 16 Pro. Gemini Flash 2.0 leads new AI agent leaderboard. DeepSeek R1 is most liked on Hugging Face with over 10 million downloads.
Reasoning Models are Near-Superhuman Coders (OpenAI IOI, Nvidia Kernels)
o3 o1 o3-mini deepseek-r1 qwen-2.5 openthinker openai nvidia ollama elevenlabs sakana-ai apple reinforcement-learning gpu-kernel-optimization fine-tuning knowledge-distillation scaling-laws chain-of-thought-reasoning model-accessibility alex-wei karpathy abacaj awnihannun
The o3 model achieved a gold medal at the 2024 IOI and ranks in the 99.8 percentile on Codeforces, outperforming most humans, with reinforcement learning (RL) methods proving superior to inductive bias approaches. Nvidia's DeepSeek-R1 autonomously generates GPU kernels that surpass some expert-engineered kernels, showcasing simple yet effective AI-driven optimization. OpenAI updated o1 and o3-mini models to support file and image uploads in ChatGPT and released DeepResearch, a powerful research assistant based on the o3 model with RL for deep chain-of-thought reasoning. Ollama introduced OpenThinker models fine-tuned from Qwen2.5, outperforming some DeepSeek-R1 distillation models. ElevenLabs grew into a $3.3 billion company specializing in AI voice synthesis without open-sourcing their technology. Research highlights include Sakana AI Labs' TAID knowledge distillation method receiving a Spotlight at ICLR 2025, and Apple's work on scaling laws for mixture-of-experts (MoEs). The importance of open-source AI for scientific discovery was also emphasized.
small news items
gpt-4.5 gpt-5 deepseek-r1-distilled-qwen-1.5b o1-preview modernbert-0.3b qwen-0.5b o3 openai ollama mistral perplexity cerebras alibaba groq bytedance math benchmarking fine-tuning model-performance reinforcement-learning model-architecture partnerships funding jeremyphoward arankomatsuzaki sama nrehiew_ danhendrycks akhaliq
OpenAI announced plans for GPT-4.5 (Orion) and GPT-5, with GPT-5 integrating the o3 model and offering unlimited chat access in the free tier. DeepSeek R1 Distilled Qwen 1.5B outperforms OpenAI's o1-preview on math benchmarks, while ModernBERT 0.3b surpasses Qwen 0.5b at MMLU without fine-tuning. Mistral and Perplexity adopt Cerebras hardware for 10x performance gains. OpenAI's o3 model won a gold medal at the 2024 International Olympiad in Informatics. Partnerships include Qwen with Groq. Significant RLHF activity is noted in Nigeria and the global south, and Bytedance is expected to rise in AI prominence soon. "GPT5 is all you need."
OpenAI takes on Gemini's Deep Research
o3 o3-mini-high o3-deep-research-mini openai google-deepmind nyu uc-berkeley hku reinforcement-learning benchmarking inference-speed model-performance reasoning test-time-scaling agent-design sama danhendrycks ethan-mollick dan-shipper
OpenAI released the full version of the o3 agent, with a new Deep Research variant showing significant improvements on the HLE benchmark and achieving SOTA results on GAIA. The release includes an "inference time scaling" chart demonstrating rigorous research, though some criticism arose over public test set results. The agent is noted as "extremely simple" and currently limited to 100 queries/month, with plans for a higher-rate version. Reception has been mostly positive, with some skepticism. Additionally, advances in reinforcement learning were highlighted, including a simple test-time scaling technique called budget forcing that improved reasoning on math competitions by 27%. Researchers from Google DeepMind, NYU, UC Berkeley, and HKU contributed to these findings. The original Gemini Deep Research team will participate in the upcoming AI Engineer NYC event.
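Budget forcing is simple enough to show in full: if the model emits its end-of-thinking marker before the token budget is spent, replace it with a continuation cue ("Wait" in the paper) so it keeps reasoning. A toy sketch with a scripted stand-in for the model; the token strings and the `</think>` marker are illustrative, not the paper's exact setup:

```python
def budget_force(generate, prompt, min_tokens, end_think="</think>", nudge="Wait"):
    """Keep sampling until at least min_tokens thinking tokens exist;
    suppress early end-of-thinking markers by swapping in a nudge token."""
    out = []
    while len(out) < min_tokens:
        tok = generate(prompt + "".join(out))
        if tok == end_think:
            out.append(nudge)  # force the model to keep thinking
        else:
            out.append(tok)
    return "".join(out)

# Scripted "model": thinks briefly, tries to stop, then rechecks.
script = iter(["2+2", "=4", "</think>", " recheck", "=4", "</think>"])
trace = budget_force(lambda ctx: next(script), "Q: 2+2?", min_tokens=5)
print(trace)  # the early stop marker is replaced by "Wait"
```

The reported ~27% math-competition gain comes from this kind of forced extra deliberation at inference time, with no change to the model's weights.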
not much happened today
qwen-o1 qvq claude-3.5-sonnet gpt-4o o3 o3-mini alibaba openai mit idsia llamaindex ollama vision benchmarking llm-calibration intentionality alignment-faking deliberative-alignment artificial-life gdpr-compliance contract-review-agent app-creation synthetic-data post-transformers smol-models agents bret-taylor
The Qwen team launched QVQ, a vision-enabled version of QwQ, their experimental o1-style reasoning model, benchmarking comparably to Claude 3.5 Sonnet. Discussions include Bret Taylor's insights on autonomous software development distinct from the Copilot era. The Latent Space LIVE! talks cover highlights of 2024 AI startups, vision, open models, post-transformers, synthetic data, smol models, and agents. Twitter recaps by Claude 3.5 Sonnet highlight proposals for benchmarks measuring LLM calibration and falsehood confidence, with QVQ outperforming GPT-4o and Claude Sonnet 3.5. AI alignment debates focus on intentionality and critiques of alignment faking in models like Claude. Updates from OpenAI include new o3 and o3-mini models and a deliberative alignment strategy. The ASAL project is a collaboration between MIT, OpenAI, and Swiss AI Lab IDSIA to automate artificial life discovery. Personal stories reveal frustrations with USCIS green card denials despite high qualifications. New tools like GeminiCoder enable rapid app creation, and a contract review agent using Reflex and Llama Index checks GDPR compliance. Holiday greetings and memes were also shared.
not much happened this weekend
o3 o1 opus sonnet octave openai langchain hume x-ai amd nvidia meta-ai-fair hugging-face inference-time-scaling model-ensembles small-models voice-cloning fine-math-dataset llm-agent-framework benchmarking software-stack large-concept-models latent-space-reasoning mechanistic-interpretability planning speech-language-models lisa-su clementdelangue philschmid neelnanda5
The o3 model gains significant attention with discussions around its capabilities and implications, including an OpenAI board member referencing "AGI." LangChain released their State of AI 2024 survey. Hume announced OCTAVE, a 3B parameter API-only speech-language model with voice cloning. xAI secured a $6B Series C funding round. Discussions highlight inference-time scaling, model ensembles, and the surprising generalization ability of small models. New tools and datasets include FineMath, the best open math dataset on Hugging Face, and frameworks for LLM agents. Industry updates cover a 5-month benchmarking of AMD MI300X vs Nvidia H100 + H200, insights from a meeting with Lisa Su on AMD's software stack, and open AI engineering roles. Research innovations include Large Concept Models (LCM) from Meta AI, Chain of Continuous Thought (Coconut) for latent space reasoning, and mechanistic interpretability initiatives.
o3 solves AIME, GPQA, Codeforces, makes 11 years of progress in ARC-AGI and 25% in FrontierMath
o3 o3-mini o1-mini gpt-3 gpt-4o o1 openai benchmarking math reasoning model-performance inference-speed cost-efficiency alignment safety-testing sama eric-wallace
OpenAI announced the o3 and o3-mini models with groundbreaking benchmark results, including a jump from 2% to 25% on the FrontierMath benchmark and 87.5% on the ARC-AGI reasoning benchmark, representing about 11 years of progress on the GPT-3 to GPT-4o scaling curve. The o1-mini model shows superior inference efficiency compared to full o3, promising significant cost reductions on coding tasks. The announcement was accompanied by community discussions, safety testing applications, and detailed analyses. Sam Altman highlighted the unusual cost-performance tradeoff, and Eric Wallace shared insights on the o-series deliberative alignment strategy.