AINews


by smol.ai

How over 80k top AI Engineers keep up, every weekday.


We summarize top AI discords + AI reddits + AI X/Twitters, and send you a roundup each day!

"Highest-leverage 45 mins I spend everyday" - Soumith

"best AI newsletter atm" and "I'm not sure that enough people subscribe" - Andrej

"genuinely incredible" - Chris

"surprisingly decent" - Hamel

You can pay for a customizable version here. Thanks to Pieter Levels for the Lex Fridman feature!

Last 30 days in AI

See all issues
  • Nov 14
    not much happened today
    gpt-5.1 sonnet-4.5 opus-4.1 gemini-3 openai anthropic langchain-ai google-deepmind adaptive-reasoning developer-tools prompt-optimization json-schema agent-workflows context-engineering structured-outputs model-release benchmarking swyx allisontam_ gdb sama alexalbert__ simonw omarsar0 abacaj scaling01 amandaaskell
    OpenAI launched GPT-5.1 featuring "adaptive reasoning" and developer-focused API improvements, including prompt caching and a reasoning_effort toggle for latency/cost tradeoffs. Independent analysis shows a minor intelligence bump with significant gains in agentic coding benchmarks. Anthropic's Claude models introduced structured outputs with JSON schema compliance in public beta for Sonnet 4.5 and Opus 4.1, enhancing tooling and code execution workflows. Rumors of an Opus 4.5 release were debunked. LangChain released a "Deep Agents" package and context-engineering playbook to optimize agent workflows. The community is eagerly anticipating Google DeepMind's Gemini 3 model, hinted at in social media and upcoming AIE CODE events. "Tickets are sold out, but side events and volunteering opportunities are available."
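    The reasoning_effort toggle above is a request-level knob for trading latency and cost against reasoning depth. As a rough sketch (a plain JSON payload rather than any official SDK call; the model id and field values here are assumptions), a low-effort request might look like:

    ```python
    # Illustrative sketch of a chat request using a reasoning_effort toggle.
    # This builds the payload only; it is not an official OpenAI SDK example.
    import json

    def build_request(prompt: str, effort: str = "low") -> str:
        """Build a hypothetical GPT-5.1 request body with a reasoning_effort field."""
        payload = {
            "model": "gpt-5.1",                      # hypothetical model id
            "messages": [{"role": "user", "content": prompt}],
            "reasoning_effort": effort,              # "low" favors latency, "high" favors depth
        }
        return json.dumps(payload)

    request_body = build_request("Summarize today's AI news.", effort="low")
    print(request_body)
    ```

    Prompt caching rewards keeping the stable prefix (system prompt, tools) identical across calls, so only the trailing user turn changes between requests.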
  • Nov 13
    minor updates to GPT 5.1 and SIMA 2
    gpt-5.1 gpt-5.1-codex gpt-5.1-codex-mini sima-2 gemini openai google-deepmind github microsoft cursor_ai perplexity-ai weaviate llamaindex adaptive-reasoning agentic-coding tool-use context-engineering memory-architecture self-improvement retrieval-augmentation database-query-planning chart-parsing robotics sama allisontam_ cline cognition demishassabis omarsar0 helloiamleonie
    OpenAI released GPT-5.1 family models including 5.1-Codex and 5.1-Codex-Mini with improved steerability, faster responses, and new tools like apply_patch and shell command execution. Pricing remains unchanged from 5.0. Immediate integrations include GitHub Copilot, VS Code, Cursor, and Perplexity adopting GPT-5.1 models. Google DeepMind announced SIMA 2, a Gemini-powered agent capable of language instruction following, planning, and self-improvement without human feedback, targeting robotics applications. New research on context engineering and agentic tool use patterns was published, with contributions from Weaviate and LlamaIndex on database query planning and chart parsing respectively. "Adaptive reasoning" and agentic coding improvements are highlighted in GPT-5.1 Instant.
  • Nov 12
    GPT 5.1 in ChatGPT: No evals, but adaptive thinking and instruction following
    gpt-5.1 gpt-5.0 claude isaac-0.1 qwen3vl-235b glm-4.6 gemini openai anthropic waymo perceptron langchain llamaindex nousresearch adaptive-reasoning instruction-following personalization autonomous-driving robotics multimodality agent-evaluation agent-governance middleware structured-extraction benchmarking dmitri_dolgov jeffdean fidji_simo akshats07
    OpenAI launched GPT-5.1 with improvements in conversational tone, instruction following, and adaptive reasoning. GPT-5.0 is being sunset in 3 months. ChatGPT introduces new tone toggles for personalization, serving over 800 million users. Waymo rolls out freeway driving for public riders in major California cities, showcasing advances in autonomous driving. Anthropic's Project Fetch explores LLMs as robotics copilots using Claude. Perceptron releases a new API and Python SDK for multimodal perception-action apps supporting Isaac-0.1 and Qwen3VL-235B. Code Arena offers live coding evaluations supporting Claude, GPT-5, GLM-4.6, and Gemini. LangChain introduces middleware for agent governance with human-in-the-loop controls. LlamaIndex releases a structured extraction template for SEC filings using LlamaAgents. NousResearch promotes ARC Prize benchmarks for generalized intelligence evaluation.
  • Nov 11
    not much happened today
    gpt-5 qwen2.5-7b ernie-4.5-vl-28b-a3b-thinking gemini-2.5-pro llamacloud claude-code openai baidu databricks llamaindex togethercompute sakanaailabs reasoning-benchmarks reinforcement-learning fine-tuning multimodality document-intelligence retrieval-augmented-generation agentic-systems persona-simulation code-agents guardrails sakanaailabs micahgoldblum francoisfleuret matei_zaharia jerryjliu0 omarsar0 togethercompute imjaredz theo
    GPT-5 leads Sudoku-Bench but solves only 33% of puzzles, highlighting persistent challenges in meta-reasoning and spatial logic. New training methods like GRPO fine-tuning and "Thought Cloning" show limited success. Research on "looped LLMs" suggests pretrained models benefit from repeated computation for better performance. Baidu's ERNIE-4.5-VL-28B-A3B-Thinking offers lightweight multimodal reasoning with Apache 2.0 licensing, outperforming Gemini-2.5-Pro and GPT-5-High on document tasks. Databricks ai_parse_document preview delivers cost-efficient document intelligence outperforming GPT-5 and Claude. Pathwork AI uses LlamaCloud for underwriting automation. Gemini File Search API enables agentic retrieval augmented generation (RAG) with MCP server integration. Together AI and Collinear launch TraitMix for persona-driven agent simulations integrated with Together Evals. Reports highlight risks in long-running code agents like Claude Code reverting changes, emphasizing guardrails. Community consensus favors multiple code copilots including Claude Code, Codex, and others.
  • Nov 10
    not much happened today
    kimi-k2-thinking kimi-k3 gelato-30b-a3b omnilingual-wav2vec-2.0 moonshot-ai meta-ai-fair togethercompute qwen attention-mechanisms quantization fine-tuning model-optimization agentic-ai speech-recognition multilingual-models gui-manipulation image-editing dataset-release yuchenj_uw scaling01 code_star omarsar0 kimi_moonshot anas_awadalla akhaliq minchoi
    Moonshot AI's Kimi K2 Thinking AMA revealed a hybrid attention stack using KDA + NoPE MLA outperforming full MLA + RoPE, with the Muon optimizer scaling to ~1T parameters and native INT4 QAT for cost-efficient inference. K2 Thinking ranks highly on LisanBench and LM Arena Text leaderboards, offering low-cost INT4 serving and strong performance in Math, Coding, and Creative Writing. It supports heavy agentic tool use with up to 300 tool requests per run and recommends using the official API for reliable long-trace inference. Meta AI released the Omnilingual ASR suite covering 1600+ languages including 500 underserved, plus a 7B wav2vec 2.0 model and ASR corpus. Additionally, the Gelato-30B-A3B model for computer grounding in GUI manipulation agents outperforms larger VLMs, targeting immediate agent gains. Qwen's image-edit LoRAs and light-restoration app were also highlighted.
  • Nov 07
    Terminal-Bench 2.0 and Harbor
    kimi-k2-thinking moonshot-ai anthropic hugging-face ollama slime-framework benchmarking agentic-ai quantization model-optimization inference model-deployment moe context-windows cost-efficiency clementdelangue dbreunig awnihannun crystalsssup kimi_moonshot
    Terminal-Bench has fixed task issues and launched version 2.0 with cloud container support via the Harbor framework, gaining recognition from models like Claude 4.5 and Kimi K2 Thinking. Moonshot AI's Kimi K2 Thinking is a 1 trillion parameter MoE reasoning model with ~32B active parameters, running natively in INT4 quantization and featuring a 256K context window. It leads open-weights benchmarks with an Artificial Analysis Intelligence Index score of 67 and strong agentic performance, running efficiently on consumer Apple silicon and 2× M3 Ultra hardware. The model is broadly available on Hugging Face, Ollama Cloud, and integrated into frameworks like slime. Serving bottlenecks were traced to network bandwidth rather than GPU limits, highlighting infrastructure considerations for LLM deployment.
  • Nov 06
    Kimi K2 Thinking: 1T-A32B params, SOTA HLE, BrowseComp, TauBench && Soumith leaves Pytorch
    kimi-k2-thinking gemini moonshot-ai google apple vllm_project arena baseten yupp_ai mixture-of-experts quantization int4 context-window agentic-ai benchmarking model-deployment inference-acceleration api performance-optimization eliebakouch nrehiew_ andrew_n_carr ofirpress artificialanlys sundarpichai akhaliq
    Moonshot AI launched Kimi K2 Thinking, a 1 trillion parameter mixture-of-experts (MoE) model with 32 billion active parameters, a 256K context window, and native INT4 quantization-aware training. It achieves state-of-the-art results on benchmarks like HLE (44.9%), BrowseComp (60.2%), and agentic tool use with 200-300 sequential tool calls. The model is deployed with vLLM support and OpenAI-compatible APIs, available on platforms like Arena, Baseten, and Yupp. Early user reports note some API instability under launch load. Meanwhile, Google announced the TPU v7 (Ironwood) with a 10× peak performance improvement over TPU v5p, aimed at training and agentic inference for models like Gemini. Apple added support for M5 Neural Accelerators in llama.cpp for inference acceleration.
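    Native INT4 serving means weights are stored as 4-bit signed integers plus a scale factor. A toy per-tensor symmetric scheme sketches the idea (illustrative only; this is not Moonshot's actual quantization-aware-training recipe, which bakes the quantizer into training):

    ```python
    # Minimal sketch of symmetric INT4 quantization: floats are mapped to
    # 4-bit signed integers in [-8, 7] with a single shared scale, cutting
    # weight storage to 4 bits per value at a bounded reconstruction error.

    def quantize_int4(weights):
        """Quantize a list of floats to INT4 codes plus one scale."""
        scale = max(abs(w) for w in weights) / 7 or 1.0
        codes = [max(-8, min(7, round(w / scale))) for w in weights]
        return codes, scale

    def dequantize_int4(codes, scale):
        """Recover approximate floats from INT4 codes."""
        return [c * scale for c in codes]

    weights = [0.91, -0.42, 0.05, -1.30, 0.77]
    codes, scale = quantize_int4(weights)
    restored = dequantize_int4(codes, scale)
    # Each restored value is within scale/2 of the original.
    ```

    Real deployments typically use per-channel or per-group scales and QAT so the model learns around this rounding, but the storage format is the same.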
  • Nov 05
    not much happened today
    kimi-k2 qwen3-next nemotron-nano-2 granite-4.0 gpt-4.5 copilot codex vllm perplexity-ai ibm anthropic graphiti claude cursor-ai microsoft mixture-of-experts model-integration cloud-computing hybrid-models benchmarking agent-systems memory-persistence semantic-search code-retrieval context-length-optimization tool-use evaluation-frameworks software-development scaling01 cedric_chee aravsrinivas omarsar0 _avichawla pierceboggan jo_parkhurst jyangballin ofirpress ml_angelopoulos
    Kimi-K2 Reasoner has been integrated into vLLM and will soon be supported by SGLang, featuring a massive 1.2 trillion parameter MoE configuration. Perplexity AI released research on cloud-portable trillion-parameter MoE kernels optimized for AWS EFA, with potential integration into vLLM. IBM's vLLM team formalized hybrid dense and sparse expert models, supporting models like Qwen3-Next, Nemotron Nano 2, and Granite 4.0. Kimi-K2 reportedly scores 77% on GPQA Diamond, outperforming GPT-4.5 at 71.4%, though this is unverified. Anthropic published a guide on efficient tool-heavy agent systems using MCP patterns, drastically reducing context tokens by ~98.7%. Graphiti MCP demonstrated shared memory across apps like Claude Desktop and Cursor for persistent agent memory. VS Code introduced an "Agent sessions" feature to unify agent management, including Copilot and Codex. Cursor AI improved coding accuracy via semantic search and code retrieval embeddings. New evaluation frameworks like CodeClash and LMArena assess agent and coding model performance in realistic multi-round tasks and occupation-tagged leaderboards.
  • Nov 04
    not much happened today
    trillium gemini-2.5-pro gemini-deepthink google huawei epoch-ai deutsche-telekom nvidia anthropic reka-ai weaviate deepmind energy-efficiency datacenters mcp context-engineering instruction-following embedding-models math-reasoning benchmarking code-execution sundarpichai yuchenj_uw teortaxestex epochairesearch scaling01 _avichawla rekaailabs anthropicai douwekiela omarsar0 nityeshaga goodside iscienceluvr lmthang
    Google's Project Suncatcher prototypes scalable ML compute systems in orbit using solar energy with Trillium-generation TPUs surviving radiation, aiming for prototype satellites by 2027. China's 50% electricity subsidies for datacenters may offset chip efficiency gaps, with Huawei planning gigawatt-scale SuperPoDs for DeepSeek by 2027. Epoch launched an open data center tracking hub, and Deutsche Telekom and NVIDIA announced a $1.1B Munich facility with 10k GPUs. In agent stacks, MCP (Model Context Protocol) tools gain traction with implementations like LitServe, Claude Desktop, and Reka's MCP server for VS Code. Anthropic emphasizes efficient code execution with MCP. Context engineering shifts focus from prompt writing to model input prioritization, with reports and tools from Weaviate, Anthropic, and practitioners highlighting instruction-following rerankers and embedding approaches. DeepMind's IMO-Bench math reasoning suite shows Gemini DeepThink achieving high scores, with a ProofAutoGrader correlating strongly with human grading. Benchmarks and governance updates include new tasks and eval sharing in lighteval.
  • Nov 03
    not much happened today
    qwen3-max-thinking minimax-m2 claude-3-sonnet llamaindex-light chronos-2 openai aws microsoft nvidia gpu_mode vllm alibaba arena llamaindex amazon anthropic gradio compute-deals gpu-optimization kernel-optimization local-serving reasoning long-context benchmarks long-term-memory time-series-forecasting agent-frameworks oauth-integration developer-tools sama gdb andrewcurran_ a1zhang m_sirovatka omarsar0 _philschmid
    OpenAI and AWS announced a strategic partnership involving a $38B compute deal to deploy hundreds of thousands of NVIDIA GB200 and GB300 chips, while Microsoft secured a license to ship NVIDIA GPUs to the UAE with a planned $7.9B datacenter investment. A 3-month NVFP4 kernel optimization competition on Blackwell B200s was launched by NVIDIA and GPU_MODE with prizes including DGX Spark and RTX 50XX GPUs. vLLM gains traction for local LLM serving, exemplified by PewDiePie's adoption. Alibaba previewed the Qwen3-Max-Thinking model hitting 100% on AIME 2025 and HMMT benchmarks, signaling advances in reasoning with tool use. The MIT-licensed MiniMax-M2 230B MoE model topped the Arena WebDev leaderboard, tying with Claude Sonnet 4.5 Thinking 32k. Critiques emerged on OSWorld benchmark stability and task validity. LlamaIndex's LIGHT framework demonstrated significant improvements in long-term memory tasks over raw context and RAG baselines, with gains up to +160.6% in summarization at 10M tokens. Amazon introduced Chronos-2, a time-series foundation model for zero-shot forecasting. The MCP ecosystem expanded with new tools like mcp2py OAuth integration and Gemini Docs MCP server, alongside a build sprint by Anthropic and Gradio offering substantial credits and prizes. "OSWorld doesn’t really exist—different prompt sets = incomparable scores" highlights benchmarking challenges.
  • Oct 31
    not much happened today
    Poolside raised $1B at a $12B valuation. Eric Zelikman raised $1B after leaving xAI. Weavy joined Figma. New research highlights FP16 precision reduces training-inference mismatch in reinforcement-learning fine-tuning compared to BF16. Kimi AI introduced a hybrid KDA (Kimi Delta Attention) architecture improving long-context throughput and RL stability, alongside a new Kimi CLI for coding with agent protocol support. OpenAI previewed Agent Mode in ChatGPT enabling autonomous research and planning during browsing.
  • Oct 30
    not much happened today
    kimi-linear kimi-delta-attention minimax-m2 looped-llms aardvark-gpt-5 moonshot-ai minimax bytedance princeton mila openai cursor cognition hkust long-context attention-mechanisms agentic-ai tool-use adaptive-compute coding-agents performance-optimization memory-optimization reinforcement-learning model-architecture kimi_moonshot scaling01 uniartisan omarsar0 aicodeking songlinyang4 iscienceluvr nrehiew_ gdb embeddedsec auchenberg simonw
    Moonshot AI released Kimi Linear (KDA) with day-0 infrastructure and strong long-context metrics, achieving up to 75% KV cache reduction and 6x decoding throughput. MiniMax M2 pivoted to full attention for multi-hop reasoning, maintaining strong agentic coding performance with 200k context and ~100 TPS. ByteDance, Princeton, and Mila introduced Looped LLMs showing efficiency gains comparable to larger transformers. OpenAI's Aardvark (GPT-5) entered private beta as an agentic security researcher for scalable vulnerability discovery. Cursor launched faster cloud coding agents, though transparency concerns arose regarding base-model provenance. Cognition released a public beta for a desktop/mobile tool-use agent named Devin. The community discussed advanced attention mechanisms and adaptive compute techniques.
  • Oct 29
    Cursor 2.0 & Composer-1: Fast Models and New Agents UI
    composer-1 gpt-oss-safeguard-20b gpt-oss-safeguard-120b gpt-oss gpt-5-mini cursor_ai openai huggingface ollama cerebras groq goodfireai rakuten agentic-coding reinforcement-learning mixture-of-experts fine-tuning policy-classification open-weight-models inference-stacks cost-efficiency multi-agent-systems ide voice-to-code code-review built-in-browser model-optimization sasha_rush dan_shipper samkottler ellev3n11 swyx
    Cursor 2.0 launched with Composer-1, an agentic coding model optimized for speed and precision, featuring multi-agent orchestration, built-in browser for testing, and voice-to-code capabilities. OpenAI released gpt-oss-safeguard models (20B, 120B) for policy-based safety classification, open-weight and fine-tuned from gpt-oss, available on Hugging Face and supported by inference stacks like Ollama and Cerebras. Goodfire and Rakuten demonstrated sparse autoencoders for PII detection matching gpt-5-mini accuracy at significantly lower cost. The Cursor 2.0 update also includes a redesigned interface for managing multiple AI coding agents, marking a major advancement in AI IDEs. "Fast-not-slowest" tradeoff emphasized by early users for Composer-1, enabling rapid iteration with human-in-the-loop.
  • Oct 28
    OpenAI completes Microsoft + For-profit restructuring + announces 2028 AI Researcher timeline + Platform / AI cloud product direction + next $1T of compute
    OpenAI has completed a major recapitalization and restructuring, forming a Public Benefit Corporation with a non-profit Foundation holding special voting rights and equity valued at $130B. Microsoft holds about 27% diluted ownership and committed to $250B in Azure spend, losing exclusivity on compute but retaining Azure API exclusivity until AGI is declared. The compute infrastructure deals for 2025 total 30GW worth $1.4T, with OpenAI aiming to build 1GW per week at $20B per GW, projecting $3-4 trillion infrastructure by 2033. The company is shifting focus from first-party apps to a platform approach, emphasizing ecosystem growth and third-party development. Sam Altman is the key figure in this transition, with significant financial and strategic implications for AI industry partnerships, including openness to Anthropic and Google Gemini on Azure.
  • Oct 27
    MiniMax M2 230BA10B — 8% of Claude Sonnet's price, ~2x faster, new SOTA open model
    minimax-m2 hailuo-ai huggingface baseten vllm modelscope openrouter cline sparse-moe model-benchmarking model-architecture instruction-following tool-use api-pricing model-deployment performance-evaluation full-attention qk-norm gqa rope reach_vb artificialanlys akhaliq eliebakouch grad62304977 yifan_zhang_ zpysky1125
    MiniMax M2, an open-weight sparse MoE model by Hailuo AI, launches with ≈200–230B parameters and 10B active parameters, offering strong performance near frontier closed models and ranking #5 overall on the Artificial Analysis Intelligence Index v3.0. It supports coding and agent tasks, is licensed under MIT, and is available via API at competitive pricing. The architecture uses full attention, QK-Norm, GQA, partial RoPE, and sigmoid routing, with day-0 support in vLLM and deployment on platforms like Hugging Face and Baseten. Despite verbosity and no tech report, it marks a significant win for open models.
  • Oct 24
    not much happened today
    nemotron-nano-2 gpt-oss-120b qwen3 llama-3 minimax-m2 glm-4.6-air gemini-2.5-flash gpt-5.1-mini tahoe-x1 vllm_project nvidia mistral-ai baseten huggingface thinking-machines deeplearningai pytorch arena yupp-ai zhipu-ai scaling01 stanford transformer-architecture model-optimization inference distributed-training multi-gpu-support performance-optimization agents observability model-evaluation reinforcement-learning model-provenance statistical-testing foundation-models cancer-biology model-fine-tuning swyx dvilasuero _lewtun clementdelangue zephyr_z9 skylermiao7 teortaxestex nalidoust
    vLLM announced support for NVIDIA Nemotron Nano 2, featuring a hybrid Transformer–Mamba design and tunable "thinking budget" enabling up to 6× faster token generation. Mistral AI Studio launched a production platform for agents with deep observability. Baseten reported high throughput (650 TPS) for GPT-OSS 120B on NVIDIA hardware. Hugging Face InspectAI added inference provider integration for cross-provider evaluation. Thinking Machines Tinker abstracts distributed fine-tuning for open-weight LLMs like Qwen3 and Llama 3. In China, MiniMax M2 shows competitive performance with top models and is optimized for agents and coding, while Zhipu GLM-4.6-Air focuses on reliability and scaling for coding tasks. Rumors suggest Gemini 2.5 Flash may be a >500B parameter MoE model, and a possible GPT-5.1 mini reference appeared. Outside LLMs, Tahoe-x1 (3B) foundation model achieved SOTA in cancer cell biology benchmarks. Research from Stanford introduces a method to detect model provenance via training-order "palimpsest" with strong statistical guarantees.
  • Oct 23
    not much happened today
    gemini-1.5-pro claude-3 chatgpt langchain meta-ai-fair hugging-face openrouter google-ai microsoft openai anthropic agent-ops observability multi-turn-evaluation reinforcement-learning distributed-training api model-stability user-intent-clustering software-development project-management code-generation hwchase17 ankush_gola11 whinthorn koylanai _lewtun bhutanisanyam1 thom_wolf danielhanchen cline canvrno pashmerepat mustafasuleyman yusuf_i_mehdi jordirib1 fidjissimo bradlightcap mikeyk alexalbert__
    LangSmith launched the Insights Agent with multi-turn evaluation for agent ops and observability, improving failure detection and user intent clustering. Meta PyTorch and Hugging Face introduced OpenEnv, a Gymnasium-style API and hub for reproducible agentic environments supporting distributed training. Discussions highlighted the importance of provider fidelity in agent coding, with OpenRouter's exacto filter improving stability. Builder UX updates include Google AI Studio's Annotation mode for Gemini code changes, Microsoft's Copilot Mode enhancements in Edge, and OpenAI's Shared Projects and Company Knowledge features for ChatGPT Business. Claude added project-scoped Memory. In reinforcement learning, Meta's ScaleRL proposes a methodology to predict RL scaling outcomes for LLMs with improved efficiency and stability.
  • Oct 22
    not much happened today
    vllm chatgpt-atlas langchain meta microsoft openai pytorch ray claude agent-frameworks reinforcement-learning distributed-computing inference-correctness serving-infrastructure browser-agents security middleware runtime-systems documentation hwchase17 soumithchintala masondrxy robertnishihara cryps1s yuchenj_uw
    LangChain & LangGraph 1.0 released with major updates for reliable, controllable agents and unified docs, emphasizing "Agent Engineering." Meta introduced PyTorch Monarch and TorchForge for distributed programming and reinforcement learning, enabling large-scale agentic systems. Microsoft Learn MCP server now integrates with tools like Claude Code and VS Code for instant doc querying, accelerating grounded agent workflows. vLLM improved inference correctness with token ID returns and batch-invariant inference, collaborating with Ray for orchestration in PyTorch Foundation. OpenAI launched ChatGPT Atlas, a browser agent with contextual Q&A and advanced safety features, though early users note maturity challenges and caution around credential access.
  • Oct 21
    ChatGPT Atlas: OpenAI's AI Browser
    gemini atlas openai google langchain ivp capitalg sapphire sequoia benchmark agent-mode browser-memory chromium finetuning moe lora agent-runtime observability software-development funding kevinweil bengoodger fidjissimo omarsar0 yuchenj_uw nickaturley raizamrtn hwchase17 bromann casper_hansen_ corbtt
    OpenAI launched the Chromium fork AI browser Atlas for macOS, featuring integrated Agent mode and browser memory with local login capabilities, aiming to surpass Google's Gemini in Chrome. The launch received mixed reactions regarding reliability and privacy. LangChain raised a $125M Series B at a $1.25B valuation, releasing v1.0 agent engineering stack with significant adoption including 85M+ OSS downloads/month and usage by ~35% of the Fortune 500. The ecosystem also saw updates like vLLM's MoE LoRA expert finetuning support.
  • Oct 20
    DeepSeek-OCR finds vision models can decode 10x more efficiently with ~97% accuracy of text-only, 33/200k pages/day/A100
    deepseek-ocr deepseek3b-moe-a570m veo-3.1 deepseek-ai google-deepmind krea ocr vision multimodality model-compression long-context model-architecture video-generation autoregressive-models model-efficiency precision-editing karpathy teortaxestex reach_vb _akhaliq eliebakouch vikhyatk demishassabis
    As ICCV 2025 begins, DeepSeek releases a novel DeepSeek-OCR 3B MoE vision-language model that compresses long text as visual context with high accuracy and efficiency, challenging traditional tokenization approaches. The model achieves ~97% decoding precision at <10× compression and processes up to ~33M pages/day on 20 A100-40G nodes, outperforming benchmarks like GOT-OCR2.0. Discussions highlight the potential for unlimited context windows and tokenization-free inputs, with contributions from @karpathy, @teortaxesTex, and others. In video generation, Google DeepMind's Veo 3.1 leads community benchmarks with advanced precision editing and scene blending, while Krea open-sources a 14B autoregressive video model enabling realtime long-form generation at ~11 FPS on a single B200 GPU.
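    A quick sanity check reconciling the headline's per-GPU rate with the summary's fleet-wide figure, assuming 8 A100s per node (an assumption; the node size is not stated here):

    ```python
    # ~33M pages/day across 20 A100-40G nodes; with an assumed 8 GPUs per
    # node, that is about 200k pages/day per A100, matching the headline.
    pages_per_day = 33_000_000
    nodes, gpus_per_node = 20, 8
    per_gpu = pages_per_day / (nodes * gpus_per_node)
    print(round(per_gpu))  # → 206250
    ```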
  • Oct 17
    The Karpathy-Dwarkesh Interview delays AGI timelines
    claude-haiku-4.5 gpt-5 arch-router-1.5b anthropic openai huggingface langchain llamaindex google epoch-ai reasoning long-context sampling benchmarking data-quality agent-frameworks modular-workflows ide-extensions model-routing graph-first-agents real-world-grounding karpathy aakaran31 du_yilun giffmana omarsar0 jeremyphoward claude_code mikeyk alexalbert__ clementdelangue jerryjliu0
    The Karpathy-Dwarkesh interview is the day's headline event, alongside significant discussions on reasoning improvements without reinforcement learning, with test-time sampling achieving GRPO-level performance. Critiques on context window marketing reveal effective limits near 64K tokens, with Claude Haiku 4.5 showing competitive reasoning speed. GPT-5 struggles with advanced math benchmarks, and data quality issues termed "Brain Rot" affect model reasoning and safety. In agent frameworks, Anthropic Skills enable modular coding workflows, OpenAI Codex IDE extensions enhance developer productivity, and HuggingChat Omni introduces meta-routing across 100+ open models using Arch-Router-1.5B. LangChain and LlamaIndex advance graph-first agent infrastructure, while Google Gemini integrates with Google Maps for real-world grounding.
  • Oct 16
    Claude Agent Skills - glorified AGENTS.md? or MCP killer?
    claude-4.5-haiku claude chatgpt huggingchat-omni anthropic openai microsoft perplexity-ai huggingface groq cerebras togethercompute agent-skills document-processing long-context reasoning multi-model-routing memory-management voice vision simonwillison alexalbert__ mustafasuleyman yusuf_i_mehdi aravsrinivas
    Anthropic achieves a rare feat with back-to-back AI news headlines featuring Claude's new Skills—a novel way to build specialized agents using Markdown files, scripts, and metadata to handle tasks like creating and reading PDFs, Docs, and PPTs. Simon Willison calls this a "bigger deal than MCP," predicting a "Cambrian explosion in Skills." Meanwhile, Anthropic launches Claude 4.5 Haiku with strong reasoning and long-context capabilities, priced competitively. Other updates include OpenAI's ChatGPT memory management improvements, Windows 11 Copilot voice and vision features, and HuggingChat Omni routing across 115 open-source models from 15 providers. These developments highlight advances in agent skills, document processing, long-context reasoning, and multi-model routing.
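    Since a Skill is just Markdown plus metadata, a minimal skill file might look like the sketch below (field names and layout are assumptions based on the description above, not the official spec; `build_pdf.py` is a hypothetical bundled script):

    ```markdown
    ---
    name: pdf-builder
    description: Create and read PDF documents from structured notes.
    ---

    # PDF Builder

    When the user asks for a PDF:
    1. Collect the content into Markdown sections.
    2. Run the bundled `build_pdf.py` script to render the document.
    3. Return the generated file with a one-line summary.
    ```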
  • Oct 15
    Claude Haiku 4.5
    claude-3.5-sonnet claude-3-haiku claude-3-haiku-4.5 gpt-5 gpt-4.1 gemma-2.5 gemma o3 anthropic google yale artificial-analysis shanghai-ai-lab model-performance fine-tuning reasoning agent-evaluation memory-optimization model-efficiency open-models cost-efficiency foundation-models agentic-workflows swyx sundarpichai osanseviero clementdelangue deredleritt3r azizishekoofeh vikhyatk mirrokni pdrmnvd akhaliq sayashk gne
    Anthropic released Claude Haiku 4.5, a model that is over 2x faster and 3x cheaper than Claude Sonnet 4.5, improving iteration speed and user experience significantly. Pricing comparisons highlight Haiku 4.5's competitive cost against models like GPT-5 and GLM-4.6. Google and Yale introduced the open-weight Cell2Sentence-Scale 27B (Gemma) model, which generated a novel, experimentally validated cancer hypothesis, with open-sourced weights for community use. Early evaluations show GPT-5 and o3 models outperform GPT-4.1 in agentic reasoning tasks, balancing cost and performance. Agent evaluation challenges and memory-based learning advances were also discussed, with contributions from Shanghai AI Lab and others. "Haiku 4.5 materially improves iteration speed and UX," and "Cell2Sentence-Scale yielded validated cancer hypothesis" were key highlights.
  • Oct 14
    not much happened today
    qwen3-vl-4b qwen3-vl-8b qwen2.5-vl-72b deepseek-v3.1 alibaba arena runway nvidia togethercompute ollama model-optimization fine-tuning inference-speed video-generation diffusion-models representation-learning local-ai speculative-decoding fp8-quantization context-windows karpathy
    Alibaba released compact dense Qwen3-VL models at 4B and 8B sizes with FP8 options, supporting up to 1M context and open vocabulary detection, rivaling larger models like Qwen2.5-VL-72B. Ecosystem support includes MLX-VLM, LM Studio, vLLM, Kaggle models, and Ollama Cloud. In video AI, Arena added Sora 2 models leading in video benchmarks, with Higgsfield Enhancer improving video quality. Runway launched domain-specific workflow apps for creative tasks. Research on Representation Autoencoders for DiTs (RAE-DiT) shows improved diffusion model performance. On local training, NVIDIA DGX Spark enables strong local fine-tuning, while Nanochat by Karpathy offers a minimal stack for training and inference. Together AI introduced ATLAS, a speculative decoding method achieving up to 4× faster inference on DeepSeek-V3.1. These developments highlight advances in efficient model deployment, video AI, local fine-tuning, and inference speed optimization.
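    ATLAS belongs to the speculative decoding family: a cheap draft model proposes several tokens, and the expensive target model verifies them in one pass, accepting the longest agreeing prefix. A toy greedy draft-and-verify loop sketches the mechanic (illustrative only; not the ATLAS algorithm itself, which additionally adapts the drafter to the workload):

    ```python
    # Toy speculative decoding: the draft proposes k tokens per step; the
    # target keeps the longest matching prefix, then emits one corrected
    # token. Both "models" here are deterministic lookup tables.

    def draft_next(token):          # cheap draft model (toy: mostly right)
        return {"a": "b", "b": "c", "c": "d", "d": "a"}.get(token, "a")

    def target_next(token):         # expensive target model (toy ground truth)
        return {"a": "b", "b": "c", "c": "x", "x": "a"}.get(token, "a")

    def speculative_decode(start, steps, k=3):
        out = [start]
        while len(out) < steps + 1:
            # Draft k tokens cheaply.
            t, proposal = out[-1], []
            for _ in range(k):
                t = draft_next(t)
                proposal.append(t)
            # Verify: accept the longest prefix the target agrees with.
            t = out[-1]
            for p in proposal:
                if target_next(t) == p and len(out) < steps + 1:
                    out.append(p)
                    t = p
                else:
                    break
            else:
                continue  # all k drafts accepted; no correction needed
            if len(out) < steps + 1:
                out.append(target_next(t))  # one corrected token per rejection
        return "".join(out)

    print(speculative_decode("a", steps=5))  # → "abcxab"
    ```

    The output matches plain greedy decoding with the target model, but multiple tokens can be accepted per expensive target step wherever the draft agrees, which is where the speedup comes from.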
  • Oct 13
    OpenAI Titan XPU: 10GW of self-designed chips with Broadcom
    llama-3-70b openai nvidia amd broadcom inferencemax asic inference compute-infrastructure chip-design fp8 reinforcement-learning ambient-agents custom-accelerators energy-consumption podcast gdb
    OpenAI is finalizing a custom ASIC chip design to deploy 10GW of inference compute, complementing existing deals with NVIDIA (10GW) and AMD (6GW). This marks a significant scale-up from OpenAI's current 2GW compute, aiming for a roadmap of 250GW total, roughly half of average US electricity consumption. OpenAI's Greg Brockman highlights the shift of ChatGPT from interactive use to always-on ambient agents requiring massive compute, emphasizing the challenge of building chips for billions of users. The in-house ASIC effort was driven by the need for tailored designs after limited success influencing external chip startups. Broadcom's stock surged 10% on the news. Additionally, InferenceMAX reports improved ROCm stability and nuanced performance comparisons between AMD MI300X and NVIDIA H100/H200 on llama-3-70b FP8 workloads, with RL training infrastructure updates noted.
  • Oct 10
    not much happened today
    gpt-5-pro gemini-2.5 vllm deepseek-v3.1 openai google-deepmind microsoft epoch-ai-research togethercompute nvidia mila reasoning reinforcement-learning inference speculative-decoding sparse-attention kv-cache-management throughput-optimization compute-efficiency tokenization epochairesearch yitayml _philschmid jiqizhixin cvenhoff00 neelnanda5 lateinteraction mgoin_ blackhc teortaxestex
    FrontierMath Tier 4 results show GPT-5 Pro narrowly outperforming Gemini 2.5 Deep Think in reasoning accuracy, with concerns about problem leakage clarified by Epoch AI Research. Mila and Microsoft propose Markovian Thinking to improve reasoning efficiency, enabling models to reason over 24K tokens with less compute. New research suggests base models inherently contain reasoning mechanisms, with "thinking models" learning to invoke them effectively. In systems, NVIDIA Blackwell combined with vLLM wins InferenceMAX with significant throughput gains, while Together AI's ATLAS adaptive speculative decoding achieves 4× speed improvements and reduces RL training time by over 60%. SparseServe introduces dynamic sparse attention with KV tiering, drastically improving throughput and reducing latency under GPU memory constraints.
  • Oct 09
    Air Street's State of AI 2025 Report
    glm-4.6 jamba-1.5 rnd1 claude-code reflection mastra datacurve spellbook kernel figure softbank abb radicalnumerics zhipu-ai ai21-labs anthropic humanoid-robots mixture-of-experts diffusion-models open-weight-models reinforcement-learning benchmarking small-language-models plugin-systems developer-tools agent-stacks adcock_brett achowdhery clementdelangue
    Reflection raised $2B to build frontier open-weight models with a focus on safety and evaluation, led by a team with backgrounds from AlphaGo, PaLM, and Gemini. Figure launched its next-gen humanoid robot, Figure 03, emphasizing non-teleoperated capabilities for home and large-scale use. Radical Numerics released RND1, a 30B-parameter sparse MoE diffusion language model with open weights and code to advance diffusion LM research. Zhipu posted strong results with GLM-4.6 on the Design Arena benchmark, while AI21 Labs' Jamba Reasoning 3B leads tiny reasoning models. Anthropic introduced a plugin system for Claude Code to enhance developer tools and agent stacks. The report also highlights SoftBank's acquisition of ABB's robotics unit for $5.4B and the growing ecosystem around open frontier modeling and small-model reasoning.
  • Oct 08
    not much happened today
    7m-tiny-recursive-model jamba-reasoning-3b qwen3-omni qwen-image-edit-2509 colbert-nano agentflow samsung lecuun ai21-labs alibaba coreweave weights-biases openpipe stanford recursive-reasoning density-estimation multimodality long-context retrieval serverless-reinforcement-learning agentic-systems model-efficiency reinforcement-learning transformers rasbt jm_alexia jiqizhixin randall_balestr corbtt shawnup _akhaliq
    Samsung's 7M-parameter Tiny Recursive Model (TRM) achieves superior reasoning on ARC-AGI and Sudoku with fewer layers and an MLP in place of self-attention. LeCun's team introduces JEPA-SCORE, enabling density estimation from encoders without retraining. AI21 Labs releases Jamba Reasoning 3B, a fast hybrid SSM-Transformer model supporting up to 64K context tokens. Alibaba's Qwen3 Omni/Omni Realtime offers a unified audio-video-text model with extensive language and speech support, outperforming Gemini 2.0 Flash on BigBench Audio. Alibaba also debuts Qwen Image Edit 2509, a top open-weight multi-image editing model. ColBERT Nano models demonstrate effective retrieval at micro-scale parameter sizes. In reinforcement learning, CoreWeave, Weights & Biases, and OpenPipe launch serverless RL infrastructure that reduces costs and speeds training. Stanford's AgentFlow presents an in-the-flow RL system whose 7B backbone outperforms larger models on agentic tasks. This update highlights advances in recursive reasoning, density estimation, multimodal architectures, long-context modeling, retrieval, and serverless reinforcement learning.
  • Oct 07
    Gemini 2.5 Computer Use preview beats Sonnet 4.5 and OAI CUA
    gemini-2.5 gpt-5-pro glm-4.6 codex google-deepmind openai microsoft anthropic zhipu-ai llamaindex mongodb agent-frameworks program-synthesis security multi-agent-systems computer-use-models open-source moe developer-tools workflow-automation api vision reasoning swyx demishassabis philschmid assaf_elovic hwchase17 jerryjliu0 skirano fabianstelzer blackhc andrewyng
    Google DeepMind released a new Gemini 2.5 Computer Use model for browser and Android UI control, evaluated by Browserbase. OpenAI showcased GPT-5 Pro, new developer tools including Codex with Slack integration, and agent-building SDKs at Dev Day. Google DeepMind's CodeMender automates security patching for large codebases. Microsoft introduced an open-source Agent Framework for multi-agent enterprise systems. Zhipu's GLM-4.6 update features a large Mixture-of-Experts model with 355B parameters. AI community discussions highlight advances in agent orchestration, program synthesis, and UI control.
  • Oct 06
    OpenAI Dev Day: Apps SDK, AgentKit, Codex GA, GPT‑5 Pro and Sora 2 APIs
    gpt-5-pro gpt-realtime-mini-2025-10-06 gpt-audio-mini-2025-10-06 gpt-image-1-mini sora-2 sora-2-pro openai canva figma zillow coursera api model-release fine-tuning agentic-ai code-generation model-deployment pricing prompt-optimization software-development multimodality sama edwinarbus gdb dbreunig stevenheidel
    OpenAI showcased major product launches at their Dev Day, including the Apps SDK, AgentKit, and Codex, now generally available with SDK and enterprise features. They introduced new models such as gpt-5-pro, gpt-realtime-mini-2025-10-06, gpt-audio-mini-2025-10-06, gpt-image-1-mini, and sora-2 with a pro variant. The Apps SDK enables embedding interactive apps inside ChatGPT with partners like Canva, Figma, Zillow, and Coursera. AgentKit offers a full stack for building and deploying production agents with tools like ChatKit and Guardrails. Codex supports speech- and controller-driven coding and is credited with high internal shipping velocity. Pricing for GPT-5 Pro was revealed at $15 input and $120 output per million tokens. "OpenAI turned ChatGPT into an application platform" and "AgentKit built a working agent in under 8 minutes" were highlights.
See all issues

Let's Connect

If you want to get in touch with me about something or just to say hi, reach out on social media or send me an email.

  • GitHub /
  • X (@smol_ai) /
  • swyx at smol dot ai
© 2025 • AINews
You can also subscribe by RSS.