AINews

by smol.ai

How over 80k top AI Engineers keep up, every weekday.


We summarize top AI discords + AI reddits + AI X/Twitters, and send you a roundup each day!

"Highest-leverage 45 mins I spend every day" - Soumith

"best AI newsletter atm" and "I'm not sure that enough people subscribe" - Andrej

"genuinely incredible" - Chris

"surprisingly decent" - Hamel

You can pay for a customizable version here. Thanks to Pieter Levels for the Lex Fridman feature!

Last 30 days in AI

See all issues
  • Dec 18
    Claude Skills grows: Open Standard, Directory, Org Admin
    claude-skills gpt-5.2-codex gemini-3-flash functiongemma t5gemma-2 anthropic openai google-deepmind hugging-face agentic-ai fine-tuning long-context tool-calling on-device-ai multimodality security workflow-optimization sama gregbrockman philschmid
    Claude Skills have gained significant traction since their October launch, with the Claude Skills talk passing 100k views in a single day, signaling growing adoption and importance. Announcements include org admin support, a new Skills Directory, and the move to an open standard named Agent Skills. Among frontier model launches, OpenAI released GPT-5.2-Codex, touted as the best agentic coding model, with improvements in native compaction, long-context reliability, and tool calling, and an emphasis on real-world security impact. Google DeepMind introduced Gemini 3 Flash, positioning speed as a product feature that shapes workflows and user engagement, alongside FunctionGemma and T5Gemma 2, which emphasize on-device deployment, fine-tuning, and multimodality.
  • Dec 17
    Gemini 3.0 Flash Preview: 1/4 cost of Pro, but ~as smart, retakes Pareto Frontier
    gemini-3-flash gemini-3 gpt-5.2 gemini-3-pro google google-deepmind tool-calling multimodality benchmarking reasoning cost-efficiency model-performance context-window agentic-ai model-deployment sundar_pichai jeffdean demishassabis
    Google launched Gemini 3 Flash, a pro-grade reasoning model with Flash-level latency, supporting tool calling and multimodal IO and available via multiple platforms including Google AI Studio and Vertex AI. It offers competitive pricing at $0.50 per 1M input tokens and $3.00 per 1M output tokens, with context windows up to 1M tokens. Benchmarks show Gemini 3 Flash rivals or outperforms larger models like GPT-5.2 and Gemini 3 Pro on agentic, coding, and reasoning tasks, validated by ARC-AGI-2, SWE-bench, and LMArena results. Despite tradeoffs like high token use and hallucination rates, it is cost-effective overall. Sundar Pichai, Jeff Dean, and Demis Hassabis publicly celebrated the launch, and the model's tool-calling capabilities were demonstrated with 100 tools in a live demo.
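A quick back-of-envelope check of the Gemini 3 Flash rates quoted above ($0.50 per 1M input tokens, $3.00 per 1M output tokens); the request sizes in the example are hypothetical, only the per-token rates come from the announcement:

```python
# Per-token rates from the quoted Gemini 3 Flash pricing.
INPUT_RATE = 0.50 / 1_000_000   # USD per input token
OUTPUT_RATE = 3.00 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the quoted rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a 20k-token prompt with a 2k-token reply (illustrative sizes)
print(f"${request_cost(20_000, 2_000):.4f}")  # $0.0160
```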
  • Dec 16
    OpenAI GPT Image-1.5 claims to beat Nano Banana Pro, #1 across all Arenas, but completely fails Vibe Checks
    gpt-image-1.5 nano-banana-pro mimo-v2-flash deepseek-v3.2 openai gemini xiaomi lmsys deepseek openrouter image-generation instruction-following benchmarking model-efficiency long-context multi-token-prediction hybrid-attention model-optimization inference-speed agentic-workflows model-architecture model-quantization fuli_luo eliebakouch
    OpenAI released its new image model GPT Image 1.5, featuring precise image editing, better instruction following, improved text and markdown rendering, and generation up to 4× faster. Despite topping multiple leaderboards like LMArena (1277), Design Arena (1344), and AA Arena (1272), user feedback from Twitter, Reddit, and Discord communities is largely negative compared to Gemini's Nano Banana Pro. Xiaomi introduced MiMo-V2-Flash, a 309B MoE model optimized for inference efficiency with a 256K context window, achieving state-of-the-art scores on SWE-Bench. The model uses Hybrid Sliding Window Attention and multi-token prediction, offering significant speedups and efficiency improvements. The timing of OpenAI's launch amid competition from Gemini and Nano Banana Pro colors user sentiment, highlighting challenges in benchmark relevance.
  • Dec 15
    NVIDIA Nemotron 3: hybrid Mamba-Transformer completely open source models from 30B to 500B
    nemotron-3-nano qwen3-30b-a3b-base nvidia huggingface togethercompute baseten vllm llamaindex hybrid-architecture mixture-of-experts reinforcement-learning long-context model-release open-source-models model-training model-optimization benchmarking agent-training ctnzr andrew_n_carr awnihannun
    NVIDIA has released Nemotron 3 Nano, a fully open-source hybrid Mamba-Transformer Mixture-of-Experts (MoE) model with 30B parameters and a 1M-token context window. It includes open weights, training recipes, datasets, and an RL environment suite called NeMo Gym, supporting commercial use under the NVIDIA Open Model License. The model achieves state-of-the-art results on benchmarks like SWE-Bench and the Artificial Analysis Intelligence Index, outperforming Qwen3-30B A3B. Ecosystem support is immediate, with integrations into inference stacks like vLLM, llama.cpp, and Baseten. Upcoming larger models, Nemotron Super and Ultra, will feature NVFP4 pretraining and LatentMoE routing to optimize compute. This release marks a significant milestone for open-source American AI, with comprehensive open assets and an advanced hybrid architecture.
  • Dec 12
    not much happened today
    gpt-5.2 opus-4.5 gemini-3-pro gpt-5.1 olmo-3.1-32b qwen3-vl-235b openai allen_ai mistral-ai ollama lmstudio thinkymachines reinforcement-learning model-benchmarking long-context model-quantization model-optimization inference-speed sparsity fine-tuning vision sama scaling01 akhaliq artificialanlys lechmazur acerfur epochairesearch
    GPT-5.2 shows mixed performance in public evaluations, excelling in agentic tasks but at a significantly higher cost (~$620/run) compared to Opus 4.5 and GPT-5.1. It performs variably on reasoning and coding benchmarks, with some improvements on long-context tasks. Extended "reasoning effort" settings notably impact results. Aggregators rank Gemini 3 Pro above GPT-5.2 in task persistence. OpenAI released sparse activation models sparking debate on sparsity vs MoE architectures. Allen AI's Olmo 3.1 (32B) advances open reinforcement learning scale with substantial compute investment (~125k H100 hours). Mistral's Devstral-2 and llama.cpp improve local inference infrastructure with new features like GGUF support and distributed speedups. Tinker platform goes GA with vision input and finetuning support for Qwen3-VL-235B.
  • Dec 11
    GPT-5.2 (Instant/Thinking/Pro): 74% on GDPVal, 1.4x cost of GPT 5.1, on 10 Year OpenAI Anniversary
    gpt-5.2 openai scientific-reasoning knowledge-work long-context benchmarking performance-optimization pricing software-engineering vision sama yanndubs polynoamial scaling01
    OpenAI celebrates its 10-year anniversary with the launch of GPT-5.2, featuring significant across-the-board improvements alongside a rare 40% price increase. GPT-5.2 shows strong performance gains in scientific reasoning, knowledge work, and economic value tasks, achieving over 70.9% human expert parity on GDPval tasks and reaching 90.5% on ARC-AGI-1 with a large efficiency gain. Despite some mixed results in coding benchmarks and vision capabilities, GPT-5.2 is well received as a major update with extended context and tiered reasoning controls. Pricing is set at $1.75/M input and $14/M output tokens with a 90% cache discount. The update is live in ChatGPT and the API, marking a significant milestone for OpenAI's LLM development.
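A sketch of what the GPT-5.2 pricing quoted above works out to per request ($1.75/M input, $14/M output, 90% cache discount). It assumes the discount applies to cached input tokens, i.e. they bill at 10% of the input rate; the token counts are hypothetical:

```python
# Rates from the quoted GPT-5.2 pricing; cache behavior is an assumption.
INPUT = 1.75 / 1_000_000    # USD per fresh input token
OUTPUT = 14.00 / 1_000_000  # USD per output token
CACHED = INPUT * 0.10       # assumed: cache hits bill at 10% of input rate

def request_cost(fresh_in: int, cached_in: int, out: int) -> float:
    """Dollar cost of one request, splitting input into fresh vs cached tokens."""
    return fresh_in * INPUT + cached_in * CACHED + out * OUTPUT

# e.g. a 100k-token context with 90k served from cache, plus a 5k-token answer
print(f"${request_cost(10_000, 90_000, 5_000):.4f}")
```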
  • Dec 10
    not much happened today
    nomos-1 axiomprover devstral-2-small deepseek-v3.2 claude-code cursor-2.2 claude-opus-4.5 gpt-5 claude-sonnet-4.5 gemini-3-pro llama qwen mistral gemma nousresearch thinkymachines mistral-ai deepseek anthropic cursor microsoft langchain-ai openai gemini intel vllm_project danielhanchen math formal-reasoning agentic-systems asynchronous-execution multi-agent-systems observability benchmarking quantization post-training-quantization training-speedup kernel-optimization inference-efficiency
    NousResearch's Nomos 1 is a 30B open math model achieving a top Putnam score with only ~3B active parameters, enabling consumer Mac inference. AxiomProver also posts top Putnam results using ThinkyMachines' RL stack. Mistral's Devstral 2 Small outperforms DeepSeek v3.2 in 71% of preferences with better speed and cost. Anthropic's Claude Code introduces asynchronous agent execution. Cursor 2.2 adds deep agent primitives like Debug and Plan Modes. VS Code launches unified agent chat sessions improving multi-agent workflows. LangChain releases "Polly" for agent observability. The Stirrup harness leads OpenAI GDPval benchmarks with Claude Opus 4.5, GPT-5, and Gemini 3 Pro following. Advances in quantization include vLLM integrating Intel's AutoRound PTQ for efficient serving. Unsloth achieves up to 3× training speedups with new kernels across Llama, Qwen, Mistral, and Gemma models. "Compositional reasoning + specialized post-training under constrained active params can rival frontier closed models on formal math."
  • Dec 09
    MCP -> Agentic AI Foundation, Mistral Devstral 2
    devstral-2 devstral-small-2 sonnet-4.3 deepseek-v3.2 qwen3-vl openai anthropic block mistral-ai alibaba linux-foundation deepseek agentic-ai coding-models reinforcement-learning model-performance model-optimization open-weights cli-tools multi-file-code-automation data-decontamination moe reward-models rl-stability guillaumelample b_roziere qtnx_ charliermarsh omarsar0 eliebakouch justinwaugh cwolferesearch pan
    The launch of the Agentic AI Foundation under the Linux Foundation marks a significant collaborative milestone, uniting projects from Anthropic, OpenAI, and Block. Mistral released Devstral 2, a coding model with 123B parameters and open weights, offering a cost-effective alternative to Sonnet 4.3 and competitive performance against DeepSeek v3.2. The new Mistral Vibe CLI supports agentic coding workflows with rapid ecosystem integration. Alibaba introduced Soft Adaptive Policy Optimization (SAPO) for reinforcement learning tuning, improving stability and performance in Qwen3-VL across multiple tasks. Research highlights include the importance of data decontamination in RL and ongoing discussions on MoE RL stability and reward hacking mitigation.
  • Dec 08
    not much happened today
    glm-4.6v glm-4.6v-flash jina-vlm-2b hugging-face zhipu-ai jina-ai google-deepmind axiomprover fine-tuning multimodality model-optimization long-context mechanistic-interpretability formal-methods sequence-architectures reinforcement-learning lioronai akshay_pachaar _akhaliq ben_burtenshaw vllm_project prince_canuma zenmuxai eliebakouch theturingpost axiommathai neelnanda5 sarahookr
    Claude Code Skills gains attention with a published talk and Hugging Face's new "skill" enabling one-line fine-tuning pipelines for models from ~0.5B to 70B parameters, supporting SFT, DPO, and GRPO, costing as low as ~$0.30 for small runs. Zhipu AI launches multimodal models GLM-4.6V (106B params MoE) and GLM-4.6V-Flash (9B dense), featuring 128k context and native multimodal function calling, with free Flash variant and API pricing detailed. Jina AI releases Jina-VLM (2B), a compact multilingual VLM excelling in diagrams and documents with top benchmark scores. At NeurIPS 2025, research highlights include Google's post-Transformer sequence architectures (Moneta, Yaad, Memora) showing up to 20% gains in long-context retrieval, AxiomProver's autonomous Lean system solving 9/12 Putnam 2025 problems rapidly, and mechanistic interpretability advances discussed by Chris Olah emphasizing scalable tooling.
  • Dec 05
    not much happened today
    vllm-0.12.0 gemma3n qwen3-omni qwen3-vl gpt-5.1-codex-max gemini-3-pro runway-gen-4.5 kling-video-2.6 vllm nvidia huggingface langchain-ai together-ai meta-ai-fair sonarsource openrouter runway gemini arena gpu-programming quantization multimodality agent-platforms reinforcement-learning static-analysis reasoning inference-infrastructure model-optimization economics audio video-generation jeremyphoward mervenoyann sydneyrunkle swyx maximelabonne
    vLLM 0.12.0 introduces DeepSeek support, GPU Model Runner V2, and quantization improvements with PyTorch 2.9.0 and CUDA 12.9. NVIDIA launches CUDA Tile IR and cuTile Python for advanced GPU tensor operations targeting Blackwell GPUs. Hugging Face releases Transformers v5 RC with an any-to-any multimodal pipeline supporting models like Gemma3n and Qwen3-Omni. Agent platforms see updates from LangChain with content moderation and cost tracking, Together AI and Meta AI collaborate on RL for long-horizon workflows, and SonarSource integrates static analysis into AI codegen. Economic insights from OpenRouter highlight coding as a key AI application, with reasoning models surpassing 50% usage and market bifurcation between premium and open models. Additionally, Kling Video 2.6 debuts native audio capabilities, and Runway Gen-4.5, Qwen3-TTS, and Gemini 3 Pro advance multimodality.
  • Dec 04
    OpenRouter's State of AI - An Empirical 100 Trillion Token Study
    grok-code-fast gemini-3 gemini-3-deep-think gpt-5.1-codex-max openrouter deepseek anthropic google google-deepmind reasoning coding tokenization long-context model-architecture benchmarking agentic-ai prompt-engineering quocleix noamshazeer mirrokni
    OpenRouter released its first survey showing usage trends with 7 trillion tokens proxied weekly, highlighting a 52% roleplay bias. Deepseek's open model market share has sharply declined due to rising coding model usage. Reasoning model token usage surged from 0% to over 50%. Grok Code Fast shows high usage, while Anthropic leads in tool calling and coding requests with around 60% share. Input tokens quadrupled and output tokens tripled this year, driven mainly by programming use cases, which dominate spending and volume. Google launched Gemini 3 Deep Think, featuring parallel thinking and achieving 45.1% on ARC-AGI-2 benchmarks, and previewed Titans, a long-context neural memory architecture scaling beyond 2 million tokens. These advances were shared by Google DeepMind and Google AI on Twitter.
  • Dec 03
    not much happened today
    kling-2.6 kling-o1 runway-gen-4.5 gemini-3 deepseek-v3.2 ministral-3 evoqwen2.5-vl hermes-4.3 intellect-3 openai anthropic google runway elevenlabs freepik openart deepseek mistral-ai alibaba nous-research video-generation audio-processing multimodality image-generation reasoning model-quantization sparse-attention model-pricing multimodal-models retrieval-augmentation model-training model-release
    OpenAI's Code Red response and Anthropic's IPO are major highlights. In AI video and imaging, Kling 2.6 introduces native audio co-generation with coherent lip-sync, partnered with platforms like ElevenLabs and OpenArt. Runway Gen-4.5 enhances lighting fidelity, while Google's Gemini 3 Nano Banana Pro supports advanced image compositing. Open model releases include DeepSeek V3.2 with sparse attention and cost-effective pricing, and Mistral's Ministral 3 multimodal family with strong 14B variants. Retrieval and code models from Alibaba's EvoQwen2.5-VL and Nous Research's Hermes 4.3 show competitive performance with permissive licensing and HF availability. The community arena sees additions like INTELLECT-3 (106B MoE). "coherent looking & sounding output" and "auto-lighting to match scene mood" are noted advancements.
  • Dec 02
    DeepSeek V3.2 & 3.2-Speciale: GPT5-High Open Weights, Context Management, Plans for Compute Scaling
    deepseek-v3.2 deepseek-v3.2-speciale gpt-5-high sonnet-4.5 gemini-3-pro deepseek_ai lm-arena agentic-ai reinforcement-learning large-context-windows model-benchmarking model-performance multi-agent-systems model-training model-deployment suchenzang teortaxestex
    DeepSeek launched the DeepSeek V3.2 family, including Standard, Thinking, and Speciale variants with up to a 131K context window and competitive benchmarks against GPT-5-High, Sonnet 4.5, and Gemini 3 Pro. The release features a novel Large Scale Agentic Task Synthesis Pipeline focused on agentic behaviors, plus improvements in reinforcement learning post-training algorithms. The models are available on platforms like LM Arena with pricing around $0.28/$0.42 per million tokens. Community feedback is mixed, praising the frontier reasoning capabilities but critiquing the chat UI experience. Key figures include Susan Zhang and Teortaxes, who provided commentary on the release.
  • Dec 02
    Mistral 3: Mistral Large 3 + Ministral 3B/8B/14B open weights models
    mistral-large-3 ministral-3 clara-7b-instruct gen-4.5 claude-code mistral-ai anthropic apple runway moondream sparse-moe multimodality benchmarking open-source model-licensing model-performance long-context inference-optimization instruction-following local-inference code-generation model-integration anjney_midha _akhaliq alexalbert__ _catwu mikeyk
    Mistral has launched the Mistral 3 family including Ministral 3 models (3B/8B/14B) and Mistral Large 3, a sparse MoE model with 675B total parameters and 256k context window, all under an Apache 2.0 open license. Early benchmarks rank Mistral Large 3 at #6 among open models with strong coding performance. The launch includes broad ecosystem support such as vLLM, llama.cpp, Ollama, and LM Studio integrations. Meanwhile, Anthropic acquired the open-source Bun runtime to accelerate Claude Code, which reportedly reached a $1B run-rate in ~6 months. Anthropic also announced discounted Claude plans for nonprofits and shared insights on AI's impact on work internally.
  • Nov 26
    not much happened today
    claude-opus-4.5 qwen-3-4b qwen-3-8b qwen-3-14b deepseek-r1 anthropic booking.com perplexity-ai langchain claude scaling01 deepseek qwen prefect agent-systems multi-agent-systems reasoning benchmarking cost-efficiency model-optimization long-context memory-management reinforcement-learning model-performance multi-agent-communication latent-representation inference-cost software-integration jeremyphoward alexalbert__ omarsar0 lingyang_pu dair_ai
    Anthropic introduces durable agents and MCP tasks for long-running workflows, with practical engineering patterns and integrations like Prefect. Booking.com deploys a large-scale agent system improving customer satisfaction using LangGraph, Kubernetes, GPT-4 Mini, and Weaviate. Perplexity rolls out user-level memory and virtual try-on features. Claude Opus 4.5 leads on LisanBench and Code Arena WebDev benchmarks with mixed community feedback on its "thinking" and "non-thinking" modes, while improving cost-efficiency and UX with batch APIs and context compaction. Research on multi-agent systems shows LatentMAS reduces communication tokens by 70-84% and improves accuracy using Qwen3 models, and reasoning trace distillation achieves significant token reduction with maintained accuracy, highlighting the importance of reasoning trace style.
  • Nov 25
    Black Forest Labs FLUX.2 [pro|flex|dev|klein]: near-Nano Banana quality but Open Weights
    flux-2 flux-2-dev claude-opus-4.5 gpt-5.1 gemini-3-pro black-forest-labs anthropic huggingface multi-reference-support variational-autoencoder image-generation open-weights agentic-coding token-efficiency benchmarking prompting model-performance
    Black Forest Labs' FLUX.2 release features Multi-Reference Support for up to 10 reference images with consistency and output up to 4 megapixels, in four form factors: Pro, Flex, Dev (a 32B open-weight model), and Klein (open weights TBA). The new FLUX.2 VAE introduces a variational autoencoder optimizing learnability, quality, and compression. Meanwhile, Anthropic's Claude Opus 4.5 demonstrates strong performance and efficiency, scoring 70 on Artificial Analysis, tying with GPT-5.1 high and trailing Gemini 3 Pro (73). Opus 4.5 excels in agentic coding benchmarks and research evaluations, with notable token efficiency and reduced running costs. "Opus 4.5 leads Gemini 3 Pro on SWE-Bench Verified and tops the AICodeKing leaderboard," and it shows strong QA and systematic review capabilities. Anthropic also released a dense prompting guide for Opus 4.5.
  • Nov 24
    Claude Opus 4.5: 3rd new SOTA coding model in past week, 1/3 the price of Opus
    claude-opus-4.5 gemini-3-pro gpt-5.1-codex-max opus-4.1 sonnet-4.5 anthropic amazon google anthropic coding agents tool-use token-efficiency benchmarking api model-pricing model-performance effort-control context-compaction programmatic-tool-calling alexalbert__ btibor91 scaling01 klieret
    Anthropic launched Claude Opus 4.5, a new flagship model excelling in coding, agents, and tooling with a significant 3x price cut compared to Opus 4.1 and improved token efficiency using 76% fewer output tokens. Opus 4.5 achieved a new SOTA on SWE-bench Verified with 80.9% accuracy, surpassing previous models like Gemini 3 Pro and GPT-5.1-Codex-Max. The update includes advanced API features such as effort control, context compaction, and programmatic tool calling, improving tool accuracy and reducing token usage. Claude Code is now bundled with Claude Desktop, and new integrations like Claude for Chrome and Excel are rolling out. Benchmarks show Opus 4.5 breaking the 80% barrier on SWE-bench Verified and strong performance on ARC-AGI-2 and BrowseComp-Plus.
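The rough arithmetic behind the efficiency claim above: a model priced at ~1/3 of its predecessor that also uses 76% fewer output tokens multiplies the two factors into the relative output-token spend. The factors are taken from the summary; the framing as a simple product is an illustrative simplification:

```python
# Relative output-spend arithmetic for the Opus 4.5 vs Opus 4.1 claims above.
PRICE_FACTOR = 1 / 3     # new per-token price relative to Opus 4.1 (~3x cut)
TOKEN_FACTOR = 1 - 0.76  # output tokens used relative to Opus 4.1 (76% fewer)

relative_output_spend = PRICE_FACTOR * TOKEN_FACTOR
print(f"{relative_output_spend:.2%} of the old output spend")  # 8.00% of the old output spend
```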
  • Nov 21
    AI Engineer Code Summit
    gemini-3-pro-image gemini-3 gpt-5 claude-3.7-sonnet google-deepmind togethercompute image-generation fine-tuning benchmarking agentic-ai physics model-performance instruction-following model-comparison time-horizon user-preference demishassabis omarsar0 lintool hrishioa teknium artificialanlys minyangtian1 ofirpress metr_evals scaling01
    The recent AIE Code Summit showcased key developments including Google DeepMind's Gemini 3 Pro Image model, Nano Banana Pro, which features enhanced text rendering, 4K visuals, and fine-grained editing capabilities. Community feedback highlights its strong performance in design and visualization tasks, with high user preference scores. Benchmarking updates reveal the new CritPt physics frontier benchmark where Gemini 3 Pro outperforms GPT-5, though AI still lags on complex unseen research problems. Agentic task evaluations show varied time horizons and performance gaps between open-weight and closed frontier models, emphasizing ongoing challenges in AI research and deployment. "Instruction following remains jagged for some users," and model fit varies by use case, with Gemini 3 excelling in UI and code tasks but showing regressions in transcription and writing fidelity.
  • Nov 20
    Nano Banana Pro (Gemini Image Pro) solves text-in-images, infographic generation, 2-4k resolution, and Google Search grounding
    gemini-3-pro gpt-5 google openai hugging-face togethercompute lmsys image-generation text-rendering model-provenance scientific-research proof-assistance multimodal-integration api-access fine-tuning jeffdean kevinweil demishassabis
    Google launched Gemini 3 Pro Image (Nano Banana Pro), a next-generation AI image generation and editing model with integrated Google Search grounding, multi-image composition, and fine-grained visual controls, offering pricing at $0.134 per 2K image and $0.24 per 4K image. It features improved text rendering with error rates dropping from 56% to 8% compared to its predecessor, and includes SynthID watermark checks for provenance. The model is available via Gemini App, API, LM Arena, Hugging Face Spaces, Together AI, and Flow. Meanwhile, OpenAI shared early experiments with GPT-5 accelerating scientific research, including proofs of previously unsolved problems in math, physics, biology, and materials science. "GPT-5 accelerated research tasks in math/physics/biology/materials; in 4, it helped find proofs of previously unsolved problems."
  • Nov 19
    OpenAI fires back: GPT-5.1-Codex-Max (API) and GPT 5.1 Pro (ChatGPT)
    gpt-5.1-codex-max gpt-5.1-codex gemini-3-pro claude-3.5-sonnet openai google anthropic langchain-ai coding autonomous-systems benchmarking model-scaling multi-agent-systems model-performance reasoning model-architecture sama
    OpenAI released GPT-5.1-Codex-Max, featuring compaction-native training, an "Extra High" reasoning mode, and claims of over 24-hour autonomous operation, showing significant performance gains on benchmarks like METR, CTF, and PaperBench. Google's Gemini 3 Pro demonstrates strong coding and reasoning capabilities, achieving new state-of-the-art results on SWE-bench Verified and WeirdML, with estimated model size between 5-10 trillion parameters. The AI coding agent ecosystem is rapidly evolving with integrations and tooling improvements from multiple companies. Sam Altman highlighted the significant improvements in GPT-5.1-Codex-Max. The news also covers educational offerings like ChatGPT for Teachers and multi-agent workflows involving Gemini 3, GPT-5.1-Codex-Max, and Claude Sonnet 4.5.
  • Nov 18
    Gemini 3 Pro — new GDM frontier model 6, Gemini 3 Deep Think, and Antigravity IDE
    gemini-3-pro gemini-2.5 grok-4.1 sonnet-4.5 gpt-5.1 google google-deepmind multimodality agentic-ai benchmarking context-window model-performance instruction-following model-pricing api model-release reasoning model-evaluation sundarpichai _philschmid oriol_vinyals
    Google launched Gemini 3 Pro, a state-of-the-art model with a 1M-token context window, multimodal reasoning, and strong agentic capabilities, priced significantly higher than Gemini 2.5. It leads major benchmarks, surpassing Grok 4.1 and competing closely with Sonnet 4.5 and GPT-5.1, though GPT-5.1 excels in ultralong summarization. Independent evaluations from Artificial Analysis, Vending Bench, ARC-AGI 2, Box, and PelicanBench validate Gemini 3 as a frontier LLM. Google also introduced Antigravity, an agentic IDE powered by Gemini 3 Pro and other models, featuring task orchestration and human-in-the-loop validation. The launch marks Google's strong return to AI with more models expected soon. "Google is very, very back in the business."
  • Nov 17
    xAI Grok 4.1: #1 in Text Arena, #1 in EQ-bench, and better Creative Writing
    grok-4.1 gpt-5.1 claude-4.1-opus grok-4 gpt-5 grok-4.1-thinking gpt-5-pro claude-4.5-haiku xai openai google-deepmind sakana-ai anthropic microsoft mufg khosla nea lux-capital iqt model-performance creative-writing hallucination evaluation-datasets ensemble-models weather-forecasting funding efficiency anti-hallucination arc-agi model-scaling yanndubs gregkamradt philschmid willccbb
    xAI launched Grok 4.1, achieving a #1 rank on the LM Arena Text Leaderboard with an Elo score of 1483, showing improvements in creative writing and anti-hallucination. OpenAI's GPT-5.1 "Thinking" demonstrates efficiency gains with ~60% less "thinking" on easy queries and strong ARC-AGI performance. Google DeepMind released WeatherNext 2, an ensemble generative model that is 8× faster and more accurate for global weather forecasts, integrated into multiple Google products. Sakana AI raised ¥20B ($135M) in Series B funding at a $2.63B valuation to focus on efficient AI for resource-constrained enterprise applications in Japan. New evaluations highlight tradeoffs between hallucination and knowledge accuracy across models including Claude 4.1 Opus and Anthropic models.
  • Nov 14
    not much happened today
    gpt-5.1 sonnet-4.5 opus-4.1 gemini-3 openai anthropic langchain-ai google-deepmind adaptive-reasoning developer-tools prompt-optimization json-schema agent-workflows context-engineering structured-outputs model-release benchmarking swyx allisontam_ gdb sama alexalbert__ simonw omarsar0 abacaj scaling01 amandaaskell
    OpenAI launched GPT-5.1 featuring "adaptive reasoning" and developer-focused API improvements, including prompt caching and a reasoning_effort toggle for latency/cost tradeoffs. Independent analysis shows a minor intelligence bump with significant gains in agentic coding benchmarks. Anthropic's Claude models introduced structured outputs with JSON schema compliance in public beta for Sonnet 4.5 and Opus 4.1, enhancing tooling and code execution workflows. Rumors of an Opus 4.5 release were debunked. LangChain released a "Deep Agents" package and context-engineering playbook to optimize agent workflows. The community is eagerly anticipating Google DeepMind's Gemini 3 model, hinted at in social media and upcoming AIE CODE events. "Tickets are sold out, but side events and volunteering opportunities are available."
  • Nov 13
    minor updates to GPT 5.1 and SIMA 2
    gpt-5.1 gpt-5.1-codex gpt-5.1-codex-mini sima-2 gemini openai google-deepmind github microsoft cursor_ai perplexity-ai weaviate llamaindex adaptive-reasoning agentic-coding tool-use context-engineering memory-architecture self-improvement retrieval-augmentation database-query-planning chart-parsing robotics sama allisontam_ cline cognition demishassabis omarsar0 helloiamleonie
    OpenAI released GPT-5.1 family models including 5.1-Codex and 5.1-Codex-Mini with improved steerability, faster responses, and new tools like apply_patch and shell command execution. Pricing remains unchanged from 5.0. Immediate integrations include GitHub Copilot, VS Code, Cursor, and Perplexity adopting GPT-5.1 models. Google DeepMind announced SIMA 2, a Gemini-powered agent capable of language instruction following, planning, and self-improvement without human feedback, targeting robotics applications. New research on context engineering and agentic tool use patterns was published, with contributions from Weaviate and LlamaIndex on database query planning and chart parsing respectively. "Adaptive reasoning" and agentic coding improvements are highlighted in GPT-5.1 Instant.
  • Nov 12
    GPT 5.1 in ChatGPT: No evals, but adaptive thinking and instruction following
    gpt-5.1 gpt-5.0 claude isaac-0.1 qwen3vl-235b glm-4.6 gemini openai anthropic waymo perceptron langchain llamaindex nousresearch adaptive-reasoning instruction-following personalization autonomous-driving robotics multimodality agent-evaluation agent-governance middleware structured-extraction benchmarking dmitri_dolgov jeffdean fidji_simo akshats07
    OpenAI launched GPT-5.1 with improvements in conversational tone, instruction following, and adaptive reasoning. GPT-5.0 is being sunset in 3 months. ChatGPT introduces new tone toggles for personalization, serving over 800 million users. Waymo rolls out freeway driving for public riders in major California cities, showcasing advances in autonomous driving. Anthropic's Project Fetch explores LLMs as robotics copilots using Claude. Perceptron releases a new API and Python SDK for multimodal perception-action apps supporting Isaac-0.1 and Qwen3VL-235B. Code Arena offers live coding evaluations supporting Claude, GPT-5, GLM-4.6, and Gemini. LangChain introduces middleware for agent governance with human-in-the-loop controls. LlamaIndex releases a structured extraction template for SEC filings using LlamaAgents. NousResearch promotes ARC Prize benchmarks for generalized intelligence evaluation.
  • Nov 11
    not much happened today
    gpt-5 qwen2.5-7b ernie-4.5-vl-28b-a3b-thinking gemini-2.5-pro llamacloud claude-code openai baidu databricks llamaindex togethercompute sakanaailabs reasoning-benchmarks reinforcement-learning fine-tuning multimodality document-intelligence retrieval-augmented-generation agentic-systems persona-simulation code-agents guardrails sakanaailabs micahgoldblum francoisfleuret matei_zaharia jerryjliu0 omarsar0 togethercompute imjaredz theo
    GPT-5 leads Sudoku-Bench solving 33% of puzzles but 67% remain unsolved, highlighting challenges in meta-reasoning and spatial logic. New training methods like GRPO fine-tuning and "Thought Cloning" show limited success. Research on "looped LLMs" suggests pretrained models benefit from repeated computation for better performance. Baidu's ERNIE-4.5-VL-28B-A3B-Thinking offers lightweight multimodal reasoning with Apache 2.0 licensing, outperforming Gemini-2.5-Pro and GPT-5-High on document tasks. Databricks ai_parse_document preview delivers cost-efficient document intelligence outperforming GPT-5 and Claude. Pathwork AI uses LlamaCloud for underwriting automation. Gemini File Search API enables agentic retrieval augmented generation (RAG) with MCP server integration. Together AI and Collinear launch TraitMix for persona-driven agent simulations integrated with Together Evals. Reports highlight risks in long-running code agents like Claude Code reverting changes, emphasizing guardrails. Community consensus favors multiple code copilots including Claude Code, Codex, and others.
  • Nov 10
    not much happened today
    kimi-k2-thinking kimi-k3 gelato-30b-a3b omnilingual-wav2vec-2.0 moonshot-ai meta-ai-fair togethercompute qwen attention-mechanisms quantization fine-tuning model-optimization agentic-ai speech-recognition multilingual-models gui-manipulation image-editing dataset-release yuchenj_uw scaling01 code_star omarsar0 kimi_moonshot anas_awadalla akhaliq minchoi
    Moonshot AI's Kimi K2 Thinking AMA revealed a hybrid attention stack using KDA + NoPE MLA outperforming full MLA + RoPE, with the Muon optimizer scaling to ~1T parameters and native INT4 QAT for cost-efficient inference. K2 Thinking ranks highly on LisanBench and LM Arena Text leaderboards, offering low-cost INT4 serving and strong performance in Math, Coding, and Creative Writing. It supports heavy agentic tool use with up to 300 tool requests per run, and the team recommends the official API for reliable long-trace inference. Meta AI released the Omnilingual ASR suite covering 1600+ languages including 500 underserved ones, plus a 7B wav2vec 2.0 model and an ASR corpus. The Gelato-30B-A3B grounding model for GUI-manipulation agents outperforms larger VLMs, promising immediate gains for computer-use agents. Qwen's image-edit LoRAs and light-restoration app were also highlighted.
  • Nov 07
    Terminal-Bench 2.0 and Harbor
    kimi-k2-thinking moonshot-ai anthropic hugging-face ollama slime-framework benchmarking agentic-ai quantization model-optimization inference model-deployment moe context-windows cost-efficiency clementdelangue dbreunig awnihannun crystalsssup kimi_moonshot
    Terminal-Bench has fixed task issues and launched version 2.0 with cloud container support via the Harbor framework, with strong early results from models like Claude 4.5 and Kimi K2 Thinking. Moonshot AI's Kimi K2 Thinking is a 1-trillion-parameter MoE reasoning model with ~32B active parameters, running natively in INT4 quantization and featuring a 256K context window. It leads open-weights benchmarks with an Artificial Analysis Intelligence Index score of 67 and strong agentic performance, and runs efficiently on consumer Apple silicon such as 2× M3 Ultra hardware. The model is broadly available on Hugging Face and Ollama Cloud, and integrated into frameworks like slime. Serving bottlenecks were traced to network bandwidth rather than GPU limits, highlighting infrastructure considerations for LLM deployment.
  • Nov 06
    Kimi K2 Thinking: 1T-A32B params, SOTA HLE, BrowseComp, TauBench && Soumith leaves Pytorch
    kimi-k2-thinking gemini moonshot-ai google apple vllm_project arena baseten yupp_ai mixture-of-experts quantization int4 context-window agentic-ai benchmarking model-deployment inference-acceleration api performance-optimization eliebakouch nrehiew_ andrew_n_carr ofirpress artificialanlys sundarpichai akhaliq
    Moonshot AI launched Kimi K2 Thinking, a 1-trillion-parameter mixture-of-experts (MoE) model with 32 billion active parameters, a 256K context window, and native INT4 quantization-aware training. It achieves state-of-the-art results on benchmarks like HLE (44.9%) and BrowseComp (60.2%), and sustains agentic tool use across 200-300 sequential tool calls. The model ships with vLLM support and OpenAI-compatible APIs, and is available on platforms like Arena, Baseten, and Yupp. Early user reports note some API instability under launch load. Meanwhile, Google announced the TPU v7 (Ironwood) with a 10× peak-performance improvement over TPU v5p, aimed at training and agentic inference for models like Gemini. Apple added support for M5 Neural Accelerators in llama.cpp for inference acceleration.
  • Nov 05
    not much happened today
    kimi-k2 qwen3-next nemotron-nano-2 granite-4.0 gpt-4.5 copilot codex vllm perplexity-ai ibm anthropic graphiti claude cursor-ai microsoft mixture-of-experts model-integration cloud-computing hybrid-models benchmarking agent-systems memory-persistence semantic-search code-retrieval context-length-optimization tool-use evaluation-frameworks software-development scaling01 cedric_chee aravsrinivas omarsar0 _avichawla pierceboggan jo_parkhurst jyangballin ofirpress ml_angelopoulos
    Kimi-K2 Reasoner has been integrated into vLLM, with SGLang support coming soon, featuring a 1.2-trillion-parameter MoE configuration. Perplexity AI released research on cloud-portable trillion-parameter MoE kernels optimized for AWS EFA, with potential integration into vLLM. IBM's vLLM team formalized hybrid dense and sparse expert models, supporting models like Qwen3-Next, Nemotron Nano 2, and Granite 4.0. Kimi-K2 reportedly scores 77% on GPQA Diamond versus GPT-4.5's 71.4%, though the figure is unverified. Anthropic published a guide on efficient tool-heavy agent systems using MCP patterns, reducing context tokens by ~98.7%. Graphiti MCP demonstrated shared memory across apps like Claude Desktop and Cursor for persistent agent memory. VS Code introduced an "Agent sessions" feature to unify agent management, including Copilot and Codex. Cursor AI improved coding accuracy via semantic search and code-retrieval embeddings. New evaluation frameworks like CodeClash and LMArena assess agent and coding-model performance in realistic multi-round tasks and occupation-tagged leaderboards.
See all issues
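A recurring theme in the entries above is Kimi K2 Thinking's native INT4 serving on consumer Apple silicon. A back-of-envelope sketch of why a 1-trillion-parameter model fits on a pair of 512 GB machines — assuming 4 bits per stored weight and ignoring KV cache, activations, and quantization overhead:

```python
def weight_memory_gb(total_params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage for a model, in GB (1 GB = 1e9 bytes).

    Counts only the stored weights; KV cache, activations, and
    quantization scales/metadata are ignored.
    """
    total_bytes = total_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# A 1T-parameter MoE at INT4 vs. BF16:
print(weight_memory_gb(1000, 4))   # ~500 GB at 4 bits/weight
print(weight_memory_gb(1000, 16))  # ~2000 GB at 16 bits/weight
```

At 4 bits per weight the model needs roughly 500 GB for weights, versus ~2 TB at BF16 — which is why INT4 quantization-aware training is what brings trillion-parameter MoE serving within reach of two 512 GB M3 Ultra machines.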

Let's Connect

If you want to get in touch with me about something or just to say hi, reach out on social media or send me an email.

  • GitHub /
  • X (@smol_ai) /
  • swyx at smol dot ai
© 2025 • AINews
You can also subscribe by RSS.