AINews

by smol.ai

How over 80k top AI Engineers keep up, every weekday.


We summarize top AI discords + AI reddits + AI X/Twitters, and send you a roundup each day!

"Highest-leverage 45 mins I spend every day" - Soumith

"best AI newsletter atm" and "I'm not sure that enough people subscribe" - Andrej

"genuinely incredible" - Chris

"surprisingly decent" - Hamel

You can pay for a customizable version here. Thanks to Pieter Levels for the Lex Fridman feature!

Last 30 days in AI

See all issues
  • Dec 18
    Claude Skills grows: Open Standard, Directory, Org Admin
    claude-skills gpt-5.2-codex gemini-3-flash functiongemma t5gemma-2 anthropic openai google-deepmind hugging-face agentic-ai fine-tuning long-context tool-calling on-device-ai multimodality security workflow-optimization sama gregbrockman philschmid
    Claude Skills have gained significant traction since their October launch, with the Claude Skills talk passing 100k views in a single day, signaling growing adoption and importance. Announcements include org admin support, a new Skills Directory, and the move to an open standard named Agent Skills. Among frontier model launches, OpenAI released GPT-5.2-Codex, touted as the best agentic coding model, with improvements in native compaction, long-context reliability, and tool calling, and an emphasis on real-world security impact. Google DeepMind introduced Gemini 3 Flash, positioning speed as a product feature that shapes workflows and user engagement, alongside FunctionGemma and T5Gemma 2, which emphasize on-device deployment, fine-tuning, and multimodality.
  • Dec 17
    Gemini 3.0 Flash Preview: 1/4 cost of Pro, but ~as smart, retakes Pareto Frontier
    gemini-3-flash gemini-3 gpt-5.2 gemini-3-pro google google-deepmind tool-calling multimodality benchmarking reasoning cost-efficiency model-performance context-window agentic-ai model-deployment sundar_pichai jeffdean demishassabis
    Google launched Gemini 3 Flash, a pro-grade reasoning model with Flash-level latency, supporting tool calling and multimodal IO and available via multiple platforms including Google AI Studio and Vertex AI. It offers competitive pricing at $0.50 per 1M input tokens and $3.00 per 1M output tokens, with context windows up to 1M tokens. Benchmarks show Gemini 3 Flash rivals or outperforms larger models like GPT-5.2 and Gemini 3 Pro on agentic, coding, and reasoning tasks, validated by ARC-AGI-2, SWE-bench, and LMArena results. Despite tradeoffs like high token use and hallucination rates, it is cost-effective overall. Sundar Pichai, Jeff Dean, and Demis Hassabis publicly celebrated the launch, and the model's tool-calling capabilities were demonstrated with 100 tools in a live demo.
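A quick back-of-envelope check of the Gemini 3 Flash rates quoted above ($0.50 per 1M input tokens, $3.00 per 1M output tokens); the request sizes in the example are hypothetical, only the per-token rates come from the announcement:

```python
# Per-token rates from the quoted Gemini 3 Flash pricing.
INPUT_RATE = 0.50 / 1_000_000   # USD per input token
OUTPUT_RATE = 3.00 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the quoted rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a 20k-token prompt with a 2k-token reply (illustrative sizes)
print(f"${request_cost(20_000, 2_000):.4f}")  # $0.0160
```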
  • Dec 16
    OpenAI GPT Image-1.5 claims to beat Nano Banana Pro, #1 across all Arenas, but completely fails Vibe Checks
    gpt-image-1.5 nano-banana-pro mimo-v2-flash deepseek-v3.2 openai gemini xiaomi lmsys deepseek openrouter image-generation instruction-following benchmarking model-efficiency long-context multi-token-prediction hybrid-attention model-optimization inference-speed agentic-workflows model-architecture model-quantization fuli_luo eliebakouch
    OpenAI released its new image model GPT Image 1.5, featuring precise image editing, better instruction following, improved text and markdown rendering, and generation up to 4× faster. Despite topping multiple leaderboards like LMArena (1277), Design Arena (1344), and AA Arena (1272), user feedback from Twitter, Reddit, and Discord communities is largely negative compared to Gemini's Nano Banana Pro. Xiaomi introduced MiMo-V2-Flash, a 309B MoE model optimized for inference efficiency with a 256K context window, achieving state-of-the-art scores on SWE-Bench. The model uses Hybrid Sliding Window Attention and multi-token prediction, offering significant speedups and efficiency improvements. The timing of OpenAI's launch amid competition from Gemini and Nano Banana Pro colors user sentiment, highlighting challenges in benchmark relevance.
  • Dec 15
    NVIDIA Nemotron 3: hybrid Mamba-Transformer completely open source models from 30B to 500B
    nemotron-3-nano qwen3-30b-a3b-base nvidia huggingface togethercompute baseten vllm llamaindex hybrid-architecture mixture-of-experts reinforcement-learning long-context model-release open-source-models model-training model-optimization benchmarking agent-training ctnzr andrew_n_carr awnihannun
    NVIDIA has released Nemotron 3 Nano, a fully open-source hybrid Mamba-Transformer Mixture-of-Experts (MoE) model with 30B parameters and a 1M-token context window. It includes open weights, training recipes, datasets, and an RL environment suite called NeMo Gym, supporting commercial use under the NVIDIA Open Model License. The model achieves state-of-the-art results on benchmarks like SWE-Bench and the Artificial Analysis Intelligence Index, outperforming Qwen3-30B A3B. Ecosystem support is immediate, with integrations into inference stacks like vLLM, llama.cpp, and Baseten. Upcoming larger models, Nemotron Super and Ultra, will feature NVFP4 pretraining and LatentMoE routing to optimize compute. This release marks a significant milestone for open-source American AI, with comprehensive open assets and an advanced hybrid architecture.
  • Dec 12
    not much happened today
    gpt-5.2 opus-4.5 gemini-3-pro gpt-5.1 olmo-3.1-32b qwen3-vl-235b openai allen_ai mistral-ai ollama lmstudio thinkymachines reinforcement-learning model-benchmarking long-context model-quantization model-optimization inference-speed sparsity fine-tuning vision sama scaling01 akhaliq artificialanlys lechmazur acerfur epochairesearch
    GPT-5.2 shows mixed performance in public evaluations, excelling in agentic tasks but at a significantly higher cost (~$620/run) compared to Opus 4.5 and GPT-5.1. It performs variably on reasoning and coding benchmarks, with some improvements on long-context tasks. Extended "reasoning effort" settings notably impact results. Aggregators rank Gemini 3 Pro above GPT-5.2 in task persistence. OpenAI released sparse activation models sparking debate on sparsity vs MoE architectures. Allen AI's Olmo 3.1 (32B) advances open reinforcement learning scale with substantial compute investment (~125k H100 hours). Mistral's Devstral-2 and llama.cpp improve local inference infrastructure with new features like GGUF support and distributed speedups. Tinker platform goes GA with vision input and finetuning support for Qwen3-VL-235B.
  • Dec 11
    GPT-5.2 (Instant/Thinking/Pro): 74% on GDPVal, 1.4x cost of GPT 5.1, on 10 Year OpenAI Anniversary
    gpt-5.2 openai scientific-reasoning knowledge-work long-context benchmarking performance-optimization pricing software-engineering vision sama yanndubs polynoamial scaling01
    OpenAI celebrates its 10-year anniversary with the launch of GPT-5.2, featuring significant across-the-board improvements alongside a rare 40% price increase. GPT-5.2 shows strong performance gains in scientific reasoning, knowledge work, and economic value tasks, achieving over 70.9% human expert parity on GDPval tasks and reaching 90.5% on ARC-AGI-1 with a large efficiency gain. Despite some mixed results in coding benchmarks and vision capabilities, GPT-5.2 is well received as a major update with extended context and tiered reasoning controls. Pricing is set at $1.75/M input and $14/M output tokens with a 90% cache discount. The update is live in ChatGPT and the API, marking a significant milestone for OpenAI's LLM development.
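A sketch of what the GPT-5.2 pricing quoted above works out to per request ($1.75/M input, $14/M output, 90% cache discount). It assumes the discount applies to cached input tokens, i.e. they bill at 10% of the input rate; the token counts are hypothetical:

```python
# Rates from the quoted GPT-5.2 pricing; cache behavior is an assumption.
INPUT = 1.75 / 1_000_000    # USD per fresh input token
OUTPUT = 14.00 / 1_000_000  # USD per output token
CACHED = INPUT * 0.10       # assumed: cache hits bill at 10% of input rate

def request_cost(fresh_in: int, cached_in: int, out: int) -> float:
    """Dollar cost of one request, splitting input into fresh vs cached tokens."""
    return fresh_in * INPUT + cached_in * CACHED + out * OUTPUT

# e.g. a 100k-token context with 90k served from cache, plus a 5k-token answer
print(f"${request_cost(10_000, 90_000, 5_000):.4f}")
```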
  • Dec 10
    not much happened today
    nomos-1 axiomprover devstral-2-small deepseek-v3.2 claude-code cursor-2.2 claude-opus-4.5 gpt-5 claude-sonnet-4.5 gemini-3-pro llama qwen mistral gemma nousresearch thinkymachines mistral-ai deepseek anthropic cursor microsoft langchain-ai openai gemini intel vllm_project danielhanchen math formal-reasoning agentic-systems asynchronous-execution multi-agent-systems observability benchmarking quantization post-training-quantization training-speedup kernel-optimization inference-efficiency
    NousResearch's Nomos 1 is a 30B open math model achieving a top Putnam score with only ~3B active parameters, enabling consumer Mac inference. AxiomProver also posts top Putnam results using ThinkyMachines' RL stack. Mistral's Devstral 2 Small outperforms DeepSeek v3.2 in 71% of preferences with better speed and cost. Anthropic's Claude Code introduces asynchronous agent execution. Cursor 2.2 adds deep agent primitives like Debug and Plan Modes. VS Code launches unified agent chat sessions improving multi-agent workflows. LangChain releases "Polly" for agent observability. The Stirrup harness leads OpenAI GDPval benchmarks with Claude Opus 4.5, GPT-5, and Gemini 3 Pro following. Advances in quantization include vLLM integrating Intel's AutoRound PTQ for efficient serving. Unsloth achieves up to 3× training speedups with new kernels across Llama, Qwen, Mistral, and Gemma models. "Compositional reasoning + specialized post-training under constrained active params can rival frontier closed models on formal math."
  • Dec 09
    MCP -> Agentic AI Foundation, Mistral Devstral 2
    devstral-2 devstral-small-2 sonnet-4.3 deepseek-v3.2 qwen3-vl openai anthropic block mistral-ai alibaba linux-foundation deepseek agentic-ai coding-models reinforcement-learning model-performance model-optimization open-weights cli-tools multi-file-code-automation data-decontamination moe reward-models rl-stability guillaumelample b_roziere qtnx_ charliermarsh omarsar0 eliebakouch justinwaugh cwolferesearch pan
    The launch of the Agentic AI Foundation under the Linux Foundation marks a significant collaborative milestone, uniting projects from Anthropic, OpenAI, and Block. Mistral released Devstral 2, a coding model with 123B parameters and open weights, offering a cost-effective alternative to Sonnet 4.3 and competitive performance against DeepSeek v3.2. The new Mistral Vibe CLI supports agentic coding workflows with rapid ecosystem integration. Alibaba introduced Soft Adaptive Policy Optimization (SAPO) for reinforcement learning tuning, improving stability and performance in Qwen3-VL across multiple tasks. Research highlights include the importance of data decontamination in RL and ongoing discussions on MoE RL stability and reward hacking mitigation.
  • Dec 08
    not much happened today
    glm-4.6v glm-4.6v-flash jina-vlm-2b hugging-face zhipu-ai jina-ai google-deepmind axiomprover fine-tuning multimodality model-optimization long-context mechanistic-interpretability formal-methods sequence-architectures reinforcement-learning lioronai akshay_pachaar _akhaliq ben_burtenshaw vllm_project prince_canuma zenmuxai eliebakouch theturingpost axiommathai neelnanda5 sarahookr
    Claude Code Skills gains attention with a published talk and Hugging Face's new "skill" enabling one-line fine-tuning pipelines for models from ~0.5B to 70B parameters, supporting SFT, DPO, and GRPO, costing as low as ~$0.30 for small runs. Zhipu AI launches multimodal models GLM-4.6V (106B params MoE) and GLM-4.6V-Flash (9B dense), featuring 128k context and native multimodal function calling, with free Flash variant and API pricing detailed. Jina AI releases Jina-VLM (2B), a compact multilingual VLM excelling in diagrams and documents with top benchmark scores. At NeurIPS 2025, research highlights include Google's post-Transformer sequence architectures (Moneta, Yaad, Memora) showing up to 20% gains in long-context retrieval, AxiomProver's autonomous Lean system solving 9/12 Putnam 2025 problems rapidly, and mechanistic interpretability advances discussed by Chris Olah emphasizing scalable tooling.
  • Dec 05
    not much happened today
    vllm-0.12.0 gemma3n qwen3-omni qwen3-vl gpt-5.1-codex-max gemini-3-pro runway-gen-4.5 kling-video-2.6 vllm nvidia huggingface langchain-ai together-ai meta-ai-fair sonarsource openrouter runway gemini arena gpu-programming quantization multimodality agent-platforms reinforcement-learning static-analysis reasoning inference-infrastructure model-optimization economics audio video-generation jeremyphoward mervenoyann sydneyrunkle swyx maximelabonne
    vLLM 0.12.0 introduces DeepSeek support, GPU Model Runner V2, and quantization improvements with PyTorch 2.9.0 and CUDA 12.9. NVIDIA launches CUDA Tile IR and cuTile Python for advanced GPU tensor operations targeting Blackwell GPUs. Hugging Face releases Transformers v5 RC with an any-to-any multimodal pipeline supporting models like Gemma3n and Qwen3-Omni. Agent platforms see updates from LangChain with content moderation and cost tracking, Together AI and Meta AI collaborate on RL for long-horizon workflows, and SonarSource integrates static analysis into AI codegen. Economic insights from OpenRouter highlight coding as a key AI application, with reasoning models surpassing 50% usage and market bifurcation between premium and open models. Additionally, Kling Video 2.6 debuts native audio capabilities, and Runway Gen-4.5, Qwen3-TTS, and Gemini 3 Pro advance multimodality.
  • Dec 04
    OpenRouter's State of AI - An Empirical 100 Trillion Token Study
    grok-code-fast gemini-3 gemini-3-deep-think gpt-5.1-codex-max openrouter deepseek anthropic google google-deepmind reasoning coding tokenization long-context model-architecture benchmarking agentic-ai prompt-engineering quocleix noamshazeer mirrokni
    OpenRouter released its first survey showing usage trends with 7 trillion tokens proxied weekly, highlighting a 52% roleplay bias. Deepseek's open model market share has sharply declined due to rising coding model usage. Reasoning model token usage surged from 0% to over 50%. Grok Code Fast shows high usage, while Anthropic leads in tool calling and coding requests with around 60% share. Input tokens quadrupled and output tokens tripled this year, driven mainly by programming use cases, which dominate spending and volume. Google launched Gemini 3 Deep Think, featuring parallel thinking and achieving 45.1% on ARC-AGI-2 benchmarks, and previewed Titans, a long-context neural memory architecture scaling beyond 2 million tokens. These advances were shared by Google DeepMind and Google AI on Twitter.
  • Dec 03
    not much happened today
    kling-2.6 kling-o1 runway-gen-4.5 gemini-3 deepseek-v3.2 ministral-3 evoqwen2.5-vl hermes-4.3 intellect-3 openai anthropic google runway elevenlabs freepik openart deepseek mistral-ai alibaba nous-research video-generation audio-processing multimodality image-generation reasoning model-quantization sparse-attention model-pricing multimodal-models retrieval-augmentation model-training model-release
    OpenAI's Code Red response and Anthropic's IPO are major highlights. In AI video and imaging, Kling 2.6 introduces native audio co-generation with coherent lip-sync, partnered with platforms like ElevenLabs and OpenArt. Runway Gen-4.5 enhances lighting fidelity, while Google's Gemini 3 Nano Banana Pro supports advanced image compositing. Open model releases include DeepSeek V3.2 with sparse attention and cost-effective pricing, and Mistral's Ministral 3 multimodal family with strong 14B variants. Retrieval and code models from Alibaba's EvoQwen2.5-VL and Nous Research's Hermes 4.3 show competitive performance with permissive licensing and HF availability. The community arena sees additions like INTELLECT-3 (106B MoE). "coherent looking & sounding output" and "auto-lighting to match scene mood" are noted advancements.
  • Dec 02
    DeepSeek V3.2 & 3.2-Speciale: GPT5-High Open Weights, Context Management, Plans for Compute Scaling
    deepseek-v3.2 deepseek-v3.2-speciale gpt-5-high sonnet-4.5 gemini-3-pro deepseek_ai lm-arena agentic-ai reinforcement-learning large-context-windows model-benchmarking model-performance multi-agent-systems model-training model-deployment suchenzang teortaxestex
    DeepSeek launched the DeepSeek V3.2 family, including Standard, Thinking, and Speciale variants with up to a 131K context window and competitive benchmarks against GPT-5-High, Sonnet 4.5, and Gemini 3 Pro. The release features a novel Large Scale Agentic Task Synthesis Pipeline focused on agentic behaviors, plus improvements in reinforcement learning post-training algorithms. The models are available on platforms like LM Arena with pricing around $0.28/$0.42 per million tokens. Community feedback is mixed, praising the frontier reasoning capabilities but critiquing the chat UI experience. Key figures include Susan Zhang and Teortaxes, who provided commentary on the release.
  • Dec 02
    Mistral 3: Mistral Large 3 + Ministral 3B/8B/14B open weights models
    mistral-large-3 ministral-3 clara-7b-instruct gen-4.5 claude-code mistral-ai anthropic apple runway moondream sparse-moe multimodality benchmarking open-source model-licensing model-performance long-context inference-optimization instruction-following local-inference code-generation model-integration anjney_midha _akhaliq alexalbert__ _catwu mikeyk
    Mistral has launched the Mistral 3 family including Ministral 3 models (3B/8B/14B) and Mistral Large 3, a sparse MoE model with 675B total parameters and 256k context window, all under an Apache 2.0 open license. Early benchmarks rank Mistral Large 3 at #6 among open models with strong coding performance. The launch includes broad ecosystem support such as vLLM, llama.cpp, Ollama, and LM Studio integrations. Meanwhile, Anthropic acquired the open-source Bun runtime to accelerate Claude Code, which reportedly reached a $1B run-rate in ~6 months. Anthropic also announced discounted Claude plans for nonprofits and shared insights on AI's impact on work internally.
  • Nov 26
    not much happened today
    claude-opus-4.5 qwen-3-4b qwen-3-8b qwen-3-14b deepseek-r1 anthropic booking.com perplexity-ai langchain claude scaling01 deepseek qwen prefect agent-systems multi-agent-systems reasoning benchmarking cost-efficiency model-optimization long-context memory-management reinforcement-learning model-performance multi-agent-communication latent-representation inference-cost software-integration jeremyphoward alexalbert__ omarsar0 lingyang_pu dair_ai
    Anthropic introduces durable agents and MCP tasks for long-running workflows, with practical engineering patterns and integrations like Prefect. Booking.com deploys a large-scale agent system improving customer satisfaction using LangGraph, Kubernetes, GPT-4 Mini, and Weaviate. Perplexity rolls out user-level memory and virtual try-on features. Claude Opus 4.5 leads on LisanBench and Code Arena WebDev benchmarks with mixed community feedback on its "thinking" and "non-thinking" modes, while improving cost-efficiency and UX with batch APIs and context compaction. Research on multi-agent systems shows LatentMAS reduces communication tokens by 70-84% and improves accuracy using Qwen3 models, and reasoning trace distillation achieves significant token reduction with maintained accuracy, highlighting the importance of reasoning trace style.
  • Nov 25
    Black Forest Labs FLUX.2 [pro|flex|dev|klein]: near-Nano Banana quality but Open Weights
    flux-2 flux-2-dev claude-opus-4.5 gpt-5.1 gemini-3-pro black-forest-labs anthropic huggingface multi-reference-support variational-autoencoder image-generation open-weights agentic-coding token-efficiency benchmarking prompting model-performance
    Black Forest Labs' FLUX.2 release features Multi-Reference Support for up to 10 reference images with consistency and output up to 4 megapixels, in four form factors: Pro, Flex, Dev (a 32B open-weight model), and Klein (open weights TBA). The new FLUX.2 VAE introduces a variational autoencoder optimizing learnability, quality, and compression. Meanwhile, Anthropic's Claude Opus 4.5 demonstrates strong performance and efficiency, scoring 70 on Artificial Analysis, tying with GPT-5.1 high and trailing Gemini 3 Pro (73). Opus 4.5 excels in agentic coding benchmarks and research evaluations, with notable token efficiency and reduced running costs. "Opus 4.5 leads Gemini 3 Pro on SWE-Bench Verified and tops the AICodeKing leaderboard," and it shows strong QA and systematic review capabilities. Anthropic also released a dense prompting guide for Opus 4.5.
  • Nov 24
    Claude Opus 4.5: 3rd new SOTA coding model in past week, 1/3 the price of Opus
    claude-opus-4.5 gemini-3-pro gpt-5.1-codex-max opus-4.1 sonnet-4.5 anthropic amazon google anthropic coding agents tool-use token-efficiency benchmarking api model-pricing model-performance effort-control context-compaction programmatic-tool-calling alexalbert__ btibor91 scaling01 klieret
    Anthropic launched Claude Opus 4.5, a new flagship model excelling in coding, agents, and tooling with a significant 3x price cut compared to Opus 4.1 and improved token efficiency using 76% fewer output tokens. Opus 4.5 achieved a new SOTA on SWE-bench Verified with 80.9% accuracy, surpassing previous models like Gemini 3 Pro and GPT-5.1-Codex-Max. The update includes advanced API features such as effort control, context compaction, and programmatic tool calling, improving tool accuracy and reducing token usage. Claude Code is now bundled with Claude Desktop, and new integrations like Claude for Chrome and Excel are rolling out. Benchmarks show Opus 4.5 breaking the 80% barrier on SWE-bench Verified and strong performance on ARC-AGI-2 and BrowseComp-Plus.
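The rough arithmetic behind the efficiency claim above: a model priced at ~1/3 of its predecessor that also uses 76% fewer output tokens multiplies the two factors into the relative output-token spend. The factors are taken from the summary; the framing as a simple product is an illustrative simplification:

```python
# Relative output-spend arithmetic for the Opus 4.5 vs Opus 4.1 claims above.
PRICE_FACTOR = 1 / 3     # new per-token price relative to Opus 4.1 (~3x cut)
TOKEN_FACTOR = 1 - 0.76  # output tokens used relative to Opus 4.1 (76% fewer)

relative_output_spend = PRICE_FACTOR * TOKEN_FACTOR
print(f"{relative_output_spend:.2%} of the old output spend")  # 8.00% of the old output spend
```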
  • Nov 21
    AI Engineer Code Summit
    gemini-3-pro-image gemini-3 gpt-5 claude-3.7-sonnet google-deepmind togethercompute image-generation fine-tuning benchmarking agentic-ai physics model-performance instruction-following model-comparison time-horizon user-preference demishassabis omarsar0 lintool hrishioa teknium artificialanlys minyangtian1 ofirpress metr_evals scaling01
    The recent AIE Code Summit showcased key developments including Google DeepMind's Gemini 3 Pro Image model, Nano Banana Pro, which features enhanced text rendering, 4K visuals, and fine-grained editing capabilities. Community feedback highlights its strong performance in design and visualization tasks, with high user preference scores. Benchmarking updates reveal the new CritPt physics frontier benchmark where Gemini 3 Pro outperforms GPT-5, though AI still lags on complex unseen research problems. Agentic task evaluations show varied time horizons and performance gaps between open-weight and closed frontier models, emphasizing ongoing challenges in AI research and deployment. "Instruction following remains jagged for some users," and model fit varies by use case, with Gemini 3 excelling in UI and code tasks but showing regressions in transcription and writing fidelity.
  • Nov 20
    Nano Banana Pro (Gemini Image Pro) solves text-in-images, infographic generation, 2-4k resolution, and Google Search grounding
    gemini-3-pro gpt-5 google openai hugging-face togethercompute lmsys image-generation text-rendering model-provenance scientific-research proof-assistance multimodal-integration api-access fine-tuning jeffdean kevinweil demishassabis
    Google launched Gemini 3 Pro Image (Nano Banana Pro), a next-generation AI image generation and editing model with integrated Google Search grounding, multi-image composition, and fine-grained visual controls, offering pricing at $0.134 per 2K image and $0.24 per 4K image. It features improved text rendering with error rates dropping from 56% to 8% compared to its predecessor, and includes SynthID watermark checks for provenance. The model is available via Gemini App, API, LM Arena, Hugging Face Spaces, Together AI, and Flow. Meanwhile, OpenAI shared early experiments with GPT-5 accelerating scientific research, including proofs of previously unsolved problems in math, physics, biology, and materials science. "GPT-5 accelerated research tasks in math/physics/biology/materials; in 4, it helped find proofs of previously unsolved problems."
  • Nov 19
    OpenAI fires back: GPT-5.1-Codex-Max (API) and GPT 5.1 Pro (ChatGPT)
    gpt-5.1-codex-max gpt-5.1-codex gemini-3-pro claude-3.5-sonnet openai google anthropic langchain-ai coding autonomous-systems benchmarking model-scaling multi-agent-systems model-performance reasoning model-architecture sama
    OpenAI released GPT-5.1-Codex-Max, featuring compaction-native training, an "Extra High" reasoning mode, and claims of over 24-hour autonomous operation, showing significant performance gains on benchmarks like METR, CTF, and PaperBench. Google's Gemini 3 Pro demonstrates strong coding and reasoning capabilities, achieving new state-of-the-art results on SWE-bench Verified and WeirdML, with estimated model size between 5-10 trillion parameters. The AI coding agent ecosystem is rapidly evolving with integrations and tooling improvements from multiple companies. Sam Altman highlighted the significant improvements in GPT-5.1-Codex-Max. The news also covers educational offerings like ChatGPT for Teachers and multi-agent workflows involving Gemini 3, GPT-5.1-Codex-Max, and Claude Sonnet 4.5.
  • Nov 18
    Gemini 3 Pro — new GDM frontier model 6, Gemini 3 Deep Think, and Antigravity IDE
    gemini-3-pro gemini-2.5 grok-4.1 sonnet-4.5 gpt-5.1 google google-deepmind multimodality agentic-ai benchmarking context-window model-performance instruction-following model-pricing api model-release reasoning model-evaluation sundarpichai _philschmid oriol_vinyals
    Google launched Gemini 3 Pro, a state-of-the-art model with a 1M-token context window, multimodal reasoning, and strong agentic capabilities, priced significantly higher than Gemini 2.5. It leads major benchmarks, surpassing Grok 4.1 and competing closely with Sonnet 4.5 and GPT-5.1, though GPT-5.1 excels in ultralong summarization. Independent evaluations from Artificial Analysis, Vending Bench, ARC-AGI 2, Box, and PelicanBench validate Gemini 3 as a frontier LLM. Google also introduced Antigravity, an agentic IDE powered by Gemini 3 Pro and other models, featuring task orchestration and human-in-the-loop validation. The launch marks Google's strong return to AI with more models expected soon. "Google is very, very back in the business."
  • Nov 17
    xAI Grok 4.1: #1 in Text Arena, #1 in EQ-bench, and better Creative Writing
    grok-4.1 gpt-5.1 claude-4.1-opus grok-4 gpt-5 grok-4.1-thinking gpt-5-pro claude-4.5-haiku xai openai google-deepmind sakana-ai anthropic microsoft mufg khosla nea lux-capital iqt model-performance creative-writing hallucination evaluation-datasets ensemble-models weather-forecasting funding efficiency anti-hallucination arc-agi model-scaling yanndubs gregkamradt philschmid willccbb
    xAI launched Grok 4.1, achieving a #1 rank on the LM Arena Text Leaderboard with an Elo score of 1483, showing improvements in creative writing and anti-hallucination. OpenAI's GPT-5.1 "Thinking" demonstrates efficiency gains with ~60% less "thinking" on easy queries and strong ARC-AGI performance. Google DeepMind released WeatherNext 2, an ensemble generative model that is 8× faster and more accurate for global weather forecasts, integrated into multiple Google products. Sakana AI raised ¥20B ($135M) in Series B funding at a $2.63B valuation to focus on efficient AI for resource-constrained enterprise applications in Japan. New evaluations highlight tradeoffs between hallucination and knowledge accuracy across models including Claude 4.1 Opus and Anthropic models.
  • Nov 14
    not much happened today
    gpt-5.1 sonnet-4.5 opus-4.1 gemini-3 openai anthropic langchain-ai google-deepmind adaptive-reasoning developer-tools prompt-optimization json-schema agent-workflows context-engineering structured-outputs model-release benchmarking swyx allisontam_ gdb sama alexalbert__ simonw omarsar0 abacaj scaling01 amandaaskell
    OpenAI launched GPT-5.1 featuring "adaptive reasoning" and developer-focused API improvements, including prompt caching and a reasoning_effort toggle for latency/cost tradeoffs. Independent analysis shows a minor intelligence bump with significant gains in agentic coding benchmarks. Anthropic's Claude models introduced structured outputs with JSON schema compliance in public beta for Sonnet 4.5 and Opus 4.1, enhancing tooling and code execution workflows. Rumors of an Opus 4.5 release were debunked. LangChain released a "Deep Agents" package and context-engineering playbook to optimize agent workflows. The community is eagerly anticipating Google DeepMind's Gemini 3 model, hinted at in social media and upcoming AIE CODE events. "Tickets are sold out, but side events and volunteering opportunities are available."
  • Nov 13
    minor updates to GPT 5.1 and SIMA 2
    gpt-5.1 gpt-5.1-codex gpt-5.1-codex-mini sima-2 gemini openai google-deepmind github microsoft cursor_ai perplexity-ai weaviate llamaindex adaptive-reasoning agentic-coding tool-use context-engineering memory-architecture self-improvement retrieval-augmentation database-query-planning chart-parsing robotics sama allisontam_ cline cognition demishassabis omarsar0 helloiamleonie
    OpenAI released GPT-5.1 family models including 5.1-Codex and 5.1-Codex-Mini with improved steerability, faster responses, and new tools like apply_patch and shell command execution. Pricing remains unchanged from 5.0. Immediate integrations include GitHub Copilot, VS Code, Cursor, and Perplexity adopting GPT-5.1 models. Google DeepMind announced SIMA 2, a Gemini-powered agent capable of language instruction following, planning, and self-improvement without human feedback, targeting robotics applications. New research on context engineering and agentic tool use patterns was published, with contributions from Weaviate and LlamaIndex on database query planning and chart parsing respectively. "Adaptive reasoning" and agentic coding improvements are highlighted in GPT-5.1 Instant.
  • Nov 12
    GPT 5.1 in ChatGPT: No evals, but adaptive thinking and instruction following
    gpt-5.1 gpt-5.0 claude isaac-0.1 qwen3vl-235b glm-4.6 gemini openai anthropic waymo perceptron langchain llamaindex nousresearch adaptive-reasoning instruction-following personalization autonomous-driving robotics multimodality agent-evaluation agent-governance middleware structured-extraction benchmarking dmitri_dolgov jeffdean fidji_simo akshats07
    OpenAI launched GPT-5.1 with improvements in conversational tone, instruction following, and adaptive reasoning. GPT-5.0 is being sunset in 3 months. ChatGPT introduces new tone toggles for personalization, serving over 800 million users. Waymo rolls out freeway driving for public riders in major California cities, showcasing advances in autonomous driving. Anthropic's Project Fetch explores LLMs as robotics copilots using Claude. Perceptron releases a new API and Python SDK for multimodal perception-action apps supporting Isaac-0.1 and Qwen3VL-235B. Code Arena offers live coding evaluations supporting Claude, GPT-5, GLM-4.6, and Gemini. LangChain introduces middleware for agent governance with human-in-the-loop controls. LlamaIndex releases a structured extraction template for SEC filings using LlamaAgents. NousResearch promotes ARC Prize benchmarks for generalized intelligence evaluation.
  • Nov 11
    not much happened today
    gpt-5 qwen2.5-7b ernie-4.5-vl-28b-a3b-thinking gemini-2.5-pro llamacloud claude-code openai baidu databricks llamaindex togethercompute sakanaailabs reasoning-benchmarks reinforcement-learning fine-tuning multimodality document-intelligence retrieval-augmented-generation agentic-systems persona-simulation code-agents guardrails sakanaailabs micahgoldblum francoisfleuret matei_zaharia jerryjliu0 omarsar0 togethercompute imjaredz theo
    GPT-5 leads Sudoku-Bench solving 33% of puzzles but 67% remain unsolved, highlighting challenges in meta-reasoning and spatial logic. New training methods like GRPO fine-tuning and "Thought Cloning" show limited success. Research on "looped LLMs" suggests pretrained models benefit from repeated computation for better performance. Baidu's ERNIE-4.5-VL-28B-A3B-Thinking offers lightweight multimodal reasoning with Apache 2.0 licensing, outperforming Gemini-2.5-Pro and GPT-5-High on document tasks. Databricks ai_parse_document preview delivers cost-efficient document intelligence outperforming GPT-5 and Claude. Pathwork AI uses LlamaCloud for underwriting automation. Gemini File Search API enables agentic retrieval augmented generation (RAG) with MCP server integration. Together AI and Collinear launch TraitMix for persona-driven agent simulations integrated with Together Evals. Reports highlight risks in long-running code agents like Claude Code reverting changes, emphasizing guardrails. Community consensus favors multiple code copilots including Claude Code, Codex, and others.
  • Nov 10
    not much happened today
    kimi-k2-thinking kimi-k3 gelato-30b-a3b omnilingual-wav2vec-2.0 moonshot-ai meta-ai-fair togethercompute qwen attention-mechanisms quantization fine-tuning model-optimization agentic-ai speech-recognition multilingual-models gui-manipulation image-editing dataset-release yuchenj_uw scaling01 code_star omarsar0 kimi_moonshot anas_awadalla akhaliq minchoi
    Moonshot AI's Kimi K2 Thinking AMA revealed a hybrid attention stack using KDA + NoPE MLA outperforming full MLA + RoPE, with the Muon optimizer scaling to ~1T parameters and native INT4 QAT for cost-efficient inference. K2 Thinking ranks highly on LisanBench and LM Arena Text leaderboards, offering low-cost INT4 serving and strong performance in Math, Coding, and Creative Writing. It supports heavy agentic tool use with up to 300 tool requests per run, and the team recommends the official API for reliable long-trace inference. Meta AI released the Omnilingual ASR suite covering 1600+ languages including 500 underserved ones, plus a 7B wav2vec 2.0 model and an ASR corpus. The Gelato-30B-A3B grounding model for GUI-manipulation agents outperforms larger VLMs, promising immediate gains for computer-use agents. Qwen's image-edit LoRAs and light-restoration app were also highlighted.
  • Nov 07
    Terminal-Bench 2.0 and Harbor
    kimi-k2-thinking moonshot-ai anthropic hugging-face ollama slime-framework benchmarking agentic-ai quantization model-optimization inference model-deployment moe context-windows cost-efficiency clementdelangue dbreunig awnihannun crystalsssup kimi_moonshot
    Terminal-Bench has fixed task issues and launched version 2.0 with cloud container support via the Harbor framework, with strong early results from models like Claude 4.5 and Kimi K2 Thinking. Moonshot AI's Kimi K2 Thinking is a 1-trillion-parameter MoE reasoning model with ~32B active parameters, running natively in INT4 quantization and featuring a 256K context window. It leads open-weights benchmarks with an Artificial Analysis Intelligence Index score of 67 and strong agentic performance, and runs efficiently on consumer Apple silicon such as 2× M3 Ultra hardware. The model is broadly available on Hugging Face and Ollama Cloud, and integrated into frameworks like slime. Serving bottlenecks were traced to network bandwidth rather than GPU limits, highlighting infrastructure considerations for LLM deployment.
  • Nov 06
    Kimi K2 Thinking: 1T-A32B params, SOTA HLE, BrowseComp, TauBench && Soumith leaves Pytorch
    kimi-k2-thinking gemini moonshot-ai google apple vllm_project arena baseten yupp_ai mixture-of-experts quantization int4 context-window agentic-ai benchmarking model-deployment inference-acceleration api performance-optimization eliebakouch nrehiew_ andrew_n_carr ofirpress artificialanlys sundarpichai akhaliq
    Moonshot AI launched Kimi K2 Thinking, a 1-trillion-parameter mixture-of-experts (MoE) model with 32 billion active parameters, a 256K context window, and native INT4 quantization-aware training. It achieves state-of-the-art results on benchmarks like HLE (44.9%) and BrowseComp (60.2%), and sustains agentic tool use across 200-300 sequential tool calls. The model ships with vLLM support and OpenAI-compatible APIs, and is available on platforms like Arena, Baseten, and Yupp. Early user reports note some API instability under launch load. Meanwhile, Google announced the TPU v7 (Ironwood) with a 10× peak-performance improvement over TPU v5p, aimed at training and agentic inference for models like Gemini. Apple added support for M5 Neural Accelerators in llama.cpp for inference acceleration.
  • Nov 05
    not much happened today
    kimi-k2 qwen3-next nemotron-nano-2 granite-4.0 gpt-4.5 copilot codex vllm perplexity-ai ibm anthropic graphiti claude cursor-ai microsoft mixture-of-experts model-integration cloud-computing hybrid-models benchmarking agent-systems memory-persistence semantic-search code-retrieval context-length-optimization tool-use evaluation-frameworks software-development scaling01 cedric_chee aravsrinivas omarsar0 _avichawla pierceboggan jo_parkhurst jyangballin ofirpress ml_angelopoulos
    Kimi-K2 Reasoner has been integrated into vLLM, with SGLang support coming soon, featuring a 1.2-trillion-parameter MoE configuration. Perplexity AI released research on cloud-portable trillion-parameter MoE kernels optimized for AWS EFA, with potential integration into vLLM. IBM's vLLM team formalized hybrid dense and sparse expert models, supporting models like Qwen3-Next, Nemotron Nano 2, and Granite 4.0. Kimi-K2 reportedly scores 77% on GPQA Diamond versus GPT-4.5's 71.4%, though the figure is unverified. Anthropic published a guide on efficient tool-heavy agent systems using MCP patterns, reducing context tokens by ~98.7%. Graphiti MCP demonstrated shared memory across apps like Claude Desktop and Cursor for persistent agent memory. VS Code introduced an "Agent sessions" feature to unify agent management, including Copilot and Codex. Cursor AI improved coding accuracy via semantic search and code-retrieval embeddings. New evaluation frameworks like CodeClash and LMArena assess agent and coding-model performance in realistic multi-round tasks and occupation-tagged leaderboards.
See all issues
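A recurring theme in the entries above is Kimi K2 Thinking's native INT4 serving on consumer Apple silicon. A back-of-envelope sketch of why a 1-trillion-parameter model fits on a pair of 512 GB machines — assuming 4 bits per stored weight and ignoring KV cache, activations, and quantization overhead:

```python
def weight_memory_gb(total_params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight storage for a model, in GB (1 GB = 1e9 bytes).

    Counts only the stored weights; KV cache, activations, and
    quantization scales/metadata are ignored.
    """
    total_bytes = total_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# A 1T-parameter MoE at INT4 vs. BF16:
print(weight_memory_gb(1000, 4))   # ~500 GB at 4 bits/weight
print(weight_memory_gb(1000, 16))  # ~2000 GB at 16 bits/weight
```

At 4 bits per weight the model needs roughly 500 GB for weights, versus ~2 TB at BF16 — which is why INT4 quantization-aware training is what brings trillion-parameter MoE serving within reach of two 512 GB M3 Ultra machines.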

Let's Connect

If you want to get in touch with me about something or just to say hi, reach out on social media or send me an email.

  • GitHub /
  • X (@smol_ai) /
  • swyx at smol dot ai
© 2025 • AINews
You can also subscribe by RSS.