AINews

by smol.ai

How over 150k top AI Engineers keep up, every weekday.

We summarize top AI discords + AI reddits + AI X/Twitters, and send you a roundup each day!

"Highest-leverage 45 mins I spend everyday" - Soumith

"best AI newsletter atm" and "I'm not sure that enough people subscribe" - Andrej

"genuinely incredible" - Chris

"surprisingly decent" - Hamel

Thanks to Pieter Levels for the Lex Fridman feature!

Last 30 days in AI

  • Jun 19
    not much happened today
    glm-5.2 opus-4.8 gpt-5.5 nous-research hugging-face cloudflare open-weight-models coding agent-engineering agent-fan-out loop-engineering model-serving infrastructure software-engineering model-evaluation open-agent-stack session-compression patrick_toulme thomas_wolf andrew_ng meryem_arik banteg graham_neubig harrison_chase jared_from_cognition omar_sanseviero teknium
    GLM-5.2 emerges as a leading open-weight coding model rivaling Opus 4.8 and GPT-5.5 in software engineering tasks, emphasizing the strategic importance of open models for provider competition, on-prem deployment, and fine-tuning rights. Experts like Patrick Toulme and Thomas Wolf highlight its frontier capabilities and structural impact on the AI ecosystem. The usability of GLM-5.2 depends heavily on serving infrastructure and agent harnesses, with tools like SGLang cookbooks and deepagents code enhancing evaluation and deployment. In agent engineering, the focus shifts to orchestration patterns such as agent fan-out and loop engineering, with Hermes Agent v0.17.0 advancing as a robust open agent stack supported by community-driven deployments. Additionally, Cloudflare is becoming a significant player in agent infrastructure.
  • Jun 18
    not much happened today
    glm-5.2 opus-4.8 gpt-5.5 laguna-m.1 north-mini-code codex zhipu hugging-face llama-cpp unsloth poolsideai cohere ollama openai cursor_ai claude cognition sparse-attention 1m-token-inference open-weight-models model-architecture long-context mixture-of-experts quantization local-deployment workflow-automation code-agents software-configuration-management automation-primitives security model-harness agentic-coding rasbt jeremyphoward matvelloso artificialanlys zixuanli_ _xjdr gneubig _catwu
    GLM-5.2 from Zhipu emerged as a leading open-weight model with innovative IndexShare sparse-attention enabling efficient 1M-token inference, praised as comparable to GPT-5.5 and Opus 4.8 but lacking vision support. Other notable open models include Laguna M.1 by Poolside AI, a 70-layer sparse MoE optimized for long-horizon coding, and North Mini Code by Cohere with 4-bit quantization and local deployment support via Ollama. The focus is shifting from standalone models to integrated systems combining model + harness + memory + SCM, exemplified by Noumena Code / ncode addressing challenges in concurrent code agent workflows. Automation tools like Codex Record & Replay, Cursor's /automate, and Artifacts in Claude Code enhance teachability, reusability, and security in AI-assisted coding workflows.
  • Jun 17
    Midjourney Medical: scan your organs like you step on a scale
    Midjourney unveiled a new medical imaging/scanning system called the Midjourney Scanner, described as radiation-free, magnet-free, fast, and low-cost, but requiring a water immersion tank and having coarser resolution than CT/MRI. The announcement included a technical dive and a physical demo, sparking enthusiasm and competitive comparisons with other AI hardware efforts. Technical speculation suggested future design directions involving distributed detectors and real-time imaging, highlighting Midjourney's ambitious hardware roadmap in medical imaging.
  • Jun 16
    GLM 5.2: the top Frontend Coding model in the world, IndexShare reduces costs
    glm-5.2 z.ai lmsys deepseek cloudflare openrouter ollama baseten deepinfra fireworks notion coding agentic-ai long-context mixture-of-experts sparse-attention speculative-decoding multi-token-prediction model-benchmarking inference-optimization mervenoyann sentdex scaling01 omarsar0 teortaxestex
    Z.ai released GLM-5.2, an MIT-licensed open-weight frontier model targeting coding and long-horizon agentic tasks with a 1M-token context window and two reasoning-effort modes. It features a 744B-parameter mixture-of-experts architecture with 40B active parameters per token, built on DeepSeek Sparse Attention extended by IndexShare, and supports improved multi-token prediction (MTP) for speculative decoding. The model achieved strong leaderboard placements, including #3 on FrontierSWE, #1 on Design Arena, and #1 open model on Agent Arena, with ecosystem support from platforms like Transformers, vLLM, SGLang, Cloudflare Workers AI, OpenRouter, Ollama Cloud, Baseten, DeepInfra, Fireworks, and Notion. Early testers praised its potential as a substitute for Opus/GPT-class workflows, though some called for further evaluation and long-horizon validation.
  • Jun 12
    not much happened today
    claude-fable-5 mythos-5 gpt-5.5 claude-code fable-5 codex opus-4.8 kimi-k2.7-code anthropic artificial-analysis datacurve moonshot model-sovereignty export-controls coding-agent-evaluation benchmarking benchmark-gaming harness-quality benchmark-saturation open-source-models natolambert theo cohere kunchenguid clementdelangue dejavucoder ofirpress ramplabs
    Anthropic suspended access to Claude Fable 5 and Mythos 5 due to US export controls, sparking a debate on model sovereignty and geopolitical risks for frontier AI vendors. Artificial Analysis updated its coding agent benchmark, replacing SWE-Bench Pro with DeepSWE, reshuffling rankings with Claude Code + Fable 5 [max] leading. Discussions highlighted the importance of harness quality versus pure model capability and concerns over benchmark saturation and realism. Additionally, Moonshot released the open-source model Kimi K2.7-Code.
  • Jun 11
    not much happened today
    claude-fable-5 nanogpt anthropic recursive-si nvidia model-governance model-transparency benchmarking automated-research optimization open-sourcing model-behavior cost-efficiency richard_socher
    Anthropic reversed its covert degradation policy on Claude Fable 5 after public backlash, sparking debates on governance, transparency, and access to frontier AI models. The model shows strong capabilities with mixed benchmark results, including 87.8% on WeirdML and top ranking on FrontierSWE, but practical usage highlights cost and inconsistent behavior. Separately, Recursive SI, led by Richard Socher, released an automated open-ended discovery system achieving state-of-the-art results on NVIDIA SOL-ExecBench, NanoGPT Speedrun, and NanoChat autoresearch, with open-sourced discoveries and improved efficiency metrics.
  • Jun 11
    not much happened today
    fable-5 mythos claude-fable-5 gpt-5.5-pro anthropic epoch-ai langchain export-control national-security agentic-capabilities model-neutrality harness observability trace-analysis evaluation-infrastructure behavioral-correction fine-tuning fchollet simonw hwchase17 nikesharora mignano sauvast rohit4verse dair_ai omarsar0
    Anthropic's Fable/Mythos export-control crisis dominates AI news, highlighting the intersection of national security and frontier model access. Technical voices like François Chollet criticize opaque regulatory actions and advocate for standardized benchmarks for agentic capabilities. Epoch AI reports Claude Fable 5 surpassing GPT-5.5 Pro on the Epoch Capabilities Index, underscoring tensions between cutting-edge AI and regulatory constraints. The concept of model neutrality is evolving from philosophy to architecture, emphasizing harness, context, memory, and routing for multi-model fungibility, with contributions from voices like hwchase17, Nikesh Arora, and mignano. Agent systems are transitioning from demos to production with a focus on observability, trace analysis, and evaluation infrastructure, exemplified by LangChain's LangSmith Engine and fine-tuned judges for behavioral correction signals. Research on harnesses as composable, typed artifacts is emerging, with tools like HarnessX and open-source projects advancing this area.
  • Jun 10
    not much happened today
    fable-5 mythos anthropic model-performance trust data-retention benchmarking agentic-ai coding policy darioamodei natolambert martin_casado drfeifei antirez clementdelangue deanwball hlntnr _arohan_ dbahdanau gergelyorosz scaling01 dbreunig omarsar0 yacinemtb mchlhess jasonbotterill lvwerra lechmazur kimmonismus walden_yan hrishioa
    Anthropic faced backlash for silently degrading AI research capabilities in its Fable/Mythos models without clear disclosure, raising concerns about trust, reproducibility, and enterprise data retention policies. Despite controversy, Fable 5 demonstrated strong benchmark performance, leading in agentic and coding tasks with high scores on Agent Arena, SimpleBench, CADGenBench, and PACT. Dario Amodei published a policy advocating stronger frontier AI oversight amid these tensions.
  • Jun 09
    Anthropic Claude Fable 5
    claude-fable-5 claude-mythos-5 claude-opus-4.8 gpt-5.5 anthropic cursor_ai cognition benchmarking software-engineering knowledge-work scientific-research vision context-windows model-pricing sdk rate-limiting mikeyk scaling01
    Anthropic released two major models: Claude Fable 5 for general availability and Claude Mythos 5 for restricted access, with fallback to Claude Opus 4.8 for sensitive queries. Fable 5 features a 1M-token context window and pricing at $10/million input tokens and $50/million output tokens. It leads benchmarks in software engineering, knowledge work, scientific research, and vision, outperforming GPT-5.5 and setting new state-of-the-art scores on CursorBench, FrontierCode, Terminal-Bench 2.1, and Artificial Analysis Intelligence Index. The rollout includes Pro, Max, Team, and Enterprise plans with temporary usage credits due to capacity constraints. Middleware SDK support is available in Python, TypeScript, Go, Java, and C#.
  • Jun 08
    not much happened today
    opus-4.8 gemma-4 cognition frontiercode moonshot google claudedevs magicpath langsmith modal coding-evaluation agent-control verification agent-ergonomics sandbox-environments local-inference workflow-optimization cli-tools plugin-integration persistent-memory swyx dzhng claudecode bcherny reach_vb omarsar0 gneubig hamelhusain angaisb_
    FrontierCode benchmark by Cognition highlights the challenge of coding tasks with the best model, Opus 4.8, scoring only about 13% on the hardest subset, indicating coding is less solved than benchmarks suggest. The trend toward using loops as a control metaphor for coding agents is prominent, with emphasis on clear goals, verification, and iteration, though some experts caution about overreliance on loops. Agent ergonomics are improving with observability dashboards, sandbox environments, and workflow tools from ClaudeDevs, MagicPath, LangSmith, and Modal. Kimi by Moonshot released major updates including a stronger coding agent and a desktop agent product supporting up to 300 local sub-agents. Google advanced efficient local deployment with upgrades to Gemma 4 checkpoints.
  • Jun 05
    not much happened today
    claude-mythos opus-4.8 opus-4.7 gpt-5.5 gemini-3.1-pro gemini-3.5-flash claude-opus-4.7 anthropic sakana-ai meta-ai-fair princeton recursive-self-improvement benchmarking agent-evaluation long-horizon-tasks reliability reinforcement-learning sample-efficiency economically-meaningful-tasks agent-coherence anti-reward-hacking tooling rl-environments kimmonismus lechmazur teortaxestex hardmaru andrew_n_carr steverab pauliusztin_
    Anthropic's Mythos/Opus cycle sparked mixed reactions with praise for Claude Mythos's one-shot workflows and concerns over Opus 4.8 benchmark regressions. Opus 4.7 showed strong chemistry task performance, "making Claude a chemist." Sakana AI launched an RSI Lab focusing on recursive self-improvement under compute constraints, marking RSI as a formal research program. New benchmarks like Agents' Last Exam (ALE) and SWE-Marathon test agents on long-horizon, economically meaningful tasks, revealing low pass rates and coherence challenges. Princeton's ICML 2026 paper found models like GPT-5.5, Gemini 3.1 Pro / 3.5 Flash, and Claude Opus 4.7 still lack meaningful reliability improvements. Tooling trends favor RL-environment-style frameworks for agent evaluation, exemplified by Meta's OpenEnv.
  • Jun 04
    not much happened today
    nemotron-3-ultra nemotron-3.5-asr claude-opus-4 mythos-preview nvidia anthropic togethercompute baseten modal vllm_project fireworksai_hq ollama wandb cline primeintellect nousresearch mixture-of-experts long-context model-quantization agentic-ai streaming-speech asr low-precision-training benchmarking recursive-self-improvement code-generation model-speedup piotrz_zelasko
    NVIDIA released Nemotron 3 Ultra, a fully open 550B MoE model with 55B active parameters and 1M context, optimized for long-running agent tasks with up to 5x speedup and 30% cost reduction. It features hybrid Mamba/attention, LatentMoE, native MTP, and was pretrained on 20T tokens using NVFP4 low-precision format. Benchmarks show strong performance with 47.7 Intelligence Index and 400+ output tokens/sec. The model is supported across major serving platforms. Additionally, Nemotron 3.5 ASR is an open streaming ASR model with 0.6B parameters, supporting 40 language-locale combinations and sub-100ms latency, designed for voice agents. Anthropic highlighted early signs of recursive self-improvement (RSI) in AI, with Claude models authoring 80%+ of merged code and engineers shipping 8x more code. Claude Opus 4 achieved 3x speedup on training scripts, while Mythos Preview reached ~52x speedup and provided better research suggestions than humans 64% of the time.
  • Jun 02
    not much happened today
    mai-thinking-1 mai-image-2.5 mai-code-1-flash gemma-4-12b microsoft google vllm-project ollama llama-cpp model-training reinforcement-learning model-architecture multimodality model-deployment model-efficiency fine-tuning on-device-ai eliebakouch nrehiew_ mustafasuleyman minjiyoon90 lateinteraction harold_matmul googlegemma googleaidevs mtschannen armandjoulin osanseviero
    Microsoft released the detailed technical report for MAI-Thinking-1, a generalist reasoning model trained without third-party distillation, achieving 97% on AIME 2025 and outperforming Sonnet 4.6 in human preference tests. The report was praised for transparency, revealing no synthetic data use, a unique scaling ladder recipe, and detailed training data composition including 50% code and 17.5% STEM. Microsoft also introduced Frontier Tuning for workflow-specific model adaptation, claiming efficiency gains up to 10× and GPT-5.4-level quality in Excel tasks, alongside new models like MAI-Image-2.5 and MAI-Code-1-Flash. Meanwhile, Google launched Gemma 4 12B, an Apache 2.0 multimodal model with an innovative encoder-free architecture designed for on-device use with 16GB VRAM, collapsing vision and audio encoders into the LLM backbone, receiving positive community feedback and immediate tooling support.
  • Jun 02
    Microsoft Build: MAI-Thinking-1 and MAI Family models, Surface RTX Spark Dev Box, and OpenClaw in Windows
    mai-thinking-1 mai-code-1-flash holo-3.1 qwen-35b sonnet-4.6 claude-code codex microsoft openrouter fal baseten hcompany_ai teksedge nous-research teknim cognition windsurf perplexity-ai mixture-of-experts context-windows benchmarking reinforcement-learning prompt-optimization agentic-ai local-inference model-family-expansion model-reporting agent-native-devices software-development model-optimization hybrid-inference desktop-agents model-quantization mustafasuleyman eliebakouch hannahajishirzi asadovsky bj2rn lateinteraction lakshyaaagrawal theturingpost kimmonismus yusuf_i_mehdi pierceboggan lukehoban nielsrogge russelljkaplan
    Microsoft introduced MAI-Thinking-1, a 35B parameter MoE model with 256K context, achieving 97% on AIME 2025 and outperforming Sonnet 4.6 in human preference tests. The broader 7-model MAI family spans reasoning, code, image, speech, and voice, with third-party availability on OpenRouter, fal, and Baseten. The detailed 109-page technical report revealed insights on scaling, MFU, RL/post-training, and data curation, highlighting no third-party distillation and advanced prompt optimization techniques. Microsoft emphasized agent-native devices and local inference with projects like Project Solara / Scout and the Surface RTX Spark Dev Box, alongside software innovations such as the Copilot desktop app and MAI-Code-1-Flash integration. Meanwhile, local-first computer-use agents like Holo 3.1 (Qwen-based, 0.8B to 35B parameters) support laptops and small workstations with optimized formats and strong benchmark results. Desktop shells for agents, including Hermes Desktop, Devin Desktop, and agent-neutral approaches compatible with Devin, Claude Code, and Codex, are proliferating, with hybrid local/cloud execution becoming the default architecture as seen in Perplexity Computer's hybrid agentic inference.
  • Jun 01
    not much happened today
    cosmos-3 nemotron-3-ultra minimax-m3 nvidia runway novita vercel cloudflare openclaude flowith omnimodal-models mixture-of-experts autoregressive-models diffusion-models structured-prompts fine-tuning open-weight-models multimodality agent-models benchmarking model-serving context-windows token-efficiency kimmonismus clementdelangue artificialanalysis scaling01 ctnzr caspar_br eliebakouch pbdtokenrouter rauchg gitlawb notjazii lostinlatencyx zhihufrontier
    NVIDIA led open-source AI model releases with Cosmos 3, a comprehensive omnimodal world model unifying language, image, video, audio, and action using a Mixture-of-Transformers design, and Nemotron 3 Ultra, a 550B parameter open-weight model noted for high serving speed and strong evaluation performance. The Cosmos Coalition was launched to foster an open ecosystem for physical AI world models. Meanwhile, MiniMax M3 debuted as a multimodal agent/coding model with 1M context and strong benchmark scores, gaining rapid ecosystem support from vendors like Novita and Vercel AI Gateway. However, MiniMax M3 showed some inefficiencies such as high token consumption and verbose self-check loops. These developments highlight advances in open physical AI, multimodality, and agent models with significant community and infrastructure engagement.
  • May 29
    not much happened today
    claude-opus-4.8 gpt-5.5 qwen kimi deepseek anthropic huggingface langchain vllm_project reinforcement-learning tokenization agentic-ai api model-optimization long-context rust performance-optimization multi-agent-systems prompt-engineering jeremyphoward leo_linsky clementdelangue johnschulman2 omarsar0 hwchase17 ofirpress scaling01
    Anthropic rolled out Claude Opus 4.8, which shows incremental improvements but mixed benchmark results, including better cooperation and coding behavior but some regressions in document parsing. Platform updates include mid-conversation system instructions enhancing long agent sessions, though API pricing remains a concern. A Hugging Face analysis revealed a critical bug in multi-turn reinforcement learning training loops involving tokenization mismatches, with a proposed "Token-In, Token-Out" fix. Agent harness design is evolving as a key optimization area, with LangChain's Deep Agents v0.6 achieving strong performance at much lower cost, and vllm_project releasing native weight syncing APIs and a Rust BPE tokenizer to improve tokenization efficiency. Debate continues on the value of multi-agent systems, with some seeing them as speedups and others expecting capability breakthroughs.
  • May 28
    Anthropic raises $65B in Series H at a $965B post-money valuation, releases Opus 4.8 and Dynamic Workflows
    claude-opus-4.8 claude-opus-4.7 gpt-5.5 anthropic altimeter dragoneer greenoaks sequoia andonlabs model-release reinforcement-learning agentic-ai model-evaluation long-context model-optimization fine-tuning multitasking parallel-processing dan_shipper scaling01 zephyr_z9 teortaxes_tex kimmonismus
    Anthropic announced a massive $65B Series H financing at a $965B valuation, led by Altimeter, Dragoneer, Greenoaks, and Sequoia, with run-rate revenue surpassing $47B. They launched Claude Opus 4.8, an update to Opus 4.7 featuring "sharper judgment," "more honesty," and longer autonomous work at the same price. Anthropic also introduced Dynamic Workflows in Claude Code, enabling orchestration of hundreds of parallel subagents for large tasks, available in research preview across multiple platforms. Opinions on Opus 4.8 vary, with some praising it as a major leap and others viewing it as incremental or catch-up to OpenAI's GPT-5.5 family.
  • May 26
    not much happened today
    qwen-3.7 claude-opus-4.6 gpt-5.5 mythos quest-2b-35b deepseek google-deepmind langchain-ai anthropic openai alibaba sakana-ai stanford oxford ai2 harness-engineering agent-infrastructure coding-benchmarks security-guidance long-horizon-memory context-compression sleep-phase math-problem-solving fact-seeking citation-grounding science-evaluation sebastienbubeck
    Harness engineering is emerging as the key differentiator for coding agents, emphasizing the stack of model + harness + eval loop over just stronger base models. DeepSeek is building a harness team to optimize interaction and verification loops, while Google's Gemini Managed Agents and LangChain formalize harness concepts like context governance and dynamic skill routing. New benchmarks like DeepSWE align closely with real developer experience, with Qwen3.7 Max and Claude Opus 4.6 showing strong agentic coding performance. Anthropic introduced a security-guidance plugin for Claude Code reducing security PR comments by 30–40%, and OpenAI highlighted GPT-5.5 in Codex for improved document parsing. In research, Claude Mythos solved Erdős problem #90 with a cleaner proof path than previous models, showing latent capabilities unlocked by appropriate harnesses. The paper "Language Models Need Sleep" proposes a sleep-like consolidation phase for long-horizon memory, addressing bottlenecks in persistent context storage. Open research agents like QUEST (2B–35B parameters) advance long-horizon fact-seeking and citation grounding, while the CUSP benchmark from Sakana/Stanford/Oxford/AI2 evaluates current model capabilities in science.
  • May 26
    not much happened today
    eagle-3.1 unigram-tokenizer qwen-3.5 deepseek-v4-pro mimo deep-agents-v0.6 397b-parameter-model eaglecorp vllm_project perplexity_ai alibaba lightseek nvidia mooncake flashattention kimmonismus deepseek xiaomi langchain baseten trajectory clay harvey decagon mercor rogo rlm inference-optimization long-context speculative-decoding tokenization attention-mechanisms kv-cache cache-hierarchy agent-engineering model-harness-memory-fit continual-learning quantization autoscaling memory-centric-agents evaluation-automation kimmonismus _luofuli vtrivedy10
    Inference optimization is increasingly architectural, with EAGLE 3.1 improving speculative decoding and long-context handling in collaboration with vLLM and TorchSpec. Perplexity open-sourced a rebuilt Unigram tokenizer cutting CPU use by 5–6× and achieving 63 µs at 514 tokens. Qwen3.5 hits 580 tokens/s via joint efforts from Alibaba, LightSeek, NVIDIA, Mooncake, and FlashAttention-4 contributors. Price cuts in APIs from Chinese labs are sustainable due to structural KV-cache and attention improvements, exemplified by DeepSeek V4-Pro and Xiaomi MiMo reducing caching costs significantly. Agent engineering shifts focus from model quality to model-harness-memory fit, with LangChain releasing Deep Agents v0.6 and tools like LangSmith Engine automating evaluation loops. Trajectory launched a continual learning platform with $15M funding and partners like Clay and Harvey, supporting large models including a 397B-parameter model deployed on autoscaled H100 infrastructure. Open-source memory-centric agents and minimal training harnesses also gained attention.
  • May 21
    not much happened today
    raev2 gated-deltanet-2 kda mamba-3 dclm nvidia openai nous-research representation-learning tokenization linear-attention long-context mechanistic-interpretability math data-filtering agent-infrastructure language-modeling commonsense-reasoning 1jaskiratsingh recatm sainingxie ahatamiz1 rasbt nousresearch tatsu_hashimoto goodfireai markchen90 wtgowers memecrashes cloneofsimo lvwerra
    RAEv2 advances representation-first tokenization with >10x faster convergence and improved generation, tested on text-to-image and world models. NVIDIA's Gated DeltaNet-2 innovates linear attention with channel-wise gates, outperforming KDA and Mamba-3 at 1.3B parameters on language modeling and reasoning tasks. Studies on subword tokenization reveal only some benefits at scale, while data filtering research suggests that with enough compute, no filtering may be optimal at around 1e30 FLOPs. Mechanistic interpretability updates propose clustering features by joint firing patterns for better geometry understanding. OpenAI's AI-assisted breakthrough on an Erdős unit-distance math problem sparks debate on AI's role in mathematical research. Harnesses remain key for capability improvements in agent infrastructure.
  • May 18
    not much happened today
    claude-code codex composer-2.5 langchain cognition anthropic openai microsoft cursor agent-automation agent-observability ci-cd prompt-caching remote-execution verification decomposition feedback-loops coding-agents model-efficiency instruction-following krishdpi walden_yan russelljkaplan fchollet gabriberton palashshah shannholmberg
    Agent infrastructure is advancing with LangSmith Engine providing CI/CD loops for agents and SmithDB enabling low-latency querying for observability. Cognition's Devin Auto-Triage offers persistent automation for bug triage with memory and subagent structures. Anthropic improves Claude Code for large codebases with prompt cache diagnostics and faster modes, while OpenAI enhances Codex workflows with remote execution and plugins. Microsoft released remote control for GitHub Copilot CLI and VS Code. The community emphasizes verification, decomposition, and feedback loops over prompt cleverness for coding agents. Cursor's Composer 2.5 is highlighted as a strong new coding model, with plans for a larger model trained with SpaceXAI using 10× more compute on Colossus 2 hardware, praised for efficiency and collaboration improvements.
  • May 18
    Google I/O 2026: Gemini 3.5 Flash, Omni, and Google’s Agent Stack
    gemini-3.5-flash gemini-3.1-pro gemini-3.5 gemini-omni google google-deepmind geminiapp agentic-ai multimodality video-generation model-performance benchmarking context-windows model-optimization model-scaling instruction-following api model-efficiency cost-analysis philschmid jeffdean
    Google announced at I/O the repositioning of Gemini as a consumer AI and developer/agent platform with three key releases: Gemini 3.5 Flash for fast agentic and coding tasks, Gemini Omni for multimodal generation and editing including video, and the expanded Antigravity 2.0 agent stack. Google reports processing over 3.2 quadrillion tokens per month, a 7x increase year-over-year, with 900M+ monthly Gemini users across 230+ countries and 70+ languages. Gemini 3.5 Flash features a 1M-token context window, 65k max output tokens, 4 thinking levels, and "thought preservation" across turns, outperforming Gemini 3.1 Pro on multiple benchmarks and running up to 12x faster in Antigravity. Independent benchmarks show Gemini 3.5 Flash scoring 55 on the Intelligence Index, with higher costs than previous versions. Gemini Omni Flash supports text, image, video, and audio inputs for generative media tasks, available now for paid users.
  • May 15
    not much happened today
    openai-5.4 openai-5.5 cerebras openai inference model-serving compute-scarcity model-routing hardware-architecture trillion-parameter-models ishanit5 dee_bosa apoorv03 bob_komin
    Cerebras made headlines with its IPO, marking a significant milestone for the company known for its contrarian hardware approach. Cerebras CFO Bob Komin emphasized the company's capability to serve trillion-parameter models, including internal OpenAI 5.4 and 5.5 models, pushing back against the notion that Cerebras only supports small models. Investor Ishan N. Taneja praised Cerebras for its persistence and execution, calling their chip a "banger." The IPO is seen as a validation of Cerebras's long-term strategy in inference infrastructure, highlighting themes like compute scarcity, inference demand, and model routing.
  • May 14
    not much happened today
    codex chatgpt openai github microsoft nous-research moonshot-ai langchain prime-intellect agent-infrastructure agent-first-ux remote-ssh programmatic-access-tokens sandboxing continual-learning agent-trace-data multi-agent-workflows ide-integration browser-extensions hwchase17 caspar_br bentannyhill jakebroekhuizen willccbb
    OpenAI expanded Codex integration with the ChatGPT mobile app enabling remote task management and introduced Remote SSH, hooks, and programmatic tokens for enterprise automation. The IDE ecosystem is shifting to "agent-first" UX with GitHub Copilot App preview and VS Code launching a multi-agent workflow window. Open-source agents like Nous/Hermes integrated Codex runtime, and Kimi released a web bridge extension supporting multiple coding agents. LangChain released significant agent infrastructure including SmithDB for agent trace data and LangSmith Engine for trace analysis and continual learning, launching LangChain Labs to improve agents via production trace feedback loops.
  • May 13
    not much happened today
    claude codex langsmith-engine smithdb duet-agent multi-stream-llm delta-mem star-elastic cline langchain notion cursor nous-research nvidia datology agent-infrastructure developer-platforms observability long-running-state streaming orchestration pretraining-efficiency model-architecture external-memory post-training-compression data-curation vision-language-models jonas_geiping siddharth_joshi pratyush_maini
    Cline, LangChain, Notion, and Cursor advanced agent infrastructure and developer platforms with innovations like Cline SDK, LangSmith Engine, SmithDB (offering 12–15× faster observability), and Notion's External Agents API integrating third-party agents such as Claude and Codex. Agent UX trends emphasize long-running state, streaming, and orchestration over chat, with tools like Duet Agent and VS Code Agents window enhancing durable execution and inspectable states. Research highlights include Nous Research's Token Superposition Training achieving 2–3× speedup in pretraining, a multi-stream LLM architecture for parallel reasoning by Jonas Geiping et al., and δ-mem external memory improving benchmark scores. NVIDIA's Star Elastic offers post-training model compression at 360× lower cost than pretraining, while Datology focuses on data curation for vision-language models.
  • May 12
    not much happened today
    gemini-3.1-pro gpt-5.5 opus-4.7-xhigh agent-moderncolbert google-deepmind lighton nous-research research-benchmarks math medical-benchmarks agentic-systems program-synthesis retrieval-augmentation training-optimization superoptimization scaling-laws training-efficiency gpu-optimization attention-mechanisms soohak polynoamial torchcompiled leloykun che_shr_cat jjitsev omarsar0
    Research-level reasoning benchmarks are advancing with 439 new math problems from 64 mathematicians and expanded medical benchmarks in Medmarks v1.0 covering 30 benchmarks and 61 models. Google DeepMind's AI Co-Mathematician achieves 48% on FrontierMath Tier 4, while Gemini 3.1 Pro improves physics benchmark scores significantly. GPT-5.5 high/xhigh outperforms Opus 4.7 xhigh on program synthesis tasks. Retrieval benchmarks favor smaller models like LightOn's Agent-ModernColBERT with 149M parameters. Training optimization advances include SOAP/Muon-style updates reducing training steps, and a Lean4-to-TileLang superoptimizer achieving 1.8× speedup on A100 GPUs. Scaling laws are reconsidered with arguments for measuring in bytes rather than tokens. New training-time efficiency methods like Lighthouse Attention enable subquadratic training wrappers removable before deployment.
  • May 11
    not much happened today
    gpt-5.5 codex thinking-machines openai anthropic multimodality real-time-interaction visual-proactivity deployment cybersecurity threat-modeling automation continuous-audio-video-text-processing security-models field-engineering enterprise-ai johnschulman2 soumithchintala chillee liliyu_lili rown kimmonismus giffmana swyx eliebakouch gdb sama therundownai lukolejnik matvelloso
    Thinking Machines previewed new native interaction models designed for full-duplex multimodal interaction: listening, speaking, watching, thinking, searching, and reacting concurrently in real time, a shift beyond turn-based AI. The approach emphasizes continuous audio, video, and text processing, with innovations like visual proactivity and background tool use, implemented using SGLang. Meanwhile, OpenAI announced the OpenAI Deployment Company, a new unit with 150 Forward Deployed Engineers and a $4B initial investment to help enterprises deploy frontier models, signaling a move into the deployment layer of the AI economy. OpenAI also launched Daybreak, a security-focused initiative integrating GPT-5.5 and Codex for cyber defense, threat modeling, and automated patching, with differentiated access tiers including GPT-5.5-Cyber. This contrasts with Anthropic's more restrictive cyber approach, highlighting tensions in AI security strategies.
  • May 08
    not much happened today
    gpt-5.5 gpt-image-2 gpt-5.5-pro gpt-5.5-instant gpt-realtime-2 gpt-5.5-cyber codex zaya1-74b-preview zaya1-vl-8b qwen3-omni openai zyphra amd deepseek vllm_project model-release model-training mixture-of-experts inference model-optimization sandboxing alignment cybersecurity agent-runtime throughput quantization telemetry real-time-detection reach_vb dhh gdb patience_cave ithilgore cryps1s sama deredleritt3r
    OpenAI rapidly expanded the GPT-5.5 family with multiple variants, including gpt-image-2, GPT-5.5 Pro, and GPT-5.5-Cyber, receiving positive feedback for efficiency and usability. Codex evolved into a long-running agent runtime with a new /goal mechanism, achieving 61% success on ARC-AGI-3 games after extensive testing. OpenAI also introduced cybersecurity-focused models like GPT-5.5-Cyber targeting enterprise and government sectors. Meanwhile, Zyphra released the open model ZAYA1-74B-Preview, a 74B-parameter mixture-of-experts model trained on AMD hardware and released under the Apache 2.0 license, alongside a vision-language model, ZAYA1-VL-8B. Inference infrastructure competition intensified with vLLM updates improving throughput and latency, including support for DeepSeek V4 and enhanced quantization/backends.
  • May 07
    GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs
    gpt-realtime-2 gpt-5.5 codex openai anthropic goodfireai scale-ai voice-models streaming-translation transcription benchmarking context-windows browser-automation cybersecurity interpretability neural-geometry manifolds ai-safety rlhf micahcarroll milesbrundage ryanpgreenblatt
    OpenAI released GPT-Realtime-2, a voice model with GPT-5-class reasoning, tool use, interruption handling, and extended context windows up to 128K tokens, achieving top scores on Big Bench Audio and Conversational Dynamics benchmarks. They also launched a Chrome plugin for Codex enabling browser control and multitasking, and introduced GPT-5.5 with Trusted Access for Cyber for secure defensive workflows and red teaming. Anthropic introduced Natural Language Autoencoders for interpreting model activations as human-readable text, aiding interpretability and debugging, while Goodfire proposed a neural geometry research agenda focusing on manifolds as primitives for neural network behavior. Anthropic also announced The Anthropic Institute to advance AI safety and economic resilience research.
  • May 06
    Anthropic-SpaceXai's 300MW/$5B/yr deal for Colossus I, ARR growth is 8000% annualized
    claude claude-code opus colossus-1 anthropic spacex x-ai compute rate-limiting agent-platforms inference api managed-agents safety governance event nottombrown _aidan_clark_ kipperrii theamolavasare alexalbert__
    Anthropic announced a new SpaceX compute partnership to significantly increase capacity for Claude products, doubling Claude Code's 5-hour rate limits for Pro, Max, Team, and Enterprise users, removing peak-hour limit reductions, and substantially increasing API rate limits for Opus models. The deal grants Anthropic access to Colossus 1 via SpaceXAI, with Claude inference expected to ramp up on Colossus soon. Anthropic also hosted a "Code with Claude" event featuring updates on Claude Code, GitHub-scale usage, and managed agents. Discussions highlighted compute bottlenecks, user reactions to limit changes, debates on managed-agent features, and ongoing safety/governance discourse around AGI trustworthiness.

Let's Connect

If you want to get in touch with me about something or just to say hi, reach out on social media or send me an email.

  • GitHub
  • X (@smol_ai)
  • swyx at smol dot ai
© 2026 • AINews
You can also subscribe by RSS.