All tags
Person: "corbtt"
ChatGPT Atlas: OpenAI's AI Browser
gemini atlas openai google langchain ivp capitalg sapphire sequoia benchmark agent-mode browser-memory chromium finetuning moe lora agent-runtime observability software-development funding kevinweil bengoodger fidjissimo omarsar0 yuchenj_uw nickaturley raizamrtn hwchase17 bromann casper_hansen_ corbtt
OpenAI launched Atlas, a Chromium-fork AI browser for macOS, featuring an integrated Agent mode and browser memory with local login capabilities, positioning it against Google's Gemini integration in Chrome. The launch received mixed reactions regarding reliability and privacy. LangChain raised a $125M Series B at a $1.25B valuation and released its v1.0 agent engineering stack, with significant adoption including 85M+ OSS downloads/month and usage by ~35% of the Fortune 500. The ecosystem also saw updates such as vLLM's support for finetuning MoE LoRA experts.
not much happened today
7m-tiny-recursive-model jamba-reasoning-3b qwen3-omni qwen-image-edit-2509 colbert-nano agentflow samsung lecuun ai21-labs alibaba coreweave weights-biases openpipe stanford recursive-reasoning density-estimation multimodality long-context retrieval serverless-reinforcement-learning agentic-systems model-efficiency reinforcement-learning transformers rasbt jm_alexia jiqizhixin randall_balestr corbtt shawnup _akhaliq
Samsung's 7M-parameter Tiny Recursive Model (TRM) achieves superior reasoning on ARC-AGI and Sudoku with fewer layers and an MLP replacing self-attention. LeCun's team introduces JEPA-SCORE, enabling density estimation from encoders without retraining. AI21 Labs releases Jamba Reasoning 3B, a fast hybrid SSM-Transformer model supporting up to 64K context tokens. Alibaba's Qwen3 Omni/Omni Realtime offers a unified audio-video-text model with extensive language and speech support, outperforming Gemini 2.0 Flash on BigBench Audio. Alibaba also debuts Qwen Image Edit 2509, a top open-weight multi-image editing model. ColBERT Nano models demonstrate effective retrieval at micro-scale parameter sizes. In reinforcement learning, CoreWeave, Weights & Biases, and OpenPipe launch serverless RL infrastructure that reduces costs and speeds up training. Stanford's AgentFlow presents an in-the-flow RL system with a 7B backbone outperforming larger models on agentic tasks. This update highlights advances in recursive reasoning, density estimation, multimodal architectures, long-context modeling, retrieval, and serverless reinforcement learning.
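The core idea behind a tiny recursive model is weight reuse: rather than stacking many distinct layers, one small block is applied repeatedly to refine a latent state. A minimal sketch of that loop, where the dimensions, update rule, and step count are all illustrative assumptions rather than the paper's details:

```python
import math
import random

# Hedged sketch of recursive refinement: a single small MLP (standing in for
# self-attention, as the summary describes) is applied repeatedly with the
# SAME weights, refining a latent state h given a fixed input embedding x.
random.seed(0)
DIM = 8
W1 = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(DIM)]
W2 = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(DIM)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def mlp(h):
    # plain two-layer MLP block, no attention
    return matvec(W2, [math.tanh(x) for x in matvec(W1, h)])

def recursive_refine(x, steps=8):
    h = [0.0] * DIM
    for _ in range(steps):                 # reuse the same weights each step
        u = [a + b for a, b in zip(h, x)]  # condition on the input embedding
        h = [a + b for a, b in zip(h, mlp(u))]  # residual refinement
    return h
```

Depth here comes from iteration count rather than parameter count, which is why such a model can stay at a few million parameters.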
Anthropic raises $13B at $183B Series F
claude-code gpt-5 grok-4 claude sonnet-4 glm-4.5 deepseek-r1 anthropic mistral-ai x-ai salesforce galileo openpipe zhipu thudm enterprise-connectors agent-benchmarking reinforcement-learning inference-optimization memory-optimization cuda multi-token-prediction speculative-decoding tensor-offload performance-optimization real-time-guardrails cost-optimization swyx emilygsands _philschmid _lewtun omarsar0 _avichawla corbtt
Anthropic achieved a $183B post-money valuation in Series F funding by September 2025, growing from about $1B run-rate in January to over $5B run-rate by August 2025. Their Claude Code product saw >10x usage growth in three months and reached $500M run-rate revenue, serving over 300,000 business customers with a nearly 7x increase in large accounts. Mistral AI launched Le Chat with 20+ MCP connectors integrating with major SaaS platforms and persistent memory features. Benchmarking updates highlight GPT-5 leading agent intelligence indices, with strong performances from xAI's Grok and Anthropic's Claude families. Reliability tooling and agent evaluation advances were shared by Galileo, OpenPipe, and others. Zhipu/THUDM open-sourced Slime v0.1.0, enhancing RL infrastructure behind GLM-4.5 with significant decoding speed improvements and advanced tensor offload techniques.
not much happened today
grok-2 grok-2.5 vibevoice-1.5b motif-2.6b gpt-5 qwen-code xai-org microsoft motif-technology alibaba huggingface langchain-ai mixture-of-experts model-scaling model-architecture text-to-speech fine-tuning training-data optimization reinforcement-learning agentic-ai tool-use model-training model-release api software-development model-quantization elonmusk clementdelangue rasbt quanquangu akhaliq eliebakouch gdb ericmitchellai ivanfioravanti deanwball giffmana omarsar0 corbtt
xAI released open weights for Grok-2 and Grok-2.5 with a novel MoE residual architecture and μP scaling, sparking community excitement and licensing concerns. Microsoft open-sourced VibeVoice-1.5B, a multi-speaker long-form TTS model with streaming support and a 7B variant forthcoming. Motif Technology published a detailed report on Motif-2.6B, highlighting Differential Attention, PolyNorm, and extensive finetuning, trained on AMD MI250 GPUs. In coding tools, momentum builds around GPT-5-backed workflows, with developers favoring it over Claude Code. Alibaba released Qwen-Code v0.0.8 with deep VS Code integration and MCP CLI enhancements. The MCP ecosystem advances with LiveMCP-101 stress tests, the universal MCP server "Rube," and LangGraph Platform's rollout of revision queueing and ART integration for RL training of agents.
Qwen-Image: SOTA text rendering + 4o-imagegen-level Editing Open Weights MMDiT
qwen-image mmdit gemini-2.5 o3-pro seedprover glm-4.5 xbai-o4 hunyuan alibaba google-deepmind openai bytedance kaggle tencent bilingual-text-rendering image-generation image-editing synthetic-data reasoning math-theorem-proving benchmarking instruction-following model-efficiency open-weight-models model-transparency competitive-evaluation swyx demishassabis tulseedoshi mparakhin teortaxestex cgeorgiaw dorialexander steph_palazzolo corbtt synthwavedd epochairesearch
Alibaba surprised with the release of Qwen-Image, a 20B MMDiT model excelling at bilingual text rendering and graphic poster creation, with open weights and demos available. Google DeepMind launched Gemini 2.5 Deep Think to Ultra subscribers, showing significant reasoning improvements and benchmark gains (+11.2% AIME, +13.2% HLE, +13.4% LiveCodeBench) rivaling OpenAI's o3 Pro. ByteDance's SeedProver achieved state-of-the-art math theorem proving results, surpassing DeepMind's AlphaGeometry2. OpenAI is reportedly developing a "universal verifier" to transfer gains in math and coding to other domains. Competitive reasoning benchmarks and game arenas from Google and Kaggle highlight a meta-shift in reasoning model efficiency, comparable to the original Transformer leap. Other open-weight models gaining momentum include GLM-4.5, XBai o4, and Tencent Hunyuan, with a focus on efficient training. "Qwen is all you need."
Figma's $50+b IPO
horizon-alpha gpt-5 gemini-2.5-pro qwen3-coder qwen3-coder-flash-30b-a3b command-a-vision gpt-4.1 llama-4-maverick flux-1-krea-dev glm-4.5 voxtral openai openrouter alibaba unslothai cohere huggingface black-forest-labs diffusers ostrisai zhipu-ai together-ai mistral-ai reasoning svg-generation agentic-ai context-windows vision fine-tuning inference-time-training model-generalization open-models technical-reports scaling01 teortaxestex huybery nickfrosst aidangomez reach_vb zai_org corbtt jxmnop teknuim1
OpenAI's stealth model horizon-alpha on OpenRouter sparks speculation that it is a precursor to GPT-5, showing strong reasoning and SVG generation capabilities comparable to Gemini 2.5 Pro. Alibaba released the Qwen3-Coder family, including a fast Qwen3-Coder-Flash (30B-A3B) variant with agentic features and 1M context length support via UnslothAI. Cohere launched Command A Vision, a 111B parameter open-weights vision-language model outperforming GPT-4.1 and Llama 4 Maverick on enterprise benchmarks. Black Forest Labs introduced FLUX.1 Krea [dev], an open-weights photorealism model compatible with fine-tuning tools like diffusers and ostrisai. Zhipu AI unveiled GLM-4.5, a hybrid reasoning open model with agentic capabilities available on Together AI. Discussions highlight the rising importance of inference-time training and reasoning model generalization. Mistral AI released the technical report for Voxtral, continuing its open science efforts.
not much happened today
glm-4.5 glm-4.5-air qwen3-coder qwen3-235b kimi-k2 grok-imagine wan-2.2 smollm3 figure-01 figure-02 vitpose++ chatgpt zhipu-ai alibaba moonshot-ai x-ai figure openai runway mlx ollama deeplearningai model-releases model-performance moe image-generation video-generation pose-estimation robotics training-code-release interactive-learning in-context-learning yuchenj_uw corbtt reach_vb ollama deeplearningai gdb sama c_valenzuelab adcock_brett skalskip92 loubnabenallal1 hojonathanho ostrisai
Chinese AI labs have released powerful open-source models like GLM-4.5 and GLM-4.5-Air from Zhipu AI, Qwen3 Coder and Qwen3-235B from Alibaba, and Kimi K2 from Moonshot AI, highlighting a surge in permissively licensed models. Zhipu AI's GLM-4.5 is a 355B parameter MoE model competitive with Claude 4 Opus and Gemini 2.5 Pro. Alibaba's Qwen3 Coder shows strong code generation performance with a low edit failure rate, while Moonshot AI's Kimi K2 is a 1 trillion-parameter MoE model surpassing benchmarks like LiveCodeBench. In video and image generation, xAI launched Grok Imagine, and Wan2.2 impressed with innovative image-to-video generation. Robotics advances include Figure's Figure-01 and Figure-02 humanoid robots and ViTPose++ for pose estimation in basketball analysis. SmolLM3 training and evaluation code was fully released under Apache 2.0. OpenAI introduced Study Mode in ChatGPT to enhance interactive learning, and Runway rolled out Runway Aleph, a new in-context video model for multi-task visual generation. The community sees organizations avoiding these Chinese open-source models as falling behind: "Orgs avoiding these models are at a significant competitive disadvantage," as @corbtt noted.
not much happened today
glm-4.5 glm-4.5-air qwen3-coder qwen3-235b kimi-k2 wan-2.2 grok-imagine smollm3 figure-01 figure-02 vitpose++ zhipu-ai alibaba moonshot-ai x-ai ideogram figure smollm openai model-releases moe model-benchmarking image-generation video-generation pose-estimation robotics training-code-release apache-license yuchenj_uw corbtt cline reach_vb ollama deeplearningai ostrisai hojonathanho adcock_brett skalskip92 loubnabenallal1
Chinese labs have released a wave of powerful, permissively licensed models in July, including Zhipu AI's GLM-4.5 and GLM-4.5-Air, Alibaba's Qwen3 Coder and Qwen3-235B, and Moonshot AI's Kimi K2. These models feature large-scale Mixture of Experts architectures with active parameters ranging from 3B to 32B and context windows up to 256K tokens. Zhipu AI's GLM-4.5 competes with Claude 4 Opus and Gemini 2.5 Pro in benchmarks. Moonshot AI's Kimi K2 is a 1 trillion-parameter MoE model surpassing other open-weight models on LiveCodeBench and AceBench. In video and image generation, xAI launched Grok Imagine, and Wan2.2 impressed with its Image-to-Video approach. Ideogram released a character consistency model. Robotics advances include Figure's Figure-01 and Figure-02 humanoid robots and ViTPose++ for pose estimation in basketball analysis. The SmolLM3 training and evaluation code was fully released under an Apache 2.0 license. "Orgs avoiding these Chinese open-source models are at a significant competitive disadvantage," as @corbtt noted.
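The gap between total and active parameters in these MoE models follows from top-k routing: each token activates only k of E experts, plus the shared (non-expert) weights. A back-of-the-envelope sketch, where every figure is an illustrative assumption rather than a published config:

```python
# Rough sketch of total-vs-active parameter accounting in a top-k MoE.
# All numbers below are illustrative assumptions, not the real configs.

def moe_params(shared: float, num_experts: int, expert_size: float, top_k: int):
    """Return (total, active) parameter counts in billions."""
    total = shared + num_experts * expert_size   # all experts stored
    active = shared + top_k * expert_size        # only k experts run per token
    return total, active

# e.g. 10B shared params, 128 experts of 2.7B each, top-2 routing
total, active = moe_params(shared=10.0, num_experts=128, expert_size=2.7, top_k=2)
# total is ~355.6B while active is ~15.4B -- the same order of magnitude
# as the "hundreds of billions total, 3B-32B active" models described above
```

This is why a 355B-class or even 1T-class MoE can serve tokens at roughly the cost of a dense model a tenth its size.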
not much happened today
grok-4 jamba ernie-4.5 claude-4-sonnet claude-4 kontext-dev ai21-labs hugging-face baidu perplexity-ai deepmind anthropic reinforcement-learning fine-tuning energy-based-transformers ssm-transformer context-windows length-generalization recurrent-neural-networks attention-mechanisms 2-simplicial-attention biomedical-ai instruction-following open-weight-models python-package-management _philschmid corbtt jxmnop sedielem _akhaliq slashml alexiglad clementdelangue _albertgu tri_dao theaitimeline deep-learning-ai
Over the holiday weekend, key AI developments include the upcoming release of Grok 4, Perplexity teasing new projects, and community reactions to Cursor and Dia. Research highlights feature a paper on Reinforcement Learning (RL) improving generalization and reasoning across domains, contrasting with Supervised Fine-Tuning's forgetting issues. Energy-Based Transformers (EBTs) are proposed as a promising alternative to traditional transformers. AI21 Labs updated its Jamba model family with enhanced grounding and instruction following, maintaining a 256K context window. Baidu open-sourced its massive 424 billion parameter Ernie 4.5 model, while Kontext-dev became the top trending model on Hugging Face. Advances in length generalization for recurrent models and the introduction of 2-simplicial attention were noted. In biomedical AI, Biomni, powered by Claude 4 Sonnet, demonstrated superior accuracy and rare disease diagnosis capabilities. Additionally, the Python package manager uv received praise for improving Python installation workflows.
Bartz v. Anthropic PBC — "Training use is Fair Use"
claude gemini-robotics-on-device anthropic replit delphi sequoia thinking-machines-lab disney universal midjourney google-deepmind fair-use copyright reinforcement-learning foundation-models robotics funding lawsuit digital-minds model-release andrea_bartz giffmana andrewcurran_ amasad swyx hwchase17 krandiash daraladje steph_palazzolo corbtt demishassabis
Anthropic won a significant fair use ruling allowing the training of Claude on copyrighted books, setting a precedent for AI training legality despite concerns over pirated data. Replit achieved a major milestone with $100M ARR, showing rapid growth. Delphi raised $16M Series A to scale digital minds, while Thinking Machines Lab focuses on reinforcement learning for business applications. Disney and Universal sued Midjourney over unauthorized use of copyrighted images. Google DeepMind released Gemini Robotics On-Device, a compact foundation model for robotics.
not much happened to end the year
deepseek-v3 code-llm o1 sonnet-3.5 deepseek smol-ai reinforcement-learning reasoning training-data mixed-precision-training open-source multimodality software-development natural-language-processing interpretability developer-tools real-time-applications search sdk-generation corbtt tom_doerr cognitivecompai alexalbert__ theturingpost svpino bindureddy
Reinforcement Fine-Tuning (RFT) is introduced as a data-efficient method to improve reasoning in LLMs using minimal training data with strategies like First-Correct Solutions (FCS) and Greedily Diverse Solutions (GDS). DeepSeek-V3, a 671B parameter MoE language model trained on 14.8 trillion tokens with FP8 mixed precision training, highlights advances in large-scale models and open-source LLMs. Predictions for AI in 2025 include growth in smaller models, multimodality, and challenges in open-source AI. The impact of AI on software development jobs suggests a need for higher intelligence and specialization as AI automates low-skilled tasks. Enhancements to CodeLLM improve coding assistance with features like in-place editing and streaming responses. Natural Language Reinforcement Learning (NLRL) offers better interpretability and richer feedback for AI planning and critique. AI hiring is growing rapidly with startups seeking strong engineers in ML and systems. New AI-powered tools such as Rivet, Buzee, and Konfig improve real-time applications, search, and SDK generation using technologies like Rust and V8 isolates.
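The First-Correct Solutions (FCS) strategy mentioned above can be sketched as a simple filter over sampled solutions: for each problem, keep only the first sample that passes a correctness check, yielding a small, high-signal fine-tuning set. The data shapes and checker interface here are assumptions for illustration:

```python
# Hedged sketch of First-Correct Solutions (FCS) selection. The dict layout
# and the is_correct callback are illustrative assumptions, not the paper's API.

def first_correct_solutions(samples, is_correct):
    """samples: {problem_id: [solution, ...]} in sampling order.
    Returns {problem_id: solution} keeping only the first correct sample."""
    selected = {}
    for pid, solutions in samples.items():
        for sol in solutions:
            if is_correct(pid, sol):
                selected[pid] = sol
                break  # FCS: stop at the first correct sample
    return selected

data = {"p1": ["wrong", "right-a", "right-b"], "p2": ["wrong", "wrong"]}
chosen = first_correct_solutions(data, lambda pid, s: s.startswith("right"))
# p1 keeps "right-a"; p2 contributes nothing to the training set
```

A Greedily Diverse Solutions (GDS) variant would instead keep several correct samples per problem while penalizing near-duplicates, trading dataset size for diversity.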
not much happened today + AINews Podcast?
superforecaster-ai llama-3 reflection-70b glean sambanova cerebras stanford google apple hugging-face lmsys prompt-engineering research-ideas inference-speed retrieval-augmented-generation evaluation-methods visual-intelligence on-device-ai model-performance benchmarking novelty-detection danhendrycks benjamin-clavie bclavie bindureddy swyx borismpower corbtt drjimfan clementdelangue rohanpaul_ai
Glean doubled its valuation again. Dan Hendrycks' Superforecaster AI generates plausible election forecasts with interesting prompt engineering. A Stanford study found that LLM-generated research ideas are statistically more novel than those by expert humans. SambaNova announced faster inference for llama-3 models, surpassing Cerebras. Benjamin Clavie gave a notable talk on retrieval-augmented generation techniques. Strawberry is reported to launch in two weeks. Google Illuminate offers AI-generated podcast discussions about papers and books. Apple unveiled new AI features in iOS 18, including visual intelligence and improved Siri, with on-device and cloud processing for camera-based event additions. The Reflection 70B model sparked controversy over performance claims. Experts highlighted the unreliability of traditional benchmarks like MMLU and HumanEval, recommending alternative evaluation methods such as LMSys Chatbot Arena and Hugging Face's open-sourced Lighteval suite. The AI research community continues to explore AI's role in generating novel research ideas and improving benchmarking.
not much happened today
gpt-4-0613 gpt-3.5-turbo-0613 gpt-4o-2024-08-06 mistral-large-2 gpt4-turbo claude-3-opus idefics3-llama bigllama-3.1-1t-instruct llama-3-120b-instruct openai mistral-ai meta-ai-fair structured-outputs function-calling json-schema benchmarking multimodality context-windows model-scaling ai-hardware vision speech-processing robotics ai-regulation sama rohanpaul_ai corbtt guillaumelample mervenoyann maximelabonne aidan_mclau adcock_brett ylecun
OpenAI introduced structured outputs in their API with a new "strict" mode and a "response_format" parameter, supporting models like gpt-4-0613, gpt-3.5-turbo-0613, and the new gpt-4o-2024-08-06. They also halved the price of gpt-4o to $2.50 per million tokens. Mistral Large 2 outperforms gpt4-turbo and claude-3-opus on hard benchmarks and coding tasks. Idefics3-Llama offers multimodal capabilities with a 10k token context window. BigLlama-3.1-1T-Instruct is an upscaled version of llama-3-120b-instruct. The new benchmark "big_model_smell" measures creativity and reliability. The Figure 02 robot features advanced AI hardware with an onboard vision-language model, enhanced battery, and speech-to-speech reasoning. Yann LeCun expressed concerns about California's SB1047 regulation.
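As a sketch of how such a structured-outputs request is assembled: the `response_format` parameter carries a JSON Schema, and `"strict": True` asks the model to constrain its output to that schema. The schema contents and field names below are illustrative assumptions; only the `response_format` shape and strict mode come from the announcement:

```python
# Hedged sketch: building a chat-completions payload that requests
# schema-constrained JSON output via the new "strict" structured-outputs mode.
# The weather schema is a made-up example, not from the announcement.

def build_structured_request(model: str, prompt: str) -> dict:
    schema = {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "temperature_c": {"type": "number"},
        },
        "required": ["city", "temperature_c"],
        "additionalProperties": False,  # strict mode requires a closed schema
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "weather_report",
                "strict": True,  # output must conform exactly to the schema
                "schema": schema,
            },
        },
    }

payload = build_structured_request("gpt-4o-2024-08-06", "Weather in Paris?")
```

Sending this payload to the chat completions endpoint would, under strict mode, guarantee the reply parses against the schema rather than merely being valid JSON.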
Hybrid SSM/Transformers > Pure SSMs/Pure Transformers
mamba-2-hybrid gpt-4 qwen-72b table-llava-7b nvidia lamini-ai sakana-ai luma-labs mixture-of-experts benchmarking fine-tuning multimodality text-to-video model-performance memory-optimization preference-optimization video-understanding multimodal-tables bryan-catanzaro bindureddy ylecun ctnzr corbtt realsharonzhou andrew-n-carr karpathy _akhaliq omarsar0
NVIDIA's Bryan Catanzaro highlights a new paper on Mamba models, showing that mixing Mamba and Transformer blocks outperforms either alone, with the optimal attention share below 20%. The Mixture-of-Agents (MoA) architecture improves LLM generation quality, scoring 65.1% on AlpacaEval 2.0 versus GPT-4 Omni's 57.5%. The LiveBench AI benchmark evaluates reasoning, coding, writing, and data analysis. The Mamba-2-Hybrid model, with 7% attention layers, surpasses a pure Transformer on MMLU accuracy, jumping from 50% to 53.6%. GPT-4 performs better at temperature=1. Qwen 72B leads open-source models on LiveBench AI. Lamini AI's Memory Tuning achieves 95% accuracy on a SQL agent task, improving over instruction fine-tuning. Sakana AI uses evolutionary strategies for preference optimization. Luma Labs Dream Machine demonstrates advanced text-to-video generation. The MMWorld benchmark evaluates multimodal video understanding, and Table-LLaVA 7B competes with GPT-4V on multimodal table tasks.
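The hybrid layout described above, mostly SSM blocks with a small fraction of attention layers spread through the stack, can be sketched as a simple layer schedule. The 7% ratio comes from the summary; the even-spacing placement heuristic is an assumption:

```python
# Hedged sketch of a hybrid Mamba/Transformer layer schedule: mostly "mamba"
# blocks with ~7% "attention" blocks spaced evenly through the stack. The
# placement rule is illustrative, not the paper's exact recipe.

def hybrid_schedule(n_layers: int, attn_ratio: float = 0.07) -> list:
    n_attn = max(1, round(n_layers * attn_ratio))
    stride = n_layers / n_attn
    # center one attention block in each of n_attn equal-width bands
    attn_positions = {round(stride * (i + 0.5)) for i in range(n_attn)}
    return ["attention" if i in attn_positions else "mamba" for i in range(n_layers)]

layers = hybrid_schedule(56)  # a hypothetical 56-layer stack -> 4 attention blocks
```

The point of the paper's result is that a schedule like this, with attention well under 20% of layers, beats both the pure-SSM and pure-Transformer extremes.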