All tags
Topic: "inference"
not much happened today
nemotron-nano-2 gpt-oss-120b qwen3 llama-3 minimax-m2 glm-4.6-air gemini-2.5-flash gpt-5.1-mini tahoe-x1 vllm_project nvidia mistral-ai baseten huggingface thinking-machines deeplearningai pytorch arena yupp-ai zhipu-ai scaling01 stanford transformer-architecture model-optimization inference distributed-training multi-gpu-support performance-optimization agents observability model-evaluation reinforcement-learning model-provenance statistical-testing foundation-models cancer-biology model-fine-tuning swyx dvilasuero _lewtun clementdelangue zephyr_z9 skylermiao7 teortaxestex nalidoust
vLLM announced support for NVIDIA Nemotron Nano 2, featuring a hybrid Transformer–Mamba design and tunable "thinking budget" enabling up to 6× faster token generation. Mistral AI Studio launched a production platform for agents with deep observability. Baseten reported high throughput (650 TPS) for GPT-OSS 120B on NVIDIA hardware. Hugging Face InspectAI added inference provider integration for cross-provider evaluation. Thinking Machines Tinker abstracts distributed fine-tuning for open-weight LLMs like Qwen3 and Llama 3. In China, MiniMax M2 shows competitive performance with top models and is optimized for agents and coding, while Zhipu GLM-4.6-Air focuses on reliability and scaling for coding tasks. Rumors suggest Gemini 2.5 Flash may be a >500B parameter MoE model, and a possible GPT-5.1 mini reference appeared. Outside LLMs, Tahoe-x1 (3B) foundation model achieved SOTA in cancer cell biology benchmarks. Research from Stanford introduces a method to detect model provenance via training-order "palimpsest" with strong statistical guarantees.
OpenAI Titan XPU: 10GW of self-designed chips with Broadcom
llama-3-70b openai nvidia amd broadcom inferencemax asic inference compute-infrastructure chip-design fp8 reinforcement-learning ambient-agents custom-accelerators energy-consumption podcast gdb
OpenAI is finalizing a custom ASIC chip design to deploy 10GW of inference compute, complementing existing deals with NVIDIA (10GW) and AMD (6GW). This marks a significant scale-up from OpenAI's current 2GW compute, aiming for a roadmap of 250GW total, which is half the energy consumption of the US. Greg from OpenAI highlights the shift of ChatGPT from interactive use to always-on ambient agents requiring massive compute, emphasizing the challenge of building chips for billions of users. The in-house ASIC effort was driven by the need for tailored designs after limited success influencing external chip startups. Broadcom's stock surged 10% on the news. Additionally, InferenceMAX reports improved ROCm stability and nuanced performance comparisons between AMD MI300X and NVIDIA H100/H200 on llama-3-70b FP8 workloads, with RL training infrastructure updates noted.
not much happened today
gpt-5-pro gemini-2.5 vllm deepseek-v3.1 openai google-deepmind microsoft epoch-ai-research togethercompute nvidia mila reasoning reinforcement-learning inference speculative-decoding sparse-attention kv-cache-management throughput-optimization compute-efficiency tokenization epochairesearch yitayml _philschmid jiqizhixin cvenhoff00 neelnanda5 lateinteraction mgoin_ blackhc teortaxestex
FrontierMath Tier 4 results show GPT-5 Pro narrowly outperforming Gemini 2.5 Deep Think in reasoning accuracy, with concerns about problem leakage clarified by Epoch AI Research. Mila and Microsoft propose Markovian Thinking to improve reasoning efficiency, enabling models to reason over 24K tokens with less compute. New research suggests base models inherently contain reasoning mechanisms, with "thinking models" learning to invoke them effectively. In systems, NVIDIA Blackwell combined with vLLM wins InferenceMAX with significant throughput gains, while Together AI's ATLAS adaptive speculative decoding achieves 4× speed improvements and reduces RL training time by over 60%. SparseServe introduces dynamic sparse attention with KV tiering, drastically improving throughput and latency in GPU memory management.
not much happened today
gpt-5-codex vllm-0.10.2 qwen3-next-80b hunyuanimage-2.1 openai microsoft perplexity-ai huggingface amd tencent lmstudio agentic-ai ide context-windows inference distributed-inference reinforcement-learning robotics long-context model-optimization text-to-image multimodality model-licenses gdb teknium1 finbarrtimbers thsottiaux theturingpost pierceboggan amandaksilver aravsrinivas sergiopaniego art_zucker danielhanchen rwojo awnihannun
GPT-5 Codex rollout shows strong agentic coding capabilities with some token bloat issues. IDEs like VS Code Insiders and Cursor 1.6 enhance context windows and model integration. vLLM 0.10.2 supports aarch64 and NVIDIA GB200 with performance improvements. AMD ROCm updates add modern attention, sparse MoE, and distributed inference. TRL introduces Context Parallelism for long-context training. Robotics and RL data pipelines improve with Unsloth and LeRobotDataset v3. Qwen3-Next-80B runs efficiently on Mac M4 Max with MLX. Tencent's HunyuanImage 2.1 is a 17B bilingual text-to-image model with 2048×2048 resolution and restricted open weights.
not much happened today
chai-2 gemini-2.5-pro deepseek-r1-0528 meta scale-ai anthropic cloudflare grammarly superhuman chai-discovery atlassian notion slack commoncrawl hugging-face sakana-ai inference model-scaling collective-intelligence zero-shot-learning enterprise-deployment data-access science-funding open-source-llms alexandr_wang nat_friedman clementdelangue teortaxestex ylecun steph_palazzolo andersonbcdefg jeremyphoward reach_vb
Meta makes a major AI move by hiring Scale AI founder Alexandr Wang as Chief AI Officer and acquiring a 49% non-voting stake in Scale AI for $14.3 billion, doubling its valuation to about $28 billion. Chai Discovery announces Chai-2, a breakthrough model for zero-shot antibody discovery and optimization. The US government faces budget cuts threatening to eliminate a quarter million science research jobs by 2026. Data access restrictions intensify as companies like Atlassian, Notion, and Slack block web crawlers including Common Crawl, raising concerns about future public internet archives. Hugging Face shuts down HuggingChat after serving over a million users, marking a significant experiment in open-source LLMs. Sakana AI releases AB-MCTS, an inference-time scaling algorithm enabling multiple models like Gemini 2.5 Pro and DeepSeek-R1-0528 to cooperate and outperform individual models.
not much happened today
dots-llm1 qwen3-235b xiaohongshu rednote-hilab deepseek huggingface mixture-of-experts open-source model-benchmarking fine-tuning inference context-windows training-data model-architecture model-performance model-optimization
China's Xiaohongshu (Rednote) released dots.llm1, a 142B parameter open-source Mixture-of-Experts (MoE) language model with 14B active parameters and a 32K context window, pretrained on 11.2 trillion high-quality, non-synthetic tokens. The model supports efficient inference frameworks like Docker, HuggingFace, and vLLM, and provides intermediate checkpoints every 1 trillion tokens, enabling flexible fine-tuning. Benchmarking claims it slightly surpasses Qwen3 235B on MMLU, though some concerns exist about benchmark selection and synthetic data verification. The release is notable for its truly open-source licensing and no synthetic data usage, sparking community optimism for support in frameworks such as llama.cpp and mlx.
Mary Meeker is so back: BOND Capital AI Trends report
qwen-3-8b anthropic hugging-face deepseek attention-mechanisms inference arithmetic-intensity transformers model-optimization interpretability model-quantization training tri_dao fleetwood___ teortaxestex awnihannun lateinteraction neelnanda5 eliebakouch _akhaliq
Mary Meeker returns with a comprehensive 340-slide report on the state of AI, highlighting accelerating tech cycles, compute growth, and comparisons of ChatGPT to early Google and other iconic tech products. The report also covers enterprise traction and valuation of major AI companies. On Twitter, @tri_dao discusses an "ideal" inference architecture featuring attention variants like GTA, GLA, and DeepSeek MLA with high arithmetic intensity (~256), improving efficiency and model quality. Other highlights include the release of 4-bit DWQ of DSR1 Qwen3 8B on Hugging Face, AnthropicAI's open-source interpretability tools for LLMs, and discussions on transformer training and abstractions by various researchers.
Qwen 3: 0.6B to 235B MoE full+base models that beat R1 and o1
qwen-3 qwen3-235b-a22b qwen3-30b-a3b deepseek-r1 o1 o3-mini grok-3 gemini-2.5-pro alibaba google-deepmind deepseek mistral-ai mixture-of-experts reinforcement-learning benchmarking model-release model-architecture long-context multi-agent-systems inference dataset-release awnihannun prince_canuma actuallyisaak oriolvinyalsml iscienceluvr reach_vb teortaxestex omarsar0
Qwen 3 has been released by Alibaba featuring a range of models including two MoE variants, Qwen3-235B-A22B and Qwen3-30B-A3B, which demonstrate competitive performance against top models like DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. The models introduce an "enable_thinking=True" mode with advanced soft switching for inference scaling. The release is notable for its Apache 2.0 license and broad inference platform support including MCP. The dataset improvements and multi-stage RL post-training contribute to performance gains. Meanwhile, Gemini 2.5 Pro from Google DeepMind shows strong coding and long-context reasoning capabilities, and DeepSeek R2 is anticipated soon. Twitter discussions highlight Qwen3's finegrained MoE architecture, large context window, and multi-agent system applications.
OpenAI o3, o4-mini, and Codex CLI
o3 o4-mini gemini-2.5-pro claude-3-sonnet chatgpt openai reinforcement-learning performance vision tool-use open-source coding-agents model-benchmarking multimodality scaling inference sama aidan_mclau markchen90 gdb aidan_clark_ kevinweil swyx polynoamial scaling01
OpenAI launched the o3 and o4-mini models, emphasizing improvements in reinforcement-learning scaling and overall efficiency, making o4-mini cheaper and better across prioritized metrics. These models showcase enhanced vision and tool use capabilities, though API access for these features is pending. The release includes Codex CLI, an open-source coding agent that integrates with these models to convert natural language into working code. Accessibility extends to ChatGPT Plus, Pro, and Team users, with o3 being notably more expensive than Gemini 2.5 Pro. Performance benchmarks highlight the intelligence gains from scaling inference, with comparisons against models like Sonnet and Gemini. The launch has been well received despite some less favorable evaluation results.
not much happened today
gpt-4.1 o3 o4-mini grok-3 grok-3-mini o1 tpuv7 gb200 openai x-ai google nvidia samsung memory model-release hardware-accelerators fp8 hbm inference ai-conferences agent-collaboration robotics model-comparison performance power-consumption sama
OpenAI teased a Memory update in ChatGPT with limited technical details. Evidence suggests upcoming releases of o3 and o4-mini models, alongside a press leak about GPT-4.1. X.ai launched the Grok 3 and Grok 3 mini APIs, confirmed as o1 level models. Discussions compared Google's TPUv7 with Nvidia's GB200, highlighting TPUv7's specs like 4,614 TFLOP/s FP8 performance, 192 GB HBM, and 1.2 Tbps ICI bandwidth. TPUv7 may have pivoted from training to inference chip use. Key AI events include Google Cloud Next 2025 and Samsung's Gemini-powered Ballie robot. The community is invited to participate in the AI Engineer World's Fair 2025 and the 2025 State of AI Engineering survey.
lots of small launches
gpt-4o claude-3.7-sonnet claude-3.7 claude-3.5-sonnet deepseek-r1 deepseek-v3 grok-3 openai anthropic amazon cloudflare perplexity-ai deepseek-ai togethercompute elevenlabs elicitorg inceptionailabs mistral-ai voice model-releases cuda gpu-optimization inference open-source api model-performance token-efficiency context-windows cuda jit-compilation lmarena_ai alexalbert__ aravsrinivas reach_vb
GPT-4o Advanced Voice Preview is now available for free ChatGPT users with enhanced daily limits for Plus and Pro users. Claude 3.7 Sonnet has achieved the top rank in WebDev Arena with improved token efficiency. DeepSeek-R1 with 671B parameters benefits from the Together Inference platform optimizing NVIDIA Blackwell GPU usage, alongside the open-source DeepGEMM CUDA library delivering up to 2.7x speedups on Hopper GPUs. Perplexity launched a new Voice Mode and a Deep Research API. The upcoming Grok 3 API will support a 1M token context window. Several companies including Elicit, Amazon, Anthropic, Cloudflare, FLORA, Elevenlabs, and Inception Labs announced new funding rounds, product launches, and model releases.
How To Scale Your Model, by DeepMind
qwen-0.5 google-deepmind deepseek hugging-face transformers inference high-performance-computing robotics sim2real mixture-of-experts reinforcement-learning bias-mitigation rust text-generation open-source omarsar0 drjimfan tairanhe99 guanyashi lioronai _philschmid awnihannun clementdelangue
Researchers at Google DeepMind (GDM) released a comprehensive "little textbook" titled "How To Scale Your Model" covering modern Transformer architectures, inference optimizations beyond O(N^2) attention, and high-performance computing concepts like rooflines. The resource includes practical problems and real-time comment engagement. On AI Twitter, several key updates include the open-sourced humanoid robotics model ASAP inspired by athletes like Cristiano Ronaldo, LeBron James, and Kobe Bryant; a new paper on Mixture-of-Agents proposing the Self-MoA method for improved LLM output aggregation; training of reasoning LLMs using the GRPO algorithm from DeepSeek demonstrated on Qwen 0.5; findings on bias in LLMs used as judges highlighting the need for multiple independent evaluations; and the release of mlx-rs, a Rust library for machine learning with examples including Mistral text generation. Additionally, Hugging Face launched an AI app store featuring over 400,000 apps with 2,000 new daily additions and 2.5 million weekly visits, enabling AI-powered app search and categorization.
Qwen with Questions: 32B open weights reasoning model nears o1 in GPQA/AIME/Math500
deepseek-r1 qwq gpt-4o claude-3.5-sonnet qwen-2.5 llama-cpp deepseek sambanova hugging-face dair-ai model-releases benchmarking fine-tuning sequential-search inference model-deployment agentic-rag external-tools multi-modal-models justin-lin clementdelangue ggerganov vikparuchuri
DeepSeek r1 leads the race for "open o1" models but has yet to release weights, while Justin Lin released QwQ, a 32B open weight model that outperforms GPT-4o and Claude 3.5 Sonnet on benchmarks. QwQ appears to be a fine-tuned version of Qwen 2.5, emphasizing sequential search and reflection for complex problem-solving. SambaNova promotes its RDUs as superior to GPUs for inference tasks, highlighting the shift from training to inference in AI systems. On Twitter, Hugging Face announced CPU deployment for llama.cpp instances, Marker v1 was released as a faster and more accurate deployment tool, and Agentic RAG developments focus on integrating external tools and advanced LLM chains for improved response accuracy. The open-source AI community sees growing momentum with models like Flux gaining popularity, reflecting a shift towards multi-modal AI models including image, video, audio, and biology.
Perplexity starts Shopping for you
pixtral-large-124b llama-3.1-405b claude-3.6 claude-3.5 stripe perplexity-ai mistral-ai hugging-face cerebras anthropic weights-biases google vllm-project multi-modal image-generation inference context-windows model-performance model-efficiency sdk ai-integration one-click-checkout memory-optimization patrick-collison jeff-weinstein mervenoyann sophiamyang tim-dettmers omarsar0 akhaliq aravsrinivas
Stripe launched their Agent SDK, enabling AI-native shopping experiences like Perplexity Shopping for US Pro members, featuring one-click checkout and free shipping via the Perplexity Merchant Program. Mistral AI released the Pixtral Large 124B multi-modal image model, now on Hugging Face and supported by Le Chat for image generation. Cerebras Systems offers a public inference endpoint for Llama 3.1 405B with a 128k context window and high throughput. Claude 3.6 shows improvements over Claude 3.5 but with subtle hallucinations. The Bi-Mamba 1-bit architecture improves LLM efficiency. The wandb SDK is preinstalled on Google Colab, and Pixtral Large is integrated into AnyChat and supported by vLLM for efficient model usage.
Pixtral Large (124B) beats Llama 3.2 90B with updated Mistral Large 24.11
pixtral-large mistral-large-24.11 llama-3-2 qwen2.5-7b-instruct-abliterated-v2-gguf qwen2.5-32b-q3_k_m vllm llama-cpp exllamav2 tabbyapi mistral-ai sambanova nvidia multimodality vision model-updates chatbots inference gpu-optimization quantization performance concurrency kv-cache arthur-mensch
Mistral has updated its Pixtral Large vision encoder to 1B parameters and released an update to the 123B parameter Mistral Large 24.11 model, though the update lacks major new features. Pixtral Large outperforms Llama 3.2 90B on multimodal benchmarks despite having a smaller vision adapter. Mistral's Le Chat chatbot received comprehensive feature updates, reflecting a company focus on product and research balance as noted by Arthur Mensch. SambaNova sponsors inference with their RDUs offering faster AI model processing than GPUs. On Reddit, vLLM shows strong concurrency performance on an RTX 3090 GPU, with quantization challenges noted in FP8 kv-cache but better results using llama.cpp with Q8 kv-cache. Users discuss performance trade-offs between vLLM, exllamav2, and TabbyAPI for different model sizes and batching strategies.
o1: OpenAI's new general reasoning models
o1 o1-preview o1-mini gpt-4o llama openai nvidia test-time-reasoning reasoning-tokens token-limit competitive-programming benchmarking scaling-laws ai-chip-competition inference training model-performance jason-wei jim-fan
OpenAI has released the o1 model family, including o1-preview and o1-mini, focusing on test-time reasoning with extended output token limits over 30k tokens. The models show strong performance, ranking in the 89th percentile on competitive programming, excelling in USA Math Olympiad qualifiers, and surpassing PhD-level accuracy on physics, biology, and chemistry benchmarks. Notably, o1-mini performs impressively despite its smaller size compared to gpt-4o. The release highlights new scaling laws for test-time compute that scale loglinearly. Additionally, Nvidia is reportedly losing AI chip market share to startups, with a shift in developer preference from CUDA to llama models for web development, though Nvidia remains dominant in training. This news reflects significant advances in reasoning-focused models and shifts in AI hardware competition.
OpenAI's Instruction Hierarchy for the LLM OS
phi-3-mini openelm claude-3-opus gpt-4-turbo gpt-3.5-turbo llama-3-70b rho-1 mistral-7b llama-3-8b llama-3 openai microsoft apple deepseek mistral-ai llamaindex wendys prompt-injection alignment benchmarking instruction-following context-windows model-training model-deployment inference performance-optimization ai-application career-advice drive-thru-ai
OpenAI published a paper introducing the concept of privilege levels for LLMs to address prompt injection vulnerabilities, improving defenses by 20-30%. Microsoft released the lightweight Phi-3-mini model with 4K and 128K context lengths. Apple open-sourced the OpenELM language model family with an open training and inference framework. An instruction accuracy benchmark compared 12 models, with Claude 3 Opus, GPT-4 Turbo, and Llama 3 70B performing best. The Rho-1 method enables training state-of-the-art models using only 3% of tokens, boosting models like Mistral. Wendy's deployed AI-powered drive-thru ordering, and a study found Gen Z workers prefer generative AI for career advice. Tutorials on deploying Llama 3 models on AWS EC2 highlight hardware requirements and inference server use.
Google Solves Text to Video
mistral-7b llava google-research amazon-science huggingface mistral-ai together-ai text-to-video inpainting space-time-diffusion code-evaluation fine-tuning inference gpu-rentals multimodality api model-integration learning-rates
Google Research introduced Lumiere, a text-to-video model featuring advanced inpainting capabilities using a Space-Time diffusion process, surpassing previous models like Pika and Runway. Manveer from UseScholar.org compiled a comprehensive list of code evaluation benchmarks beyond HumanEval, including datasets from Amazon Science, Hugging Face, and others. Discord communities such as TheBloke discussed topics including running Mistral-7B via API, GPU rentals, and multimodal model integration with LLava. Nous Research AI highlighted learning rate strategies for LLM fine-tuning, issues with inference, and benchmarks like HumanEval and MBPP. RestGPT gained attention for controlling applications via RESTful APIs, showcasing LLM application capabilities.