Company: "baseten"

kimi-k3 grok-4.5 chatgpt codex moonshot baseten nvidia red-hat-ai perplexity-ai togethercompute cursor_ai mixture-of-experts model-architecture attention-mechanisms reinforcement-learning infrastructure model-deployment agentic-ai mobile-ai multimodality model-distillation gpu-optimization system-design zhihufrontier rasbt bhavinjawade danizeres amansanger

Moonshot released the Kimi K3, a 2.8T-parameter MoE model with 104B active parameters/token, featuring innovations like Kimi Delta Attention (KDA), Gated MLA, and LatentMoE. The release includes infrastructure components such as MoonEP, FlashKDA, and AgentEnv, emphasizing system-level design. Despite open weights, running K3 requires significant hardware investment (minimum 8× MI355X GPUs, production at 64+ GPUs) with costs reaching six figures USD or tens of millions RMB. Hosted access is available via Perplexity, Baseten, and Together. Additionally, agent-based workflows are advancing with mobile orchestration, highlighted by ChatGPT Voice + Codex, Cursor's Start in India powered by Grok 4.5, and Perplexity's Personal Computer local agent with multi-model comparison via Model Council. "If you ever want to feel dumb just read the Kimi K3 technical report" captures community reaction to the dense technical details.
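
To give a rough sense of why open weights at this scale still imply a large hardware bill, the sketch below estimates storage for the weights alone from the 2.8T total-parameter figure. The precisions are hypothetical (the summary does not say which formats K3 ships in), and the estimate ignores KV cache, activations, and runtime overhead.

```python
def weight_footprint_gb(total_params: float, bits_per_param: int) -> float:
    """Approximate storage for model weights alone, in decimal gigabytes."""
    return total_params * bits_per_param / 8 / 1e9


# 2.8T total parameters, taken from the summary above.
K3_TOTAL_PARAMS = 2.8e12

# Hypothetical precisions for illustration; not confirmed for K3.
for bits in (16, 8, 4):
    gb = weight_footprint_gb(K3_TOTAL_PARAMS, bits)
    print(f"{bits}-bit weights: ~{gb:,.0f} GB")
```

Because an MoE must keep every expert resident even though only 104B parameters are active per token, it is the total parameter count, not the active count, that drives this footprint.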

Jul 27

not much happened today

kimi-k3 moonshot vllm baseten modal together-ai ollama dell nvidia mixture-of-experts model-scaling numerical-stability model-architecture open-models model-distribution model-licensing agentic-ai vision scaling-efficiency open-source-infrastructure commercial-restrictions ai-security kimi_moonshot jensenhuang natolambert petergostev artificialanlys

Moonshot released the Kimi K3 open-weights model, a 2.8T-parameter MoE with 104B active parameters, 896 experts, and 1M-token context featuring native visual understanding. The release includes open-source infrastructure like FlashKDA, MoonEP, and AgentENV, enabling large-scale agentic post-training and serving. The technical report highlights a ~2.5× scaling-efficiency improvement over K2 with innovations in numerical stability and MoE routing. Licensing is source-available with commercial-use restrictions, signaling a trend towards open-weight models with business carve-outs. Distribution was broad and immediate via platforms like vLLM, Baseten, Modal, Together, and Ollama Cloud. Separately, NVIDIA launched the Open Secure AI Alliance to build an ecosystem combining open and closed frontier models for AI security, emphasizing defense against attackers already equipped with strong AI.

Jul 15

not much happened today

inkling thinking-machines-lab huggingface vllm_project lmsysorg modal baseten databricks mixture-of-experts multimodality foundation-models model-licensing context-window open-weights model-release miramurati soumithchintala johnschulman2 lilianweng natolambert artificialanlys scaling01

Thinking Machines Lab launched Inkling, its first fully released open-weights foundation model family, featuring 975B parameters with 41B active parameters in a Mixture-of-Experts architecture. Inkling supports multimodality with text, image, and audio inputs and text output, is Apache 2.0 licensed, and offers up to 1M context window. The model is available on platforms like Tinker, Hugging Face, and partners, with broad ecosystem support from vLLM, SGLang, Modal, Baseten, and Databricks. Key figures such as Mira Murati, Soumith Chintala, John Schulman, and Lilian Weng highlighted its open weights, customization, and practical use focus. Independent commentators noted it as the strongest U.S.-based open-weight release to date, though still behind top Chinese open-weight and best closed models on some benchmarks.

Jun 16

GLM 5.2: the top Frontend Coding model in the world, IndexShare reduces costs

glm-5.2 z.ai lmsys deepseek cloudflare openrouter ollama baseten deepinfra fireworks notion coding agentic-ai long-context mixture-of-experts sparse-attention speculative-decoding multi-token-prediction model-benchmarking inference-optimization mervenoyann sentdex scaling01 omarsar0 teortaxestex

Z.ai released GLM-5.2, an MIT-licensed open-weight frontier model targeting coding and long-horizon agentic tasks with a 1M-token context window and two reasoning-effort modes. It features a 744B-parameter mixture-of-experts architecture with 40B active parameters per token, built on DeepSeek Sparse Attention extended by IndexShare, and supports improved multi-token prediction (MTP) for speculative decoding. The model achieved strong leaderboard placements, including #3 on FrontierSWE, #1 on Design Arena, and #1 open model on Agent Arena, with ecosystem support from platforms like Transformers, vLLM, SGLang, Cloudflare Workers AI, OpenRouter, Ollama Cloud, Baseten, DeepInfra, Fireworks, and Notion. Early testers praised its potential as a substitute for Opus/GPT-class workflows, though some called for further evaluation and long-horizon validation.

Jun 04

not much happened today

nemotron-3-ultra nemotron-3.5-asr claude-opus-4 mythos-preview nvidia anthropic togethercompute baseten modal vllm_project fireworksai_hq ollama wandb cline primeintellect nousresearch mixture-of-experts long-context model-quantization agentic-ai streaming-speech asr low-precision-training benchmarking recursive-self-improvement code-generation model-speedup piotrz_zelasko

NVIDIA released Nemotron 3 Ultra, a fully open 550B MoE model with 55B active parameters and 1M context, optimized for long-running agent tasks with up to 5x speedup and 30% cost reduction. It features hybrid Mamba/attention, LatentMoE, native MTP, and was pretrained on 20T tokens using NVFP4 low-precision format. Benchmarks show strong performance with 47.7 Intelligence Index and 400+ output tokens/sec. The model is supported across major serving platforms. Additionally, Nemotron 3.5 ASR is an open streaming ASR model with 0.6B parameters, supporting 40 language-locale combinations and sub-100ms latency, designed for voice agents. Anthropic highlighted early signs of recursive self-improvement (RSI) in AI, with Claude models authoring 80%+ of merged code and engineers shipping 8x more code. Claude Opus 4 achieved 3x speedup on training scripts, while Mythos Preview reached ~52x speedup and provided better research suggestions than humans 64% of the time.

Jun 02

Microsoft Build: MAI-Thinking-1 and MAI Family models, Surface RTX Spark Dev Box, and OpenClaw in Windows

mai-thinking-1 mai-code-1-flash holo-3.1 qwen-35b sonnet-4.6 claude-code codex microsoft openrouter fal baseten hcompany_ai teksedge nous-research teknim cognition windsurf perplexity-ai mixture-of-experts context-windows benchmarking reinforcement-learning prompt-optimization agentic-ai local-inference model-family-expansion model-reporting agent-native-devices software-development model-optimization hybrid-inference desktop-agents model-quantization mustafasuleyman eliebakouch hannahajishirzi asadovsky bj2rn lateinteraction lakshyaaagrawal theturingpost kimmonismus yusuf_i_mehdi pierceboggan lukehoban nielsrogge russelljkaplan

Microsoft introduced MAI-Thinking-1, a 35B parameter MoE model with 256K context, achieving 97% on AIME 2025 and outperforming Sonnet 4.6 in human preference tests. The broader 7-model MAI family spans reasoning, code, image, speech, and voice, with third-party availability on OpenRouter, fal, and Baseten. The detailed 109-page technical report revealed insights on scaling, MFU, RL/post-training, and data curation, highlighting no third-party distillation and advanced prompt optimization techniques. Microsoft emphasized agent-native devices and local inference with projects like Project Solara / Scout and the Surface RTX Spark Dev Box, alongside software innovations such as the Copilot desktop app and MAI-Code-1-Flash integration. Meanwhile, local-first computer-use agents like Holo 3.1 (Qwen-based, 0.8B to 35B parameters) support laptops and small workstations with optimized formats and strong benchmark results. Desktop shells for agents, including Hermes Desktop, Devin Desktop, and agent-neutral approaches compatible with Devin, Claude Code, and Codex, are proliferating, with hybrid local/cloud execution becoming the default architecture as seen in Perplexity Computer's hybrid agentic inference.

May 26

not much happened today

eagle-3.1 unigram-tokenizer qwen-3.5 deepseek-v4-pro mimo deep-agents-v0.6 397b-parameter-model eaglecorp vllm_project perplexity_ai alibaba lightseek nvidia mooncake flashattention kimmonismus deepseek xiaomi langchain baseten trajectory clay harvey decagon mercor rogo rlm inference-optimization long-context speculative-decoding tokenization attention-mechanisms kv-cache cache-hierarchy agent-engineering model-harness-memory-fit continual-learning quantization autoscaling memory-centric-agents evaluation-automation kimmonismus _luofuli vtrivedy10

Inference optimization is increasingly architectural: EAGLE 3.1, built in collaboration with vLLM and TorchSpec, improves speculative decoding and long-context handling. Perplexity open-sourced a rebuilt Unigram tokenizer that cuts CPU use by 5–6× and achieves 63 µs at 514 tokens. Qwen3.5 hits 580 tokens/s via joint efforts from Alibaba, LightSeek, NVIDIA, Mooncake, and FlashAttention-4 contributors. API price cuts from Chinese labs are sustainable because of structural KV-cache and attention improvements, exemplified by DeepSeek V4-Pro and Xiaomi MiMo reducing caching costs significantly. Agent engineering is shifting its focus from model quality to model-harness-memory fit, with LangChain releasing Deep Agents v0.6 and tools like LangSmith Engine automating evaluation loops. Trajectory launched a continual learning platform with $15M funding and partners like Clay and Harvey, supporting large models including a 397B-parameter model deployed on autoscaled H100 infrastructure. Open-source memory-centric agents and minimal training harnesses also gained attention.

May 04

not much happened today

gpt-5.2-codex gpt-5.3-codex openai langchain baseten ollama openrouter agent-orchestration context-pipelines coding-agents pricing-models multi-agent-systems workflow-optimization model-agnostic-orchestration prompt-engineering memory-optimization anthony_maio mason_drxy hwchase17 sydneyrunkle naroh teknuim vtrivedy dbreunig zachtratar theo petergostev cheatyyyy

AI Twitter Recap highlights the shift from model-centric AI to context pipelines and agent orchestration as key performance drivers. Notably, gpt-5.2-codex and gpt-5.3-codex showed significant benchmark improvements through prompt and middleware tuning. The ecosystem around open harnesses like Hermes, deepagents, and Flue is rapidly evolving, with innovations in multi-agent coordination and model-agnostic orchestration. Developer workflows are adapting to coding agents such as Codex and Claude Code, with emerging challenges in pricing models due to high token usage in agentic workloads. The practical takeaway is that agent performance depends on the synergy of model × harness × memory/context strategy, not just model weights alone.

Apr 28

not much happened today

vllm-0.20.0 poolside-laguna-xs.2 ling-2.6-flash nemotron-3-nano-omni qwen-3.5 vllm poolside nvidia openrouter lmstudio ollama unsloth fal fireworks deepinfra togethercompute baseten canonical memory-optimization mixture-of-experts model-optimization inference-speed quantization model-deployment multimodality hardware-optimization model-benchmarking open-models agentic-ai jeremyphoward maharshii teortaxestex aymericroucher piotrz

vLLM v0.20.0 introduces significant improvements in memory and MoE serving efficiency, including a TurboQuant 2-bit KV cache for 4× KV capacity and a 2.1% latency improvement. The update adds DeepSeek V4 MegaMoE support on Blackwell and covers hardware platforms including Jetson Thor, ROCm, Intel XPU, and Grace-Blackwell setups. Early benchmarks show DeepSeek V4 Pro on B300 hardware can be up to 8× faster than on H200. The ecosystem is rapidly adopting day-0 support for new open models such as Poolside Laguna XS.2, Ling-2.6-flash, and NVIDIA Nemotron 3 Nano Omni. Poolside released Laguna XS.2, a 33B total / 3B active MoE coding model under Apache 2.0, capable of running on a single GPU, with hybrid attention and FP8 KV cache, performing near Qwen-3.5. NVIDIA launched Nemotron 3 Nano Omni, a 30B / A3B multimodal MoE with 256K context, supporting text, image, video, audio, and documents, with immediate distribution across multiple platforms. Discussions highlighted tradeoffs in quantization methods and a shift away from CUDA lock-in towards heterogeneous accelerator support.
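
The "2-bit KV cache for 4× KV capacity" figure can be sanity-checked with simple bit-width arithmetic. The sketch below is an idealized calculation, not vLLM code: it assumes capacity scales inversely with bits per cached element and ignores the per-block scale and metadata overhead that real quantization schemes carry. The baseline widths are assumptions; the summary does not state which baseline the 4× is measured against.

```python
def kv_capacity_multiplier(baseline_bits: int, quantized_bits: int) -> float:
    """Idealized gain in cacheable tokens when each KV element shrinks
    from baseline_bits to quantized_bits (metadata overhead ignored)."""
    if baseline_bits <= 0 or quantized_bits <= 0:
        raise ValueError("bit widths must be positive")
    return baseline_bits / quantized_bits


# Hypothetical baselines: an 8-bit (FP8-style) cache and a 16-bit cache.
for baseline in (8, 16):
    gain = kv_capacity_multiplier(baseline, 2)
    print(f"{baseline}-bit -> 2-bit: {gain:.0f}x tokens in the same memory")
```

Under these assumptions, a 4× gain corresponds to an 8-bit baseline, while a 16-bit baseline would give an idealized 8× before overhead.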

Apr 20

not much happened today

kimi-k2.6 qwen-3.6-max-preview moonshot alibaba vllm openrouter cloudflare baseten mlx nous-research opencode ollama mixture-of-experts multimodality int4-quantization long-context agentic-coding multi-agent-systems model-orchestration memory-consolidation llm-driven-replanning dynamic-context-injection

Moonshot's Kimi K2.6 is a major open-weight 1T-parameter MoE model featuring 32B active parameters, 384 experts, MLA attention, 256K context window, native multimodality, and INT4 quantization. It supports day-0 integration with platforms like vLLM, OpenRouter, Cloudflare Workers AI, and others, showcasing state-of-the-art performance on benchmarks such as HLE w/ tools 54.0, SWE-Bench Pro 58.6, and Math Vision w/ python 93.2. The model excels in long-horizon execution with over 4,000 tool calls, 12+ hour continuous runs, and 300 parallel sub-agents. Meanwhile, Alibaba's Qwen3.6-Max-Preview previewed enhanced agentic coding, improved world knowledge, and instruction following, with notable performance on AIME 2026 #15 and ranking in Code Arena. Hermes Agent is rapidly expanding its ecosystem, surpassing 100K GitHub stars and integrating with tools like Ollama and Copilot CLI, while pioneering advanced multi-agent orchestration techniques such as stateless ephemeral units, LLM-driven replanning, and dynamic context injection. These developments highlight the competitive momentum of Chinese open and semi-open labs in coding and agent models.

Mar 11

not much happened today

nemotron-3-super gpt-oss-120b qwen3.5-122b-a10b nvidia perplexity replit base44 vllm llama.cpp ollama togethercompute baseten wandb langchain unsloth model-architecture model-optimization inference-speed kv-cache multi-token-prediction agent-infrastructure orchestration persistent-agents model-serving product-launches karpathy ctnzr bnjmn_marie artificialanlys

NVIDIA’s Nemotron 3 Super is a 120B parameter / ~12B active open model featuring a hybrid Mamba-Transformer / SSM Latent MoE architecture and 1M context window, delivering up to 2.2x faster inference than GPT-OSS-120B in FP4 with strong throughput gains. It supports agentic workloads and is unusually open with weights, data, and infrastructure details released. The model scored 36 on the AA Intelligence Index, outperforming GPT-OSS-120B but behind Qwen3.5-122B-A10B. Community and infrastructure support from projects like vLLM, llama.cpp, Ollama, Together, Baseten, W&B Inference, LangChain, and Unsloth GGUFs was immediate. Key technical innovations include native multi-token prediction (MTP) and a significant KV-cache efficiency advantage. On the product side, a shift towards persistent agent runtimes and orchestration layers is highlighted, with Andrej Karpathy advocating for a "bigger IDE" concept where agents replace files as the unit of work, enabling legible, forkable agentic organizations with real-time control. New launches fitting this vision include Perplexity’s Personal Computer, an always-on local/cloud hybrid running on Mac mini, and Computer for Enterprise orchestrating 20 specialized models and 400+ apps. Replit Agent 4 offers a collaborative, canvas-like workflow with parallel agents, while Base44 Superagents provide integrated solutions for nontechnical users. The engineering focus is increasingly on the orchestration harness rather than just the model.

Jan 22

not much happened today

claude-3 codex gemini gpt-5.2-pro anthropic openai google sakana-ai cursor baseten epoch-ai-research deepmind benchmarking reasoning continual-learning reinforcement-learning model-performance agentic-ai security model-training sama fchollet shane_legg demishassabis

Anthropic launches "Claude in Excel Pro" with enhanced features. OpenAI reveals upcoming Codex agent loop and cybersecurity measures. Google boosts Gemini App quotas and partners with Sakana AI for advanced AI Scientist projects in Japan. Cursor introduces Agent Skills for dynamic context focus. GPT-5.2 Pro achieves 31% on FrontierMath Tier 4, showing significant benchmark progress. Baseten raises $300M at a $5B valuation targeting high-performance inference. Discussions highlight math benchmarks as indicators of AI capability, uneven AGI progress, and the importance of reasoning and continual learning as future frontiers. Notable figures include Sam Altman, François Chollet, Shane Legg, and Demis Hassabis.

Dec 29, 2025

Meta Superintelligence Labs acquires Manus AI for over $2B, at $100M ARR, 9 months after launch

glm-4.7 minimax-m2.1 vllm manus benchmark meta-ai-fair vllm amd sglang weaviate teknim baseten alphaxiv minimax performance-optimization inference-frameworks model-benchmarking model-deployment open-source-models multimodality api code-generation community-building alex_wang nat_friedman

Manus achieved a rapid growth trajectory in 2025, raising a Benchmark-led round at a roughly $500M valuation and reaching $100M ARR before being acquired by Meta for a reported $2B+. The vLLM team launched a dedicated community site with new resources, while performance issues with AMD MI300X FP8 were noted in vLLM and SGLang benchmarks. Weaviate released operational features including Object TTL, Java v6 client GA, and multimodal document embeddings. Teknium raised API fragmentation concerns and advocated for unified SDK wrappers. In open-weight models, GLM-4.7 gained recognition as a reliable coding model with faster throughput on Baseten, and MiniMax-M2.1 rose as a leading open agentic coding model, topping WebDev leaderboards.

Dec 15, 2025

NVIDIA Nemotron 3: hybrid Mamba-Transformer completely open source models from 30B to 500B

nemotron-3-nano qwen3-30b-a3b-base nvidia huggingface togethercompute baseten vllm llamaindex hybrid-architecture mixture-of-experts reinforcement-learning long-context model-release open-source-models model-training model-optimization benchmarking agent-training ctnzr andrew_n_carr awnihannun

NVIDIA has released Nemotron 3 Nano, a fully open-source hybrid Mamba-Transformer Mixture-of-Experts (MoE) model with a 30B parameter size and a 1 million token context window. It includes open weights, training recipes, datasets, and an RL environment suite called NeMo Gym, supporting commercial use under the NVIDIA Open Model License. The model achieves state-of-the-art results on benchmarks like SWE-Bench and Artificial Analysis Intelligence Index, outperforming Qwen3-30B A3B. Ecosystem support is immediate with integrations into inference stacks like vLLM, llama.cpp, and Baseten. Upcoming larger models, Nemotron Super and Ultra, will feature NVFP4 pretraining and LatentMoE routing to optimize compute. This release marks a significant milestone for open-source American AI with comprehensive open assets and advanced hybrid architecture.

Nov 06, 2025

Kimi K2 Thinking: 1T-A32B params, SOTA HLE, BrowseComp, TauBench && Soumith leaves PyTorch

kimi-k2-thinking gemini moonshot-ai google apple vllm_project arena baseten yupp_ai mixture-of-experts quantization int4 context-window agentic-ai benchmarking model-deployment inference-acceleration api performance-optimization eliebakouch nrehiew_ andrew_n_carr ofirpress artificialanlys sundarpichai akhaliq

Moonshot AI launched Kimi K2 Thinking, a 1 trillion parameter mixture-of-experts (MoE) model with 32 billion active parameters, a 256K context window, and native INT4 quantization-aware training. It achieves state-of-the-art results on benchmarks like HLE (44.9%) and BrowseComp (60.2%), and sustains agentic tool use across 200-300 sequential tool calls. The model is deployed with vLLM support and OpenAI-compatible APIs, available on platforms like Arena, Baseten, and Yupp. Early user reports note some API instability under launch load. Meanwhile, Google announced the TPU v7 (Ironwood) with a 10× peak performance improvement over TPU v5p, aimed at training and agentic inference for models like Gemini. Apple added support for M5 Neural Accelerators in llama.cpp for inference acceleration.

Oct 27, 2025

MiniMax M2 230BA10B — 8% of Claude Sonnet's price, ~2x faster, new SOTA open model

minimax-m2 hailuo-ai huggingface baseten vllm modelscope openrouter cline sparse-moe model-benchmarking model-architecture instruction-following tool-use api-pricing model-deployment performance-evaluation full-attention qk-norm gqa rope reach_vb artificialanlys akhaliq eliebakouch grad62304977 yifan_zhang_ zpysky1125

MiniMax M2, an open-weight sparse MoE model by Hailuo AI, launches with ≈200–230B parameters and 10B active parameters, offering strong performance near frontier closed models and ranking #5 overall on the Artificial Analysis Intelligence Index v3.0. It supports coding and agent tasks, is licensed under MIT, and is available via API at competitive pricing. The architecture uses full attention, QK-Norm, GQA, partial RoPE, and sigmoid routing, with day-0 support in vLLM and deployment on platforms like Hugging Face and Baseten. Despite verbosity and no tech report, it marks a significant win for open models.

Oct 24, 2025

not much happened today

nemotron-nano-2 gpt-oss-120b qwen3 llama-3 minimax-m2 glm-4.6-air gemini-2.5-flash gpt-5.1-mini tahoe-x1 vllm_project nvidia mistral-ai baseten huggingface thinking-machines deeplearningai pytorch arena yupp-ai zhipu-ai scaling01 stanford transformer-architecture model-optimization inference distributed-training multi-gpu-support performance-optimization agents observability model-evaluation reinforcement-learning model-provenance statistical-testing foundation-models cancer-biology model-fine-tuning swyx dvilasuero _lewtun clementdelangue zephyr_z9 skylermiao7 teortaxestex nalidoust

vLLM announced support for NVIDIA Nemotron Nano 2, featuring a hybrid Transformer–Mamba design and a tunable "thinking budget" enabling up to 6× faster token generation. Mistral AI Studio launched a production platform for agents with deep observability. Baseten reported high throughput (650 TPS) for GPT-OSS 120B on NVIDIA hardware. Hugging Face InspectAI added inference provider integration for cross-provider evaluation. Thinking Machines Tinker abstracts distributed fine-tuning for open-weight LLMs like Qwen3 and Llama 3. In China, MiniMax M2 shows competitive performance with top models and is optimized for agents and coding, while Zhipu GLM-4.6-Air focuses on reliability and scaling for coding tasks. Rumors suggest Gemini 2.5 Flash may be a >500B parameter MoE model, and a possible GPT-5.1 mini reference appeared. Outside LLMs, the Tahoe-x1 (3B) foundation model achieved SOTA on cancer cell biology benchmarks. Research from Stanford introduces a method to detect model provenance via a training-order "palimpsest" with strong statistical guarantees.

Sep 26, 2025

not much happened today

gemini-robotics-1.5 gemini-live embeddinggemma veo-3 gemini-2.5-flash code-world-model-32b qwen3-coder-30b vllm-v1 mlx-lm flashattention-4 google meta-ai-fair perplexity-ai baseten spatial-reasoning temporal-reasoning agentic-ai code-semantics code-execution-traces coding-infrastructure runtime-optimization batch-inference embedding-latency api model-optimization model-performance osanseviero _anniexie rmstein scaling01 giffmana cline redhat_ai awnihannun charles_irl bernhardsson akshat_b aravsrinivas

Google released a dense September update including Gemini Robotics 1.5 with enhanced spatial/temporal reasoning, Gemini Live, EmbeddingGemma, and Veo 3 GA powering creative workflows. They also introduced agentic features like restaurant-reservation agents and reduced pricing for Gemini 2.5 Flash. Meta AI unveiled the open-weight Code World Model (CWM) 32B, excelling in code semantics and math benchmarks, with innovations in training code models via execution traces. Local-first coding setups highlight Qwen3-Coder-30B running efficiently on consumer GPUs, paired with tools like Cline and LM Studio. Runtime improvements include vLLM v1 supporting hybrid models and mlx-lm adding batch inference on Apple silicon. In infrastructure, FlashAttention 4 was reverse-engineered revealing a ~20% speedup from architectural optimizations. Perplexity AI advances its independent web index and browsing API with upcoming feed refreshes. Embedding latency improvements were achieved by Superhuman using Baseten.

Sep 11, 2025

Qwen3-Next-80B-A3B-Base: Towards Ultimate Training & Inference Efficiency

qwen3-next qwen3 mixtral-8x7b gemini-2.5-pro alibaba mistral-ai deepseek snowflake hugging-face baseten nvidia mixture-of-experts model-sparsity gated-attention hybrid-architecture rmsnorm model-stability model-training inference-optimization multi-token-prediction model-deployment justinlin610 teortaxestex yuchenj_uw

Mixture-of-Experts (MoE) architectures have become essential in frontier AI models, with Qwen3-Next pushing sparsity further by activating only 3.7% of parameters (3B out of 80B) using a hybrid architecture combining Gated DeltaNet and Gated Attention. The design includes 512 total experts, of which 10 routed experts plus 1 shared expert are active per token, Zero-Centered RMSNorm for stability, and improved MoE router initialization, resulting in ~10× cheaper training and 10× faster inference compared to previous models. Alibaba's Qwen3-Next reportedly outperforms Gemini-2.5-Flash-Thinking and approaches the flagship 235B model's performance, with deployments on Hugging Face, Baseten, and native vLLM support for efficient inference.
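
The sparsity figures in this summary reduce to two ratios, sketched below using the rounded numbers quoted above. The helper name is hypothetical, and the expert ratio reads "10 routed + 1 shared" as the experts active per token, which is an interpretation of the summary's wording.

```python
def active_fraction(active: float, total: float) -> float:
    """Share of a model's capacity that participates in one token's forward pass."""
    if total <= 0:
        raise ValueError("total must be positive")
    return active / total


# Rounded figures from the summary: 3B active out of 80B total parameters.
param_share = active_fraction(3e9, 80e9)

# Assumed reading: 10 routed + 1 shared expert active out of 512 total.
expert_share = active_fraction(10 + 1, 512)

print(f"active parameters: {param_share:.2%}")
print(f"active experts:    {expert_share:.2%}")
```

With the rounded inputs, the parameter ratio comes out slightly above the quoted 3.7%, which presumably reflects the exact rather than rounded parameter counts.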

Aug 22, 2025

not much happened today

qwen-image-edit qwen-vl-max kling-2.1 veo-3 deepseek-v3.1 genie-3 sima google-deepmind alibaba google deepseek baseten yupp multimodality embodied-ai simulation fine-tuning quantization video-generation image-generation local-inference scaling agent-training real-time-control spatial-memory demishassabis bonniesjli shreyar ostrisai lmarena_ai teortaxestex ivanfioravanti

DeepMind released Genie 3, an interactive multimodal world simulator with advanced spatial memory and real-time avatar control, and SIMA, an embodied training agent operating inside generated worlds. Alibaba introduced Qwen-Image-Edit, an open-weights image editor scoring ELO 1098 (#2) in the Image Editing Arena, running on Qualcomm NPUs, alongside Qwen-VL-Max entering the Vision top-20. Video models like Kling 2.1 showed a 235% improvement in frame control, with new entrants Luma Ray 2 and Runway Gen-4 Turbo debuting. Google provided free Veo 3 generations in Gemini App and enhanced Google Photos with natural-language edits. DeepSeek v3.1 launched with focus on SWE and Search agents, supporting local inference on Apple Silicon with 4-bit quantization achieving ~21 tok/s on M3 Ultra. The news highlights advances in interactive simulation, vision editing, video synthesis, and scalable local AI inference.

Aug 21, 2025

Cohere Command A Reasoning beats GPT-OSS-120B and DeepSeek R1 0528

command-a-reasoning deepseek-v3.1 cohere deepseek intel huggingface baseten vllm-project chutes-ai anycoder agentic-ai hybrid-models long-context fp8-training mixture-of-experts benchmarking quantization reasoning coding-workflows model-pricing artificialanlys reach_vb scaling01 cline ben_burtenshaw haihaoshen jon_durbin _akhaliq willccbb teortaxestex

Cohere's Command A Reasoning model outperforms GPT-OSS in open deep research capabilities, emphasizing agentic use cases for 2025. DeepSeek-V3.1 introduces a hybrid reasoning architecture toggling between reasoning and non-reasoning modes, optimized for agentic workflows and coding, with extensive long-context pretraining (~630B tokens for 32k context, ~209B for 128k), FP8 training, and a large MoE design with ~37B active parameters. Benchmarks show competitive performance with notable improvements in SWE-Bench and other reasoning tasks. The DeepSeek API prices the model at $0.56/M input tokens and $1.68/M output tokens, and the release enjoys rapid ecosystem integration including HF weights, INT4 quantization by Intel, and vLLM reasoning toggles. Community feedback highlights the hybrid design's pragmatic approach to agent and software engineering workflows, though some note the lack of tool use in reasoning mode.
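
For agentic workloads, where token counts are high, the quoted per-million-token prices translate into cost as in the minimal sketch below. The function name and the example token counts are hypothetical, and the sketch ignores any cache-hit discounts or other pricing tiers the API may offer.

```python
# Prices quoted in the summary above (USD per million tokens).
INPUT_USD_PER_M = 0.56
OUTPUT_USD_PER_M = 1.68


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one workload at the quoted list prices."""
    return (
        input_tokens / 1_000_000 * INPUT_USD_PER_M
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_M
    )


# Hypothetical agent run: 2M input tokens and 500K output tokens.
cost = request_cost_usd(2_000_000, 500_000)
print(f"estimated cost: ${cost:.2f}")
```

Since output tokens cost three times as much as input tokens at these prices, runs that generate long reasoning traces cost more than the raw token total would suggest.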

Aug 20, 2025

DeepSeek V3.1: 840B token continued pretrain, beating Claude 4 Sonnet at 11% of its cost

deepseek-v3.1 seed-oss-36b computerrl gemini-2.5-pro gpt-5 claude-code gpt-oss-120b gpt-oss-20b deepseek bytedance zhipu-ai github microsoft anthropic together-ai baseten huggingface token-efficiency coding agentic-benchmarks long-context reinforcement-learning developer-tools fine-tuning multinode-training model-release teortaxestex rasbt lukehoban burkeholland _catwu cline winglian

DeepSeek released DeepSeek V3.1, a quietly rolled-out open model with a 128K context window and improvements in token efficiency, coding, and agentic benchmarks. ByteDance launched the permissive Seed-OSS 36B model on Hugging Face, noted for long-context and reasoning capabilities. Zhipu AI introduced ComputerRL, a reinforcement learning framework for computer-use agents, achieving strong benchmark results. In developer tooling, GitHub Copilot expanded globally, Microsoft VS Code integrated Gemini 2.5 Pro and updated GPT-5 agent prompts, and Anthropic launched Claude Code seats with spend controls. Open-source fine-tuning advances include Together AI adding SFT for gpt-oss-120B/20B and Baseten enabling multinode 120B training with Truss CLI. The community noted mixed performance and ongoing post-training adjustments for DeepSeek V3.1.

Aug 06, 2025

not much happened today

gpt-oss-120b gpt-oss-20b kimi-k2 deepseek-r1 qwen-3-32b openai huggingface microsoft llamaindex ollama baseten fireworksai cerebras groq together anthropic google uk-aisi sliding-window-attention mixture-of-experts rope context-length mxfp4-format synthetic-data reasoning-core-hypothesis red-teaming benchmarking coding-benchmarks model-performance fine-tuning woj_zaremba sama huybery drjimfan jxmnop scaling01 arunv30 kevinweil xikun_zhang_ jerryjliu0 ollama basetenco reach_vb gneubig shxf0072 _lewtun

OpenAI released its first open models since GPT-2, gpt-oss-120b and gpt-oss-20b, which quickly trended on Hugging Face. Microsoft supports these models via Azure AI Foundry and Windows Foundry Local. Key architectural features include sliding window attention, mixture of experts (MoE), a RoPE variant, and a 128k context length. The models use a new MXFP4 format supported by llama.cpp. Hypotheses suggest gpt-oss was trained on synthetic data to enhance safety and performance, supporting the Reasoning Core Hypothesis. OpenAI announced a $500K bounty for red teaming with partners including Anthropic, Google, and the UK AISI. Performance critiques highlight inconsistent benchmarking results, with GPT-OSS-120B scoring 41.8% on the Aider Polyglot coding benchmark, trailing competitors like Kimi-K2 and DeepSeek-R1. Some users note the model excels in math and reasoning but lacks common sense and practical utility.

Aug 22, 2024

Ideogram 2 + Berkeley Function Calling Leaderboard V2

llama-3-70b gpt-4 phi-3.5 functionary-llama-3-70b llama-3 ideogram midjourney berkeley openai hugging-face microsoft meta-ai-fair baseten kai claude functionary function-calling benchmarking image-generation model-optimization vision multimodality model-performance fine-tuning context-windows cybersecurity code-analysis ai-assisted-development

Ideogram returns with a new image generation model featuring color palette control, a fully controllable API, and an iOS app, reaching a milestone of 1 billion images created. Meanwhile, Midjourney released a Web UI but still lacks an API. In function calling, the Berkeley Function Calling Leaderboard (BFCL) updated to BFCL V2 • Live, adding 2,251 live, user-contributed function docs and queries to improve evaluation quality. GPT-4 leads the leaderboard, but the open-source Functionary Llama 3-70B finetune from Kai surpasses Claude. On AI model releases, Microsoft launched three Phi-3.5 models with impressive reasoning and context window capabilities, while Meta AI FAIR introduced UniBench, a unified benchmark suite for over 50 vision-language model tasks. Baseten improved Llama 3 inference speed by up to 122% using Medusa. A new cybersecurity benchmark, Cyberbench, featuring 40 CTF tasks, was released. Additionally, Codegen was introduced as a tool for programmatic codebase analysis and AI-assisted development. "Multiple functions > parallel functions" was highlighted as a key insight in function calling.

May 23, 2024

ALL of AI Engineering in One Place

claude-3-sonnet claude-3 openai google-deepmind anthropic mistral-ai cohere hugging-face adept midjourney character-ai microsoft amazon nvidia salesforce mastercard palo-alto-networks axa novartis discord twilio tinder khan-academy sourcegraph mongodb neo4j hasura modular cognition anysphere perplexity-ai groq mozilla nous-research galileo unsloth langchain llamaindex instructor weights-biases lambda-labs neptune datastax crusoe covalent qdrant baseten e2b octo-ai gradient-ai lancedb log10 deepgram outlines crew-ai factory-ai interpretability feature-steering safety multilinguality multimodality rag evals-ops open-models code-generation gpus agents ai-leadership

The upcoming AI Engineer World's Fair in San Francisco from June 25-27 will feature a significantly expanded format with booths, talks, and workshops from top model labs like OpenAI, DeepMind, Anthropic, Mistral, Cohere, HuggingFace, and Character.ai. It includes participation from Microsoft Azure, Amazon AWS, Google Vertex, and major companies such as Nvidia, Salesforce, Mastercard, Palo Alto Networks, and more. The event covers 9 tracks including RAG, multimodality, evals/ops, open models, code generation, GPUs, agents, AI in Fortune 500, and a new AI leadership track. Additionally, Anthropic shared interpretability research on Claude 3 Sonnet, revealing millions of interpretable features that can be steered to modify model behavior, including safety-relevant features related to bias and unsafe content, though more research is needed for practical applications. The event offers a discount code for AI News readers.