Topic: "multi-token-prediction"
Qwen3-Next-80B-A3B-Base: Towards Ultimate Training & Inference Efficiency
qwen3-next qwen3 mixtral-8x7b gemini-2.5-pro alibaba mistral-ai deepseek snowflake hugging-face baseten nvidia mixture-of-experts model-sparsity gated-attention hybrid-architecture rmsnorm model-stability model-training inference-optimization multi-token-prediction model-deployment justinlin610 teortaxestex yuchenj_uw
Mixture-of-Experts (MoE) architectures have become essential in frontier AI models, and Qwen3-Next pushes sparsity further by activating only 3.7% of its parameters (3B out of 80B) in a hybrid architecture that combines Gated DeltaNet and Gated Attention. The design includes 512 total experts (10 routed + 1 shared active per token), Zero-Centered RMSNorm for stability, and improved MoE router initialization, resulting in ~10× cheaper training and 10× faster inference compared to previous models. Alibaba's Qwen3-Next reportedly outperforms Gemini-2.5-Flash-Thinking and approaches the performance of the flagship 235B model, with deployments on Hugging Face and Baseten and native vLLM support for efficient inference.
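As a rough check on the sparsity figures in this summary, the arithmetic can be sketched as follows. The variable names are illustrative, and the code assumes that "10 routed + 1 shared" refers to the experts active for each token out of the 512 total.

```python
# Figures quoted in the summary; variable names are illustrative only.
total_params_b = 80    # total parameters, in billions
active_params_b = 3    # parameters activated per token, in billions
total_experts = 512
routed_active = 10     # assumption: routed experts selected per token
shared_active = 1      # assumption: shared expert used for every token

# Share of parameters and share of experts that take part in one token.
param_fraction = active_params_b / total_params_b
expert_fraction = (routed_active + shared_active) / total_experts

print(f"active parameter fraction: {param_fraction:.2%}")
print(f"active expert fraction:    {expert_fraction:.2%}")
```

With the rounded 3B and 80B figures the parameter fraction comes out near 3.75%, in line with the quoted 3.7%. The expert fraction is smaller, which is expected if non-expert parameters such as attention layers and embeddings are used for every token.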
Anthropic raises $13B at $183B Series F
claude-code gpt-5 grok-4 claude sonnet-4 glm-4.5 deepseek-r1 anthropic mistral-ai x-ai salesforce galileo openpipe zhipu thudm enterprise-connectors agent-benchmarking reinforcement-learning inference-optimization memory-optimization cuda multi-token-prediction speculative-decoding tensor-offload performance-optimization real-time-guardrails cost-optimization swyx emilygsands _philschmid _lewtun omarsar0 _avichawla corbtt
Anthropic reached a $183B post-money valuation in its Series F round by September 2025, with run-rate revenue growing from about $1B in January to over $5B by August 2025. Its Claude Code product saw >10x usage growth in three months and reached $500M run-rate revenue, and the company serves over 300,000 business customers with a nearly 7x increase in large accounts. Mistral AI launched Le Chat with 20+ MCP connectors integrating with major SaaS platforms and persistent memory features. Benchmarking updates highlight GPT-5 leading agent intelligence indices, with strong performances from xAI's Grok and Anthropic's Claude families. Reliability tooling and agent evaluation advances were shared by Galileo, OpenPipe, and others. Zhipu/THUDM open-sourced Slime v0.1.0, enhancing the RL infrastructure behind GLM-4.5 with significant decoding speed improvements and advanced tensor offload techniques.
DeepSeek v3: 671B finegrained MoE trained for $5.5m USD of compute on 15T tokens
deepseek-v3 gpt-4o claude-3.5-sonnet llama-3 deepseek-ai hugging-face openai anthropic mixture-of-experts model-training model-optimization reinforcement-learning chain-of-thought multi-token-prediction synthetic-data model-distillation fine-tuning attention-mechanisms gpu-optimization nrehiew_ denny_zhou
DeepSeek-V3 has launched with 671B MoE parameters, was trained on 14.8T tokens, and outperforms GPT-4o and Claude-3.5-sonnet in benchmarks. It was trained with only 2.788M H800 GPU hours, far fewer than Llama-3's 30.8M GPU hours, showcasing major compute efficiency and cost reduction. The model is open-source and deployed via Hugging Face with API support. Innovations include native FP8 mixed precision training, Multi-Head Latent Attention scaling, distillation from synthetic reasoning data, pruning and healing for MoEs with up to 256 experts, and a new multi-token prediction objective enabling lookahead token planning. Research highlights also cover the OREO method and Natural Language Reinforcement Learning (NLRL) for multi-step reasoning and agent control.
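The link between the GPU-hour figure in this summary and the roughly $5.5m in the headline can be sketched with a few lines of arithmetic. The $2 per H800 GPU hour rental rate is an assumption that does not appear in the summary itself; the variable names are illustrative.

```python
# GPU-hour figures quoted in the summary.
deepseek_v3_gpu_hours = 2_788_000   # H800 GPU hours
llama_3_gpu_hours = 30_800_000      # GPU hours quoted for Llama-3

# Assumption: a rental price of $2 per H800 GPU hour (not stated in the summary).
assumed_usd_per_gpu_hour = 2

estimated_cost_usd = deepseek_v3_gpu_hours * assumed_usd_per_gpu_hour
gpu_hour_ratio = llama_3_gpu_hours / deepseek_v3_gpu_hours

print(f"estimated compute cost: ${estimated_cost_usd / 1e6:.3f}M")
print(f"Llama-3 / DeepSeek-V3 GPU hours: {gpu_hour_ratio:.1f}x")
```

Under that assumed rate the estimate is about $5.576M, which matches the headline figure after rounding, and the GPU-hour gap to Llama-3 is roughly 11x. The estimate covers only the quoted training GPU hours, not other costs.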