Company: "vllm"
not much happened today
kimi-k2 qwen3-next nemotron-nano-2 granite-4.0 gpt-4.5 copilot codex vllm perplexity-ai ibm anthropic graphiti claude cursor-ai microsoft mixture-of-experts model-integration cloud-computing hybrid-models benchmarking agent-systems memory-persistence semantic-search code-retrieval context-length-optimization tool-use evaluation-frameworks software-development scaling01 cedric_chee aravsrinivas omarsar0 _avichawla pierceboggan jo_parkhurst jyangballin ofirpress ml_angelopoulos
Kimi-K2 Reasoner, a 1.2-trillion-parameter MoE model, has been integrated into vLLM, with SGLang support expected soon. Perplexity AI released research on cloud-portable trillion-parameter MoE kernels optimized for AWS EFA, with potential integration into vLLM. IBM's vLLM team formalized hybrid dense and sparse expert models, supporting models like Qwen3-Next, Nemotron Nano 2, and Granite 4.0. Kimi-K2 reportedly scores 77% on GPQA Diamond versus 71.4% for GPT-4.5, though this figure is unverified.
Anthropic published a guide to building efficient tool-heavy agent systems with MCP patterns that cut context tokens by ~98.7%. Graphiti MCP demonstrated persistent agent memory shared across apps like Claude Desktop and Cursor. VS Code introduced an "Agent sessions" feature to unify agent management, including Copilot and Codex. Cursor AI improved coding accuracy via semantic search and code retrieval embeddings. New evaluation frameworks like CodeClash and LMArena assess agent and coding model performance through realistic multi-round tasks and occupation-tagged leaderboards.
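To put the ~98.7% figure in concrete terms, here is a small arithmetic sketch. The 150,000-token starting budget is a hypothetical number chosen for illustration; the summary above does not state the absolute token counts.

```python
def tokens_after_reduction(tokens_before: int, reduction_pct: float) -> int:
    """Return the tokens left after cutting `reduction_pct` percent of them."""
    return round(tokens_before * (1 - reduction_pct / 100))


# Hypothetical context budget before applying the MCP patterns.
before = 150_000
after = tokens_after_reduction(before, 98.7)
print(f"{before} tokens -> {after} tokens")
```

In other words, a 98.7% reduction leaves roughly 1.3% of the original context, whatever the starting size.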
not much happened today
qwen3-max-thinking minimax-m2 claude-3-sonnet llamaindex-light chronos-2 openai aws microsoft nvidia gpu_mode vllm alibaba arena llamaindex amazon anthropic gradio compute-deals gpu-optimization kernel-optimization local-serving reasoning long-context benchmarks long-term-memory time-series-forecasting agent-frameworks oauth-integration developer-tools sama gdb andrewcurran_ a1zhang m_sirovatka omarsar0 _philschmid
OpenAI and AWS announced a strategic partnership involving a $38B compute deal to deploy hundreds of thousands of NVIDIA GB200 and GB300 chips, while Microsoft secured a license to ship NVIDIA GPUs to the UAE with a planned $7.9B datacenter investment. NVIDIA and GPU_MODE launched a 3-month NVFP4 kernel optimization competition on Blackwell B200s, with prizes including DGX Spark and RTX 50XX GPUs. vLLM gained traction for local LLM serving, exemplified by PewDiePie's adoption. Alibaba previewed the Qwen3-Max-Thinking model, which hit 100% on the AIME 2025 and HMMT benchmarks, signaling advances in reasoning with tool use. The MIT-licensed MiniMax-M2 230B MoE model topped the Arena WebDev leaderboard, tying with Claude Sonnet 4.5 Thinking 32k. Critiques emerged on OSWorld benchmark stability and task validity, with one commenter arguing that OSWorld "doesn’t really exist" because different prompt sets produce incomparable scores. LlamaIndex's LIGHT framework demonstrated significant improvements in long-term memory tasks over raw context and RAG baselines, with gains up to +160.6% in summarization at 10M tokens. Amazon introduced Chronos-2, a time-series foundation model for zero-shot forecasting. The MCP ecosystem expanded with new tools like mcp2py OAuth integration and the Gemini Docs MCP server, alongside a build sprint by Anthropic and Gradio offering substantial credits and prizes.
MiniMax M2 230BA10B — 8% of Claude Sonnet's price, ~2x faster, new SOTA open model
minimax-m2 hailuo-ai huggingface baseten vllm modelscope openrouter cline sparse-moe model-benchmarking model-architecture instruction-following tool-use api-pricing model-deployment performance-evaluation full-attention qk-norm gqa rope reach_vb artificialanlys akhaliq eliebakouch grad62304977 yifan_zhang_ zpysky1125
MiniMax M2, an open-weight sparse MoE model by Hailuo AI, launches with ≈200–230B total parameters and 10B active parameters, offering performance close to frontier closed models and ranking #5 overall on the Artificial Analysis Intelligence Index v3.0. It supports coding and agent tasks, is licensed under MIT, and is available via API at competitive pricing. The architecture uses full attention, QK-Norm, GQA, partial RoPE, and sigmoid routing, with day-0 support in vLLM and deployment on platforms like Hugging Face and Baseten. Despite verbose outputs and the lack of a tech report, it marks a significant win for open models.
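As a rough illustration of what "sigmoid routing" in a sparse MoE layer can look like, the sketch below scores each expert independently with a sigmoid, keeps the top k, and normalizes the kept scores into mixing weights. This is a generic toy version, not MiniMax's implementation: the function name, the router logits, the choice of k, and the normalization step are all assumptions made for illustration.

```python
import math


def sigmoid_topk_route(logits, k):
    """Pick the k experts with the highest sigmoid scores and normalize their weights."""
    # Each expert gets an independent score in (0, 1), unlike softmax routing.
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    # Assumption: the selected scores are renormalized to sum to 1.
    return {i: scores[i] / total for i in top}


router_logits = [2.0, -1.0, 0.5, 3.0]  # hypothetical router outputs for 4 experts
weights = sigmoid_topk_route(router_logits, k=2)
print(weights)

# Share of parameters active per token, using the upper 230B estimate above.
active_fraction = 10 / 230
print(f"active fraction: {active_fraction:.1%}")
```

Only the selected experts run for a given token, which is how a model with roughly 230B total parameters can use about 10B per token.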
not much happened today
vllm chatgpt-atlas langchain meta microsoft openai pytorch ray claude agent-frameworks reinforcement-learning distributed-computing inference-correctness serving-infrastructure browser-agents security middleware runtime-systems documentation hwchase17 soumithchintala masondrxy robertnishihara cryps1s yuchenj_uw
LangChain & LangGraph 1.0 were released with major updates for reliable, controllable agents and unified docs, emphasizing "Agent Engineering." Meta introduced PyTorch Monarch and TorchForge for distributed programming and reinforcement learning, enabling large-scale agentic systems. The Microsoft Learn MCP server now integrates with tools like Claude Code and VS Code for instant doc querying, accelerating grounded agent workflows. vLLM improved inference correctness with token ID returns and batch-invariant inference, and is collaborating with Ray on orchestration within the PyTorch Foundation. OpenAI launched ChatGPT Atlas, a browser agent with contextual Q&A and advanced safety features, though early users note maturity challenges and urge caution around credential access.
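One commonly cited reason batch-invariant inference needs deliberate engineering is that floating-point addition is not associative, so a reduction that groups its terms differently (for example, because the batch size changed) can give slightly different results. The snippet below is general background on that effect only; it does not describe how vLLM implements batch invariance.

```python
# The same three numbers, summed with two different groupings.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False
```

Tiny differences like this can change which token has the highest logit, which is why fixing the reduction order matters for reproducible outputs.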
not much happened today
gpt-5-pro gemini-2.5 vllm deepseek-v3.1 openai google-deepmind microsoft epoch-ai-research togethercompute nvidia mila reasoning reinforcement-learning inference speculative-decoding sparse-attention kv-cache-management throughput-optimization compute-efficiency tokenization epochairesearch yitayml _philschmid jiqizhixin cvenhoff00 neelnanda5 lateinteraction mgoin_ blackhc teortaxestex
FrontierMath Tier 4 results show GPT-5 Pro narrowly outperforming Gemini 2.5 Deep Think in reasoning accuracy, with concerns about problem leakage clarified by Epoch AI Research. Mila and Microsoft propose Markovian Thinking to improve reasoning efficiency, enabling models to reason over 24K tokens with less compute. New research suggests base models inherently contain reasoning mechanisms, with "thinking models" learning to invoke them effectively. In systems, NVIDIA Blackwell combined with vLLM wins InferenceMAX with significant throughput gains, while Together AI's ATLAS adaptive speculative decoding achieves 4× speed improvements and reduces RL training time by over 60%. SparseServe introduces dynamic sparse attention with KV tiering to manage GPU memory, drastically improving throughput and latency.
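For intuition on where speculative-decoding speedups come from, the sketch below uses the standard textbook estimate for plain (non-adaptive) speculative decoding: if each drafted token is accepted independently with probability alpha and gamma tokens are drafted per step, the expected number of tokens produced per target-model pass is (1 - alpha^(gamma+1)) / (1 - alpha). This is not ATLAS's adaptive method, which the summary does not describe; the acceptance rates and draft length are hypothetical.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model pass in basic speculative decoding.

    alpha: per-token acceptance probability (assumed independent across tokens).
    gamma: number of tokens the draft model proposes per step.
    """
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the target model.
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


for alpha in (0.5, 0.8, 0.9):  # hypothetical acceptance rates
    print(alpha, round(expected_tokens_per_step(alpha, gamma=4), 3))
```

The estimate ignores the cost of running the draft model, so real wall-clock speedups are lower than the token count alone suggests.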
Oracle jumps +36% in a day after winning $300B OpenAI contract
qwen3-235b qwen3-4b qwen2.5-7b vllm oracle openai microsoft moonshot-ai vllm-project thinking-machines-lab meta reinforcement-learning model-weight-updates deterministic-inference benchmarking long-context model-optimization cuda distributed-training kimi_moonshot arankomatsuzaki qgallouedec cHHillee woosuk_k stasbekman
Oracle's OCI division reported a stunning +359% revenue bookings growth to $455B with cloud revenue guidance of $144B by 2030, driven significantly by a large deal with OpenAI amid tensions with Microsoft. On AI infrastructure, Moonshot AI released Kimi’s checkpoint-engine, enabling rapid weight updates on 1T-parameter models across thousands of GPUs, integrating with vLLM. RLFactory introduced a plug-and-play reinforcement learning framework for tool-using agents, showing smaller models outperforming larger ones. TRL v0.23 added context parallelism for long-context training. Thinking Machines Lab published research on deterministic inference pipelines, making vLLM deterministic for Qwen models. Meta launched BackendBench, a PyTorch benchmarking tool.
ChatGPT responds to GlazeGate + LMArena responds to Cohere
qwen3-235b-a22b qwen3 qwen3-moe llama-4 openai cohere lm-arena deepmind x-ai meta-ai-fair alibaba vllm llamaindex model-releases model-benchmarking performance-evaluation open-source multilinguality model-integration fine-tuning model-optimization joannejang arankomatsuzaki karpathy sarahookr reach_vb
OpenAI faced backlash after a controversial ChatGPT update, leading to an official retraction admitting they "focused too much on short-term feedback." Researchers from Cohere published a paper criticizing LMArena for unfair practices favoring incumbents like OpenAI, DeepMind, X.ai, and Meta AI Fair. The Qwen3 family by Alibaba was released, featuring models up to 235B MoE, supporting 119 languages and trained on 36 trillion tokens, with integration into vLLM and support in tools like llama.cpp. Meta announced the second round of Llama Impact Grants to promote open-source AI innovation. Discussions on AI Twitter highlighted concerns about leaderboard overfitting and fairness in model benchmarking, with notable commentary from karpathy and others.
LlamaCon: Meta AI gets into the Llama API platform business
llama-4 qwen3 qwen3-235b-a22b qwen3-30b-a3b qwen3-4b qwen2-5-72b-instruct o3-mini meta-ai-fair cerebras groq alibaba vllm ollama llamaindex hugging-face llama-cpp model-release fine-tuning reinforcement-learning moe multilingual-models model-optimization model-deployment coding benchmarking apache-license reach_vb huybery teortaxestex awnihannun thezachmueller
Meta celebrated progress in the Llama ecosystem at LlamaCon, launching an AI Developer platform with finetuning and fast inference powered by Cerebras and Groq hardware, though it remains waitlisted. Meanwhile, Alibaba released the Qwen3 family of large language models, including two MoE models and six dense models ranging from 0.6B to 235B parameters, with the flagship Qwen3-235B-A22B achieving competitive benchmark results and supporting 119 languages and dialects. The Qwen3 models are optimized for coding and agentic capabilities, are Apache 2.0 licensed, and have broad deployment support including local usage with tools like vLLM, Ollama, and llama.cpp. Community feedback highlights Qwen3's scalable performance and superiority over models like OpenAI's o3-mini.
Cognition's DeepWiki, a free encyclopedia of all GitHub repos
o4-mini perception-encoder qwen-2.5-vl dia-1.6b grok-3 gemini-2.5-pro claude-3.7 gpt-4.1 cognition meta-ai-fair alibaba hugging-face openai perplexity-ai vllm vision text-to-speech reinforcement-learning ocr model-releases model-integration open-source frameworks chatbots model-selector silas-alberti mervenoyann reach_vb aravsrinivas vikparuchuri lioronai
Silas Alberti of Cognition announced DeepWiki, a free encyclopedia of all GitHub repos providing Wikipedia-like descriptions and Devin-backed chatbots for public repos. Meta released Perception Encoders (PE) under an Apache 2.0 license, outperforming InternVL3 and Qwen2.5VL on vision tasks. Alibaba launched the Qwen Chat App for iOS and Android. Hugging Face integrated the Dia 1.6B SoTA text-to-speech model via FAL. OpenAI expanded deep research usage with a lightweight version powered by the o4-mini model, now available to free users. Perplexity AI updated their model selector with Grok 3 Beta, o4-mini, and support for models like Gemini 2.5 Pro, Claude 3.7, and GPT-4.1. The vLLM project introduced the OpenRLHF framework for reinforcement learning with human feedback. The Surya OCR alpha model supports 90+ languages and LaTeX. The MegaParse open-source library was introduced for LLM-ready data formats.
not much happened today
vllm deepseek-v3 llamaindex openai deepseek qdrant twilio llamaindex elevenlabs training-efficiency parallelism cpu-offloading gradient-descent mixture-of-experts fp8-precision memory-optimization ai-voice-assistants coding-assistants document-processing version-control learning-rate-schedules federated-learning agentic-systems multi-agent-systems deliberative-alignment chain-of-thought on-device-ai multimodality francois-fleuret daniel-hanchen aaron-defazio fchollet elad-gil wojciech-zaremba richard-socher
ChatGPT, Sora, and the OpenAI API experienced a >5 hour outage but are now restored. Updates to vLLM enable DeepSeek-V3 to run with enhanced parallelism and CPU offloading, improving model deployment flexibility. Discussions on gradient descent in top-k routing MoE and adoption of FP8 precision focus on training efficiency and memory optimization. AIDE, an AI voice medical assistant by Team Therasync, leverages Qdrant, OpenAI, and Twilio. DeepSeek-Engineer offers AI-powered coding assistance with structured outputs. LlamaIndex integrates LlamaCloud and ElevenLabs for large-scale document processing and voice interaction. Insights on version control with ghstack and advocacy for linear decay learning rate schedules highlight best practices in AI development. Experts predict smaller, tighter models, true multimodal models, and on-device AI in 2025. Proposals for planetary-scale federated learning and community AGI moonshots emphasize future AI directions. Discussions on agentic systems, multi-agent workflows, and deliberative alignment through chain of thought reasoning underscore AI safety and alignment efforts.
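The linear decay learning rate schedule mentioned above can be written in a few lines. This is a minimal sketch assuming no warmup phase and a decay to exactly zero at the final step; the function name and the example hyperparameters are illustrative, not taken from the discussion being summarized.

```python
def linear_decay_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Learning rate that falls linearly from base_lr at step 0 to 0 at total_steps."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining


# Hypothetical run: 1,000 steps with a base learning rate of 3e-4.
for step in (0, 250, 500, 1000):
    print(step, linear_decay_lr(step, total_steps=1000, base_lr=3e-4))
```

Steps past `total_steps` are clamped to a learning rate of zero rather than going negative.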
Pixtral Large (124B) beats Llama 3.2 90B with updated Mistral Large 24.11
pixtral-large mistral-large-24.11 llama-3-2 qwen2.5-7b-instruct-abliterated-v2-gguf qwen2.5-32b-q3_k_m vllm llama-cpp exllamav2 tabbyapi mistral-ai sambanova nvidia multimodality vision model-updates chatbots inference gpu-optimization quantization performance concurrency kv-cache arthur-mensch
Mistral has updated its Pixtral Large vision encoder to 1B parameters and released an update to the 123B parameter Mistral Large 24.11 model, though the update lacks major new features. Pixtral Large outperforms Llama 3.2 90B on multimodal benchmarks despite having a smaller vision adapter. Mistral's Le Chat chatbot received comprehensive feature updates, reflecting a company focus on product and research balance as noted by Arthur Mensch. SambaNova sponsors inference with their RDUs offering faster AI model processing than GPUs. On Reddit, vLLM shows strong concurrency performance on an RTX 3090 GPU, with quantization challenges noted in FP8 kv-cache but better results using llama.cpp with Q8 kv-cache. Users discuss performance trade-offs between vLLM, exllamav2, and TabbyAPI for different model sizes and batching strategies.
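To see why KV-cache precision matters for concurrency on a 24 GB card like the RTX 3090, the sketch below estimates KV-cache memory per token as 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The model dimensions and the 8 GiB cache budget are hypothetical, and the estimate ignores any scaling-factor metadata that quantized cache formats may store.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache one token occupies: keys plus values across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Hypothetical GQA model: 32 layers, 8 KV heads, head dimension 128.
fp16 = kv_bytes_per_token(32, 8, 128, bytes_per_elem=2)
fp8 = kv_bytes_per_token(32, 8, 128, bytes_per_elem=1)

budget = 8 * 1024**3  # assume 8 GiB of VRAM is left over for the KV cache
print("FP16 tokens that fit:", budget // fp16)
print("FP8 tokens that fit: ", budget // fp8)
```

Halving the bytes per element doubles the number of cached tokens, and therefore the number of concurrent sequences of a given length, which is the trade-off against the quantization quality issues noted above.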