Topic: "kernel-optimization"
not much happened today
DeepSeek released a paper on mHC (Manifold-Constrained Hyper-Connections), which treats residual-path design as a key scaling lever in neural networks. The approach constrains the residual mixing matrices to the Birkhoff polytope (the set of doubly stochastic matrices) to improve stability and performance, at a training overhead of only about 6.7%. It also includes systems-level optimizations such as fused kernels and activation recomputation, an example of a frontier lab combining math with kernel engineering. Separately, discussions of long-horizon agents identify context management as the bottleneck and introduce Recursive Language Models (RLMs), which manage context dynamically rather than relying on larger context windows. Together, these point to a shift in architectural design and efficiency for base model training and agent development.
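To make the Birkhoff-polytope constraint concrete: a matrix lies in that polytope when its entries are non-negative and every row and every column sums to 1. The summary does not say how the constraint is enforced, so the sketch below assumes one standard option, Sinkhorn-Knopp normalization, and should not be read as the paper's implementation. The names `sinkhorn_knopp`, `mix_streams`, and `n_streams` are hypothetical and chosen only for illustration.

```python
import math
import random


def sinkhorn_knopp(logits, n_iters=50):
    """Map an unconstrained square matrix to an approximately doubly stochastic one.

    Assumption: Sinkhorn-Knopp is used here purely as an illustration of how a
    matrix can be pushed onto the Birkhoff polytope.
    """
    n = len(logits)
    # Exponentiate (after subtracting the max for stability) so entries are positive.
    m = max(max(row) for row in logits)
    mat = [[math.exp(x - m) for x in row] for row in logits]
    for _ in range(n_iters):
        # Row step: rescale each row to sum to 1.
        mat = [[x / s for x in row] for row, s in ((r, sum(r)) for r in mat)]
        # Column step: rescale each column to sum to 1.
        col = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col[j] for j in range(n)] for i in range(n)]
    return mat


def mix_streams(mixing, streams):
    """Mix n residual streams (lists of floats) with an n x n mixing matrix."""
    n, d = len(streams), len(streams[0])
    return [
        [sum(mixing[i][j] * streams[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]


random.seed(0)
n_streams = 4  # hypothetical number of parallel residual streams
raw = [[random.uniform(-1.0, 1.0) for _ in range(n_streams)] for _ in range(n_streams)]
H = sinkhorn_knopp(raw)

row_sums = [sum(row) for row in H]
col_sums = [sum(H[i][j] for i in range(n_streams)) for j in range(n_streams)]

# Toy streams: because every column of H sums to 1, mixing preserves the
# total across streams in each feature dimension.
streams = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mixed = mix_streams(H, streams)
```

Because each row of a doubly stochastic matrix is a convex combination, mixing cannot amplify the largest activation across streams, and because each column sums to 1, the total signal across streams is preserved. That is one intuition for why such a constraint could help stability, offered here as a reading of the summary and not as a claim from the paper.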
not much happened today
nomos-1 axiomprover devstral-2-small deepseek-v3.2 claude-code cursor-2.2 claude-opus-4.5 gpt-5 claude-sonnet-4.5 gemini-3-pro llama qwen mistral gemma nousresearch thinkymachines mistral-ai deepseek anthropic cursor microsoft langchain-ai openai gemini intel vllm_project danielhanchen math formal-reasoning agentic-systems asynchronous-execution multi-agent-systems observability benchmarking quantization post-training-quantization training-speedup kernel-optimization inference-efficiency
NousResearch's Nomos 1 is a 30B open math model that achieves a top Putnam score with only ~3B active parameters, making inference feasible on consumer Macs. AxiomProver also posts top Putnam results using ThinkyMachines' RL stack. Mistral's Devstral 2 Small is preferred over DeepSeek v3.2 in 71% of comparisons, with better speed and cost. Anthropic's Claude Code introduces asynchronous agent execution. Cursor 2.2 adds deep agent primitives such as Debug and Plan Modes. VS Code launches unified agent chat sessions that improve multi-agent workflows. LangChain releases "Polly" for agent observability. The Stirrup harness leads OpenAI's GDPval benchmark, with Claude Opus 4.5, GPT-5, and Gemini 3 Pro following. In quantization, vLLM integrates Intel's AutoRound PTQ for efficient serving. Unsloth achieves up to 3× training speedups with new kernels across Llama, Qwen, Mistral, and Gemma models. "Compositional reasoning + specialized post-training under constrained active params can rival frontier closed models on formal math."
not much happened today
qwen3-max-thinking minimax-m2 claude-3-sonnet llamaindex-light chronos-2 openai aws microsoft nvidia gpu_mode vllm alibaba arena llamaindex amazon anthropic gradio compute-deals gpu-optimization kernel-optimization local-serving reasoning long-context benchmarks long-term-memory time-series-forecasting agent-frameworks oauth-integration developer-tools sama gdb andrewcurran_ a1zhang m_sirovatka omarsar0 _philschmid
OpenAI and AWS announced a strategic partnership involving a $38B compute deal to deploy hundreds of thousands of NVIDIA GB200 and GB300 chips, while Microsoft secured a license to ship NVIDIA GPUs to the UAE with a planned $7.9B datacenter investment. A 3-month NVFP4 kernel optimization competition on Blackwell B200s was launched by NVIDIA and GPU_MODE with prizes including DGX Spark and RTX 50XX GPUs. vLLM gains traction for local LLM serving, exemplified by PewDiePie's adoption. Alibaba previewed the Qwen3-Max-Thinking model hitting 100% on AIME 2025 and HMMT benchmarks, signaling advances in reasoning with tool use. The MIT-licensed MiniMax-M2 230B MoE model topped the Arena WebDev leaderboard, tying with Claude Sonnet 4.5 Thinking 32k. Critiques emerged on OSWorld benchmark stability and task validity. LlamaIndex's LIGHT framework demonstrated significant improvements in long-term memory tasks over raw context and RAG baselines, with gains up to +160.6% in summarization at 10M tokens. Amazon introduced Chronos-2, a time-series foundation model for zero-shot forecasting. The MCP ecosystem expanded with new tools like mcp2py OAuth integration and Gemini Docs MCP server, alongside a build sprint by Anthropic and Gradio offering substantial credits and prizes. "OSWorld doesn’t really exist—different prompt sets = incomparable scores" highlights benchmarking challenges.