All tags
Person: "ofirpress"
Kimi K2 Thinking: 1T-A32B params, SOTA HLE, BrowseComp, TauBench && Soumith leaves Pytorch
kimi-k2-thinking gemini moonshot-ai google apple vllm_project arena baseten yupp_ai mixture-of-experts quantization int4 context-window agentic-ai benchmarking model-deployment inference-acceleration api performance-optimization eliebakouch nrehiew_ andrew_n_carr ofirpress artificialanlys sundarpichai akhaliq
Moonshot AI launched Kimi K2 Thinking, a 1 trillion parameter mixture-of-experts (MoE) model with 32 billion active experts, a 256K context window, and native INT4 quantization-aware training. It achieves state-of-the-art results on benchmarks like HLE (44.9%), BrowseComp (60.2%), and agentic tool use with 200-300 sequential tool calls. The model is deployed with vLLM support and OpenAI-compatible APIs, available on platforms like Arena, Baseten, and Yupp. Early user reports note some API instability under launch load. Meanwhile, Google announced the TPU v7 (Ironwood) with a 10× peak performance improvement over TPU v5p, aimed at training and agentic inference for models like Gemini. Apple added support for M5 Neural Accelerators in llama.cpp for inference acceleration.
not much happened today
kimi-k2 qwen3-next nemotron-nano-2 granite-4.0 gpt-4.5 copilot codex vllm perplexity-ai ibm anthropic graphiti claude cursor-ai microsoft mixture-of-experts model-integration cloud-computing hybrid-models benchmarking agent-systems memory-persistence semantic-search code-retrieval context-length-optimization tool-use evaluation-frameworks software-development scaling01 cedric_chee aravsrinivas omarsar0 _avichawla pierceboggan jo_parkhurst jyangballin ofirpress ml_angelopoulos
Kimi-K2 Reasoner has been integrated into vLLM and will soon be supported by SGLang, featuring a massive 1.2 trillion parameter MoE configuration. Perplexity AI released research on cloud-portable trillion-parameter MoE kernels optimized for AWS EFA, with potential integration into vLLM. IBM's vLLM team formalized hybrid dense and sparse expert models, supporting models like Qwen3-Next, Nemotron Nano 2, and Granite 4.0. Kimi-K2 reportedly scores 77% on GPQA Diamond, outperforming GPT-4.5 at 71.4%, though this is unverified.
Anthropic published a guide on efficient tool-heavy agent systems using MCP patterns, drastically reducing context tokens by ~98.7%. Graphiti MCP demonstrated shared memory across apps like Claude Desktop and Cursor for persistent agent memory. VS Code introduced an "Agent sessions" feature to unify agent management, including Copilot and Codex. Cursor AI improved coding accuracy via semantic search and code retrieval embeddings. New evaluation frameworks like CodeClash and LMArena assess agent and coding model performance in realistic multi-round tasks and occupation-tagged leaderboards.
GPT-5 Codex launch and OpenAI's quiet rise in Agentic Coding
gpt-5-codex qwen3-next-80b openai alibaba together-ai nvidia agentic-ai software-engineering long-context mixture-of-experts model-optimization cuda-acceleration inference-efficiency routing task-adaptive-thinking sama swyx omarsar0 ofirpress
OpenAI released GPT-5-Codex, an agentic coding model optimized for long-running software engineering tasks with dynamic task-adaptive thinking, multi-hour autonomy, and improved code quality. It achieves 51% accuracy on an unreleased large refactor benchmark and integrates deeply with developer tools like Xcode. Meanwhile, Alibaba launched Qwen3-Next-80B, a hybrid MoE model with native long-context support (262k tokens, extensible to 1M+), targeting efficient reasoning and repository-scale code analysis, supported by Together AI and NVIDIA with CUDA-accelerated attention. The trend towards hybrid SSM + MoE architectures is noted, emphasizing efficiency and scaling in China and US training regimes. Community discussions highlight the importance of variable compute and routing for inference efficiency and quality.
OpenAI rolls out GPT-5 and GPT-5 Thinking to >1B users worldwide; -mini and -nano help claim Pareto Frontier
gpt-5 gpt-5-mini gpt-5-nano claude-4.1-sonnet claude-4.1-opus openai cursor_ai jetbrains microsoft notion perplexity_ai factoryai model-architecture context-windows pricing-models coding long-context prompt-engineering model-benchmarking model-integration tool-use reasoning sama scaling01 jeffintime embirico mustafasuleyman cline lmarena_ai nrehiew_ ofirpress sauers_
OpenAI launched GPT-5, a unified system featuring a fast main model and a deeper thinking model with a real-time router, supporting up to 400K context length and aggressive pricing that reclaims the Pareto Frontier of Intelligence. The rollout includes variants like gpt-5-mini and gpt-5-nano with significant cost reductions, and integrations with products such as ChatGPT, Cursor AI, JetBrains AI Assistant, Microsoft Copilot, Notion AI, and Perplexity AI. Benchmarks show GPT-5 performing strongly in coding and long-context reasoning, roughly matching Claude 4.1 Sonnet/Opus on SWE-bench Verified. The launch was accompanied by a GPT-5 prompting cookbook and notable community discussions on pricing and performance.
not much happened today
aria o1-preview o1-mini gemini-1.5-pro gemini-1.5-flash gemini-1.5 claude-3.5-sonnet rhymes-ai openai anthropic google meta-ai-fair oxylabs multimodality mixture-of-experts long-context retrieval-augmented-generation benchmarking software-engineering llm-evaluation prompt-engineering web-scraping python production-applications mervenoyann osanseviero dbrxmosaicai ylecun ofirpress clefourrier omarsar0 rohanpaul_ai svpino finbarrtimbers _philschmid
Rhymes AI released Aria, a new 25.3B parameter multimodal MoE model supporting text, code, image, and video with a 64k token context window and Apache-2.0 license. OpenAI's o1-preview and o1-mini models show consistent improvement over Anthropic and Google Gemini 1.5 Pro/Flash on long context RAG benchmarks up to 128k tokens, while Google Gemini 1.5 models excel at extreme context lengths up to 2 million tokens. Meta AI expanded rollout to 21 countries with new language support but remains unavailable in the EU. The one-year anniversary of SWE-bench benchmark for software engineering tasks was celebrated, alongside the introduction of SWE-bench Multimodal. New AI tools include OxyCopilot by Oxylabs for web scraping, Taipy for Python-based production apps, and Latitude for prompt engineering. Industry insights highlight changing AI funding dynamics and OpenAI's strategic focus on consumer products like ChatGPT. "all recaps done by Claude 3.5 Sonnet, best of 4 runs."