Person: "ofirpress"

claude-fable-5 mythos-5 gpt-5.5 claude-code fable-5 codex opus-4.8 kimi-k2.7-code anthropic artificial-analysis datacurve moonshot model-sovereignty export-controls coding-agent-evaluation benchmarking benchmark-gaming harness-quality benchmark-saturation open-source-models natolambert theo cohere kunchenguid clementdelangue dejavucoder ofirpress ramplabs

Anthropic suspended access to Claude Fable 5 and Mythos 5 due to US export controls, sparking a debate on model sovereignty and geopolitical risks for frontier AI vendors. Artificial Analysis updated its coding agent benchmark, replacing SWE-Bench Pro with DeepSWE, reshuffling rankings with Claude Code + Fable 5 [max] leading. Discussions highlighted the importance of harness quality versus pure model capability and concerns over benchmark saturation and realism. Additionally, Moonshot released the open-source model Kimi K2.7-Code.

May 29

not much happened today

claude-opus-4.8 gpt-5.5 qwen kimi deepseek anthropic huggingface langchain vllm_project reinforcement-learning tokenization agentic-ai api model-optimization long-context rust performance-optimization multi-agent-systems prompt-engineering jeremyphoward leo_linsky clementdelangue johnschulman2 omarsar0 hwchase17 ofirpress scaling01

Anthropic rolled out Claude Opus 4.8, which shows incremental improvements but mixed benchmark results, including better cooperation and coding behavior but some regressions in document parsing. Platform updates include mid-conversation system instructions that enhance long agent sessions, though API pricing remains a concern. A Hugging Face analysis revealed a critical bug in multi-turn reinforcement learning training loops involving tokenization mismatches, with a proposed "Token-In, Token-Out" fix. Agent harness design is emerging as a key optimization area: LangChain's Deep Agents v0.6 achieves strong performance at much lower cost, and the vLLM project released native weight syncing APIs and a Rust BPE tokenizer to improve tokenization efficiency. Debate continues on the value of multi-agent systems, with some seeing them as speedups and others expecting capability breakthroughs.

Nov 21, 2025

AI Engineer Code Summit

gemini-3-pro-image gemini-3 gpt-5 claude-3.7-sonnet google-deepmind togethercompute image-generation fine-tuning benchmarking agentic-ai physics model-performance instruction-following model-comparison time-horizon user-preference demishassabis omarsar0 lintool hrishioa teknium artificialanlys minyangtian1 ofirpress metr_evals scaling01

The recent AIE Code Summit showcased key developments including Google DeepMind's Gemini 3 Pro Image model, Nano Banana Pro, which features enhanced text rendering, 4K visuals, and fine-grained editing capabilities. Community feedback highlights its strong performance in design and visualization tasks, with high user preference scores. Benchmarking updates reveal the new CritPt physics frontier benchmark where Gemini 3 Pro outperforms GPT-5, though AI still lags on complex unseen research problems. Agentic task evaluations show varied time horizons and performance gaps between open-weight and closed frontier models, emphasizing ongoing challenges in AI research and deployment. "Instruction following remains jagged for some users," and model fit varies by use case, with Gemini 3 excelling in UI and code tasks but showing regressions in transcription and writing fidelity.

Nov 06, 2025

Kimi K2 Thinking: 1T-A32B params, SOTA HLE, BrowseComp, TauBench && Soumith leaves PyTorch

kimi-k2-thinking gemini moonshot-ai google apple vllm_project arena baseten yupp_ai mixture-of-experts quantization int4 context-window agentic-ai benchmarking model-deployment inference-acceleration api performance-optimization eliebakouch nrehiew_ andrew_n_carr ofirpress artificialanlys sundarpichai akhaliq

Moonshot AI launched Kimi K2 Thinking, a 1 trillion parameter mixture-of-experts (MoE) model with 32 billion active parameters, a 256K context window, and native INT4 quantization-aware training. It achieves state-of-the-art results on benchmarks like HLE (44.9%), BrowseComp (60.2%), and agentic tool use with 200-300 sequential tool calls. The model is deployed with vLLM support and OpenAI-compatible APIs, available on platforms like Arena, Baseten, and Yupp. Early user reports note some API instability under launch load. Meanwhile, Google announced the TPU v7 (Ironwood) with a 10× peak performance improvement over TPU v5p, aimed at training and agentic inference for models like Gemini. Apple added support for M5 Neural Accelerators in llama.cpp for inference acceleration.

Nov 05, 2025

not much happened today

kimi-k2 qwen3-next nemotron-nano-2 granite-4.0 gpt-4.5 copilot codex vllm perplexity-ai ibm anthropic graphiti claude cursor-ai microsoft mixture-of-experts model-integration cloud-computing hybrid-models benchmarking agent-systems memory-persistence semantic-search code-retrieval context-length-optimization tool-use evaluation-frameworks software-development scaling01 cedric_chee aravsrinivas omarsar0 _avichawla pierceboggan jo_parkhurst jyangballin ofirpress ml_angelopoulos

Kimi-K2 Reasoner, which features a massive 1.2 trillion parameter MoE configuration, has been integrated into vLLM and will soon be supported by SGLang. Perplexity AI released research on cloud-portable trillion-parameter MoE kernels optimized for AWS EFA, with potential integration into vLLM. IBM's vLLM team formalized hybrid dense and sparse expert models, supporting models like Qwen3-Next, Nemotron Nano 2, and Granite 4.0. Kimi-K2 reportedly scores 77% on GPQA Diamond, outperforming GPT-4.5 at 71.4%, though this is unverified. Anthropic published a guide on efficient tool-heavy agent systems using MCP patterns, which cut context tokens by ~98.7%. Graphiti MCP demonstrated shared memory across apps like Claude Desktop and Cursor for persistent agent memory. VS Code introduced an "Agent sessions" feature to unify agent management, including Copilot and Codex. Cursor AI improved coding accuracy via semantic search and code retrieval embeddings. New evaluation frameworks like CodeClash and LMArena assess agent and coding model performance through realistic multi-round tasks and occupation-tagged leaderboards.

Sep 15, 2025

GPT-5 Codex launch and OpenAI's quiet rise in Agentic Coding

gpt-5-codex qwen3-next-80b openai alibaba together-ai nvidia agentic-ai software-engineering long-context mixture-of-experts model-optimization cuda-acceleration inference-efficiency routing task-adaptive-thinking sama swyx omarsar0 ofirpress

OpenAI released GPT-5-Codex, an agentic coding model optimized for long-running software engineering tasks with dynamic task-adaptive thinking, multi-hour autonomy, and improved code quality. It achieves 51% accuracy on an unreleased large refactor benchmark and integrates deeply with developer tools like Xcode. Meanwhile, Alibaba launched Qwen3-Next-80B, a hybrid MoE model with native long-context support (262k tokens, extensible to 1M+), targeting efficient reasoning and repository-scale code analysis, supported by Together AI and NVIDIA with CUDA-accelerated attention. The trend towards hybrid SSM + MoE architectures is noted, emphasizing efficiency and scaling in China and US training regimes. Community discussions highlight the importance of variable compute and routing for inference efficiency and quality.

Aug 07, 2025

OpenAI rolls out GPT-5 and GPT-5 Thinking to >1B users worldwide; -mini and -nano help claim Pareto Frontier

gpt-5 gpt-5-mini gpt-5-nano claude-4.1-sonnet claude-4.1-opus openai cursor_ai jetbrains microsoft notion perplexity_ai factoryai model-architecture context-windows pricing-models coding long-context prompt-engineering model-benchmarking model-integration tool-use reasoning sama scaling01 jeffintime embirico mustafasuleyman cline lmarena_ai nrehiew_ ofirpress sauers_

OpenAI launched GPT-5, a unified system featuring a fast main model and a deeper thinking model with a real-time router, supporting up to 400K context length and aggressive pricing that reclaims the Pareto Frontier of Intelligence. The rollout includes variants like gpt-5-mini and gpt-5-nano with significant cost reductions, and integrations with products such as ChatGPT, Cursor AI, JetBrains AI Assistant, Microsoft Copilot, Notion AI, and Perplexity AI. Benchmarks show GPT-5 performing strongly in coding and long-context reasoning, roughly matching Claude 4.1 Sonnet/Opus on SWE-bench Verified. The launch was accompanied by a GPT-5 prompting cookbook and notable community discussions on pricing and performance.

Oct 11, 2024

not much happened today

aria o1-preview o1-mini gemini-1.5-pro gemini-1.5-flash gemini-1.5 claude-3.5-sonnet rhymes-ai openai anthropic google meta-ai-fair oxylabs multimodality mixture-of-experts long-context retrieval-augmented-generation benchmarking software-engineering llm-evaluation prompt-engineering web-scraping python production-applications mervenoyann osanseviero dbrxmosaicai ylecun ofirpress clefourrier omarsar0 rohanpaul_ai svpino finbarrtimbers _philschmid

Rhymes AI released Aria, a new 25.3B parameter multimodal MoE model supporting text, code, image, and video with a 64k token context window and Apache-2.0 license. OpenAI's o1-preview and o1-mini models show consistent improvement over Anthropic's models and Google's Gemini 1.5 Pro/Flash on long-context RAG benchmarks up to 128k tokens, while Google's Gemini 1.5 models excel at extreme context lengths up to 2 million tokens. Meta AI expanded its rollout to 21 countries with new language support but remains unavailable in the EU. The one-year anniversary of the SWE-bench benchmark for software engineering tasks was celebrated, alongside the introduction of SWE-bench Multimodal. New AI tools include OxyCopilot by Oxylabs for web scraping, Taipy for Python-based production apps, and Latitude for prompt engineering. Industry insights highlight changing AI funding dynamics and OpenAI's strategic focus on consumer products like ChatGPT. "all recaps done by Claude 3.5 Sonnet, best of 4 runs."