Person: "sebastienbubeck"
not much happened today
qwen-3.7 claude-opus-4.6 gpt-5.5 mythos quest-2b-35b deepseek google-deepmind langchain-ai anthropic openai alibaba sakana-ai stanford oxford ai2 harness-engineering agent-infrastructure coding-benchmarks security-guidance long-horizon-memory context-compression sleep-phase math-problem-solving fact-seeking citation-grounding science-evaluation sebastienbubeck
Harness engineering is emerging as the key differentiator for coding agents, emphasizing the stack of model + harness + eval loop over just stronger base models. DeepSeek is building a harness team to optimize interaction and verification loops, while Google's Gemini Managed Agents and LangChain formalize harness concepts like context governance and dynamic skill routing. New benchmarks like DeepSWE align closely with real developer experience, with Qwen3.7 Max and Claude Opus 4.6 showing strong agentic coding performance. Anthropic introduced a security-guidance plugin for Claude Code reducing security PR comments by 30–40%, and OpenAI highlighted GPT-5.5 in Codex for improved document parsing. In research, Claude Mythos solved Erdős problem #90 with a cleaner proof path than previous models, showing latent capabilities unlocked by appropriate harnesses. The paper "Language Models Need Sleep" proposes a sleep-like consolidation phase for long-horizon memory, addressing bottlenecks in persistent context storage. Open research agents like QUEST (2B–35B parameters) advance long-horizon fact-seeking and citation grounding, while the CUSP benchmark from Sakana/Stanford/Oxford/AI2 evaluates current model capabilities in science.
OpenAI's gpt-oss 20B and 120B, Claude Opus 4.1, DeepMind Genie 3
gpt-oss-120b gpt-oss-20b gpt-oss claude-4.1-opus claude-4.1 genie-3 openai anthropic google-deepmind mixture-of-experts model-architecture agentic-ai model-training model-performance reasoning hallucination-detection gpu-optimization open-weight-models realtime-simulation sama rasbt sebastienbubeck polynoamial kaicathyc finbarrtimbers vikhyatk scaling01 teortaxestex
OpenAI released the gpt-oss family, including gpt-oss-120b and gpt-oss-20b, its first open-weight models since GPT-2, designed for agentic tasks and licensed under Apache 2.0. These models use a Mixture-of-Experts (MoE) architecture that favors width over depth, along with less common features such as bias units in attention and a distinctive SwiGLU variant. The 120B model was trained with about 2.1 million H100 GPU hours. Meanwhile, Anthropic launched claude-4.1-opus, touted as the best coding model currently available. DeepMind showcased genie-3, a realtime world simulation model with minute-long consistency. The releases highlight advances in open-weight models, reasoning capabilities, and world simulation. Key figures like @sama, @rasbt, and @SebastienBubeck provided technical insights and performance evaluations, noting both strengths and hallucination risks.