Topic: "coding-benchmarks"
not much happened today
qwen-3.7 claude-opus-4.6 gpt-5.5 mythos quest-2b-35b deepseek google-deepmind langchain-ai anthropic openai alibaba sakana-ai stanford oxford ai2 harness-engineering agent-infrastructure coding-benchmarks security-guidance long-horizon-memory context-compression sleep-phase math-problem-solving fact-seeking citation-grounding science-evaluation sebastienbubeck
Harness engineering is emerging as the key differentiator for coding agents, emphasizing the stack of model + harness + eval loop over just stronger base models. DeepSeek is building a harness team to optimize interaction and verification loops, while Google's Gemini Managed Agents and LangChain formalize harness concepts like context governance and dynamic skill routing. New benchmarks like DeepSWE align closely with real developer experience, with Qwen3.7 Max and Claude Opus 4.6 showing strong agentic coding performance. Anthropic introduced a security-guidance plugin for Claude Code reducing security PR comments by 30–40%, and OpenAI highlighted GPT-5.5 in Codex for improved document parsing. In research, Claude Mythos solved Erdős problem #90 with a cleaner proof path than previous models, showing latent capabilities unlocked by appropriate harnesses. The paper "Language Models Need Sleep" proposes a sleep-like consolidation phase for long-horizon memory, addressing bottlenecks in persistent context storage. Open research agents like QUEST (2B–35B parameters) advance long-horizon fact-seeking and citation grounding, while the CUSP benchmark from Sakana/Stanford/Oxford/AI2 evaluates current model capabilities in science.
not much happened today
gpt-oss-120b gpt-oss-20b kimi-k2 deepseek-r1 qwen-3-32b openai huggingface microsoft llamaindex ollama baseten fireworksai cerebras groq together anthropic google uk-aisi sliding-window-attention mixture-of-experts rope context-length mxfp4-format synthetic-data reasoning-core-hypothesis red-teaming benchmarking coding-benchmarks model-performance fine-tuning woj_zaremba sama huybery drjimfan jxmnop scaling01 arunv30 kevinweil xikun_zhang_ jerryjliu0 ollama basetenco reach_vb gneubig shxf0072 _lewtun
OpenAI released its first open models since GPT-2, gpt-oss-120b and gpt-oss-20b, which quickly trended on Hugging Face. Microsoft supports these models via Azure AI Foundry and Windows Foundry Local. Key architectural innovations include sliding window attention, mixture of experts (MoE), a RoPE variant, and a 256k context length. The models use a new MXFP4 format supported by llama.cpp. Hypotheses suggest gpt-oss was trained on synthetic data to enhance safety and performance, supporting the Reasoning Core Hypothesis. OpenAI announced a $500K bounty for red teaming with partners including Anthropic, Google, and the UK AISI. Performance critiques highlight inconsistent benchmarking results, with GPT-OSS-120B scoring 41.8% on the Aider Polyglot coding benchmark, trailing competitors like Kimi-K2 and DeepSeek-R1. Some users note the model excels in math and reasoning but lacks common sense and practical utility.
Claude 3.7 Sonnet
claude-3-7-sonnet claude-3 claude-code anthropic hybrid-reasoning extended-thinking coding-benchmarks agentic-ai prompt-caching streaming token-capacity tool-use
Anthropic launched Claude 3.7 Sonnet, their most intelligent model to date featuring hybrid reasoning with two thinking modes: near-instant and extended step-by-step thinking. The release includes Claude Code, an agentic coding tool in limited preview, and supports a 128k output token capability in beta. Claude 3.7 Sonnet performs well on coding benchmarks like SWE-Bench Verified and Cognition's junior-dev eval, and introduces advanced features such as streaming thinking, prompt caching, and tool use. The model is also benchmarked on Pokebench, reflecting agentic capabilities similar to the Voyager paper. The launch is accompanied by extensive documentation, cookbooks, and prompting guides for extended thinking. "The first generally available hybrid reasoning model" and "first coding tool from Anthropic" were highlighted in social media announcements.