Topic: "coding-benchmarks"
not much happened today
gpt-oss-120b gpt-oss-20b kimi-k2 deepseek-r1 qwen-3-32b openai huggingface microsoft llamaindex ollama baseten fireworksai cerebras groq together anthropic google uk-aisi sliding-window-attention mixture-of-experts rope context-length mxfp4-format synthetic-data reasoning-core-hypothesis red-teaming benchmarking coding-benchmarks model-performance fine-tuning woj_zaremba sama huybery drjimfan jxmnop scaling01 arunv30 kevinweil xikun_zhang_ jerryjliu0 ollama basetenco reach_vb gneubig shxf0072 _lewtun
OpenAI released gpt-oss-120b and gpt-oss-20b, its first open models since GPT-2, and both quickly trended on Hugging Face. Microsoft supports the models through Azure AI Foundry and Windows Foundry Local. Architectural highlights include sliding window attention, mixture of experts (MoE), a RoPE variant, and a 128k context length. The models use a new MXFP4 format that llama.cpp supports. Some commentators hypothesize that gpt-oss was trained on synthetic data to improve safety and performance, which they read as support for the Reasoning Core Hypothesis. OpenAI announced a $500K red-teaming bounty with partners including Anthropic, Google, and the UK AISI. Critics point to inconsistent benchmark results: gpt-oss-120b scored 41.8% on the Aider Polyglot coding benchmark, trailing competitors such as Kimi-K2 and DeepSeek-R1. Some users report that the model excels at math and reasoning but lacks common sense and practical utility.
Claude 3.7 Sonnet
claude-3-7-sonnet claude-3 claude-code anthropic hybrid-reasoning extended-thinking coding-benchmarks agentic-ai prompt-caching streaming token-capacity tool-use
Anthropic launched Claude 3.7 Sonnet, its most intelligent model to date, featuring hybrid reasoning with two thinking modes: near-instant responses and extended step-by-step thinking. The release includes Claude Code, an agentic coding tool in limited preview, and supports up to 128k output tokens in beta. Claude 3.7 Sonnet performs well on coding benchmarks such as SWE-Bench Verified and Cognition's junior-dev eval, and offers features such as streaming thinking, prompt caching, and tool use. The model is also benchmarked on Pokebench, reflecting agentic capabilities similar to those in the Voyager paper. The launch is accompanied by extensive documentation, cookbooks, and prompting guides for extended thinking. Social media announcements highlighted it as "the first generally available hybrid reasoning model" and Claude Code as the "first coding tool from Anthropic".