Person: "bcherny"
not much happened today
opus-4.8 gemma-4 cognition frontiercode moonshot google claudedevs magicpath langsmith modal coding-evaluation agent-control verification agent-ergonomics sandbox-environments local-inference workflow-optimization cli-tools plugin-integration persistent-memory swyx dzhng claudecode bcherny reach_vb omarsar0 gneubig hamelhusain angaisb_
Cognition's FrontierCode benchmark shows how hard coding tasks remain: the best model, Opus 4.8, scores only about 13% on the hardest subset, which suggests coding is less solved than other benchmarks imply. Loops are becoming a prominent control metaphor for coding agents, with emphasis on clear goals, verification, and iteration, though some experts caution against overreliance on them. Agent ergonomics are improving through observability dashboards, sandbox environments, and workflow tools from ClaudeDevs, MagicPath, LangSmith, and Modal. Moonshot released major Kimi updates, including a stronger coding agent and a desktop agent product that supports up to 300 local sub-agents. Google advanced efficient local deployment with upgraded Gemma 4 checkpoints.
Anthropic's Claude Opus 4.7
claude-opus-4.7 codex gpt-rosalind anthropic openai cursor replit perplexity-ai microsoft coding agentic-ai tokenization long-context benchmarking image-processing software-engineering computer-use plugin-integration multi-terminal-support ssh-access model-expansion bcherny kimmonismus scaling01 valsai artificialanlys natolambert nrehiew_
Anthropic launched Claude Opus 4.7, its most capable Opus model yet, featuring stronger coding and agentic performance and improved long-context handling, alongside a new xhigh reasoning tier. Benchmarks show substantial gains, including 64.3% on SWE-bench Pro, 87.6% on SWE-bench Verified, and 69.4% on TerminalBench, with top rankings on the Vals Index and GDPval-AA. Technical changes include a new tokenizer and an increase in image input resolution to 3.75MP. Long-context benchmarks showed mixed results, with the focus shifting from MRCR to Graphwalks. Adoption was rapid across tools such as Cursor, VS Code, Replit Agent, and Perplexity. Meanwhile, OpenAI expanded Codex into a broader computer agent with Mac computer use, an in-app browser, image generation and editing, 90+ plugins, multi-terminal support, SSH access to remote devboxes, and richer file previews. OpenAI also introduced GPT-Rosalind, a new vertical model for the life sciences.
not much happened today
opus-4.6 glm-5 anthropic ibm perplexity-ai llamaindex deepseek google-chrome persistent-memory agent-infrastructure cross-device-synchronization long-context sparse-attention inference-optimization computer-architecture task-completion systems-performance pamelafox tadasayy llama_index bromann dair_ai omarsar0 abxxai teknuim bcherny kimmonismus _catwu alexalbert__ realyushibai
MCP tools remain relevant for deterministic APIs despite criticism of their ergonomics, and new web MCP support in Chrome v146 enables continuous browsing agents. Persistent memory is emerging as a key differentiator for agents: IBM improved task completion rates with it, and multi-agent memory is being framed as a computer architecture challenge. Agent UX is evolving toward always-on, cross-device operation, exemplified by Perplexity Computer on iOS and Claude Code session management. Anthropic made the 1M-token context window the default for Opus 4.6, with no extra long-context API charges; the model achieves 78.3% on MRCR v2 at 1M tokens. Sparse attention optimizations such as IndexCache in DeepSeek Sparse Attention yield significant speedups on large models with minimal code changes.