Model: "mythos"
not much happened today
qwen-3.7 claude-opus-4.6 gpt-5.5 mythos quest-2b-35b deepseek google-deepmind langchain-ai anthropic openai alibaba sakana-ai stanford oxford ai2 harness-engineering agent-infrastructure coding-benchmarks security-guidance long-horizon-memory context-compression sleep-phase math-problem-solving fact-seeking citation-grounding science-evaluation sebastienbubeck
Harness engineering is emerging as the key differentiator for coding agents, emphasizing the stack of model + harness + eval loop over just stronger base models. DeepSeek is building a harness team to optimize interaction and verification loops, while Google's Gemini Managed Agents and LangChain formalize harness concepts like context governance and dynamic skill routing. New benchmarks like DeepSWE align closely with real developer experience, with Qwen3.7 Max and Claude Opus 4.6 showing strong agentic coding performance. Anthropic introduced a security-guidance plugin for Claude Code reducing security PR comments by 30–40%, and OpenAI highlighted GPT-5.5 in Codex for improved document parsing. In research, Claude Mythos solved Erdős problem #90 with a cleaner proof path than previous models, showing latent capabilities unlocked by appropriate harnesses. The paper "Language Models Need Sleep" proposes a sleep-like consolidation phase for long-horizon memory, addressing bottlenecks in persistent context storage. Open research agents like QUEST (2B–35B parameters) advance long-horizon fact-seeking and citation grounding, while the CUSP benchmark from Sakana/Stanford/Oxford/AI2 evaluates current model capabilities in science.
not much happened today
mythos anthropic openai langchain nous-research cybersecurity sandboxing reinforcement-learning agent-architecture memory-management model-deployment software-security evaluation-methods kimmonismus paul_cal gneubig kentonvarda boazbaraktcs ylecun deanwball hwchase17 vtrivedy10 sarahcat21 aijoey
Anthropic's Mythos and OpenAI's upcoming restricted cyber-capable models are central to recent discussions, with debates over their security realism and how they should be evaluated. LangChain's Deep Agents deploy introduces an open-memory, model-agnostic agent harness architecture that emphasizes open protocols and memory ownership. Sandboxes are gaining prominence as core infrastructure for reinforcement learning, with labs running up to 100K concurrent sandboxes and aiming for 1M. The Hermes Agent by Nous continues to gain traction with new integrations and features such as a web-based HUD and token cost tracking.