Topic: "sample-efficiency"

claude-mythos opus-4.8 opus-4.7 gpt-5.5 gemini-3.1-pro gemini-3.5-flash claude-opus-4.7 anthropic sakana-ai meta-ai-fair princeton recursive-self-improvement benchmarking agent-evaluation long-horizon-tasks reliability reinforcement-learning sample-efficiency economically-meaningful-tasks agent-coherence anti-reward-hacking tooling rl-environments kimmonismus lechmazur teortaxestex hardmaru andrew_n_carr steverab pauliusztin_

Anthropic's Mythos/Opus cycle drew mixed reactions: praise for Claude Mythos's one-shot workflows, and concern over benchmark regressions in Opus 4.8. Opus 4.7 showed strong performance on chemistry tasks, "making Claude a chemist." Sakana AI launched an RSI Lab focused on recursive self-improvement under compute constraints, marking RSI as a formal research program. New benchmarks such as Agents' Last Exam (ALE) and SWE-Marathon test agents on long-horizon, economically meaningful tasks, and they reveal low pass rates and coherence problems. Princeton's ICML 2026 paper found that models such as GPT-5.5, Gemini 3.1 Pro / 3.5 Flash, and Claude Opus 4.7 still show no meaningful improvement in reliability. Tooling trends favor RL-environment-style frameworks for agent evaluation, exemplified by Meta's OpenEnv.

