All tags
Model: "gemini-3.5-flash"
not much happened today
claude-mythos opus-4.8 opus-4.7 gpt-5.5 gemini-3.1-pro gemini-3.5-flash claude-opus-4.7 anthropic sakana-ai meta-ai-fair princeton recursive-self-improvement benchmarking agent-evaluation long-horizon-tasks reliability reinforcement-learning sample-efficiency economically-meaningful-tasks agent-coherence anti-reward-hacking tooling rl-environments kimmonismus lechmazur teortaxestex hardmaru andrew_n_carr steverab pauliusztin_
Anthropic's Mythos/Opus cycle sparked mixed reactions with praise for Claude Mythos's one-shot workflows and concerns over Opus 4.8 benchmark regressions. Opus 4.7 showed strong chemistry task performance, "making Claude a chemist." Sakana AI launched an RSI Lab focusing on recursive self-improvement under compute constraints, marking RSI as a formal research program. New benchmarks like Agents' Last Exam (ALE) and SWE-Marathon test agents on long-horizon, economically meaningful tasks, revealing low pass rates and coherence challenges. Princeton's ICML 2026 paper found models like GPT 5.5, Gemini 3.1 Pro / 3.5 Flash, and Claude Opus 4.7 still lack meaningful reliability improvements. Tooling trends favor RL-environment-style frameworks for agent evaluation, exemplified by Meta's OpenEnv.
Google I/O 2026: Gemini 3.5 Flash, Omni, and Googleโs Agent Stack
gemini-3.5-flash gemini-3.1-pro gemini-3.5 gemini-omni google google-deepmind geminiapp agentic-ai multimodality video-generation model-performance benchmarking context-windows model-optimization model-scaling instruction-following api model-efficiency cost-analysis philschmid jeffdean
Google announced at I/O the repositioning of Gemini as a consumer AI and developer/agent platform with three key releases: Gemini 3.5 Flash for fast agentic and coding tasks, Gemini Omni for multimodal generation and editing including video, and the expanded Antigravity 2.0 agent stack. Google reports processing over 3.2 quadrillion tokens per month, a 7x increase year-over-year, with 900M+ monthly Gemini users across 230+ countries and 70+ languages. Gemini 3.5 Flash features a 1M-token context window, 65k max output tokens, 4 thinking levels, and "thought preservation" across turns, outperforming Gemini 3.1 Pro on multiple benchmarks and running up to 12x faster in Antigravity. Independent benchmarks show Gemini 3.5 Flash scoring 55 on the Intelligence Index, with higher costs than previous versions. Gemini Omni Flash supports text, image, video, and audio inputs for generative media tasks, available now for paid users.
not much happened today
codex deepseek-v4-pro gemini-3.5-flash gemini-3.1-pro gpt-5.5 claude-opus-4.7 openai claude deepseek gemini qwen model-performance cost-curves agent-products workflow-optimization product-differentiation benchmarking model-optimization gdb dzhng signulll teortaxestex ajambrosino reach_vb theo claudedevs _mohansolo artificialanlys scaling01 yuchenj_uw kimmonismus officiallogank designarena alezander907 giffmana jeremyphoward hamelhusain
AI News for 5/4/2026-5/5/2026 highlights a shift in AI product development emphasizing model + harness + workflow + UI + memory + economics over model quality alone, with notable updates from OpenAI Codex and Claude including new features like Appshots, auto mode, and Sonnet 4.6. DeepSeek made a significant market impact by permanently discounting DeepSeek-V4-Pro by 75%, drastically improving cost/performance ratios compared to Gemini 3.1 Pro, GPT-5.5, and Claude Opus 4.7. Meanwhile, Gemini 3.5 Flash showed benchmark improvements but received mixed feedback on practical utility. The competitive landscape continues to tighten with Qwen and other Chinese frontier models.