All tags
Model: "gemini-3.5-flash"
not much happened today
glm-5.2 glm-5.2-max opus-4.8 claude-fable-5 ornith-1.0 gemma-4 qwen-3.5 lfm2.5-230m gemini-3.5-flash codex z.ai databricks liquid-ai google-deepmind google sail hyperagent openai langchain coding-benchmarks agentic-ai reinforcement-learning model-optimization speculative-decoding hardware-optimization long-running-agents agent-persistence cost-efficiency computer-use safety-controls developer-tools token-consumption concurrent-agents philschmid gdb reach_vb eliebakouch
Z.ai's GLM-5.2 leads in coding and agent benchmarks with top scores like 1595 on Code Arena: Frontend and 34.29% reasoning accuracy with zero failures. Databricks improved GLM-5.2 speed to 392 tok/s using hardware and optimizations. Ornith-1.0, a new MIT-licensed coding model family, spans 9B to 397B parameters with strong benchmark results and a self-improving RL training method. Liquid AI released a small model for low-latency robotics/e-commerce use. Google integrated computer use into Gemini 3.5 Flash with safety controls and developer tools for device control. Startups like Sail and Hyperagent focus on long-running agents with persistent execution and cost efficiency. OpenAI reports growing internal Codex use for complex, cross-functional tasks, highlighting agent skill concurrency.
not much happened today
claude-mythos opus-4.8 opus-4.7 gpt-5.5 gemini-3.1-pro gemini-3.5-flash claude-opus-4.7 anthropic sakana-ai meta-ai-fair princeton recursive-self-improvement benchmarking agent-evaluation long-horizon-tasks reliability reinforcement-learning sample-efficiency economically-meaningful-tasks agent-coherence anti-reward-hacking tooling rl-environments kimmonismus lechmazur teortaxestex hardmaru andrew_n_carr steverab pauliusztin_
Anthropic's Mythos/Opus cycle sparked mixed reactions with praise for Claude Mythos's one-shot workflows and concerns over Opus 4.8 benchmark regressions. Opus 4.7 showed strong chemistry task performance, "making Claude a chemist." Sakana AI launched an RSI Lab focusing on recursive self-improvement under compute constraints, marking RSI as a formal research program. New benchmarks like Agents' Last Exam (ALE) and SWE-Marathon test agents on long-horizon, economically meaningful tasks, revealing low pass rates and coherence challenges. Princeton's ICML 2026 paper found models like GPT 5.5, Gemini 3.1 Pro / 3.5 Flash, and Claude Opus 4.7 still lack meaningful reliability improvements. Tooling trends favor RL-environment-style frameworks for agent evaluation, exemplified by Meta's OpenEnv.
Google I/O 2026: Gemini 3.5 Flash, Omni, and Googleโs Agent Stack
gemini-3.5-flash gemini-3.1-pro gemini-3.5 gemini-omni google google-deepmind geminiapp agentic-ai multimodality video-generation model-performance benchmarking context-windows model-optimization model-scaling instruction-following api model-efficiency cost-analysis philschmid jeffdean
Google announced at I/O the repositioning of Gemini as a consumer AI and developer/agent platform with three key releases: Gemini 3.5 Flash for fast agentic and coding tasks, Gemini Omni for multimodal generation and editing including video, and the expanded Antigravity 2.0 agent stack. Google reports processing over 3.2 quadrillion tokens per month, a 7x increase year-over-year, with 900M+ monthly Gemini users across 230+ countries and 70+ languages. Gemini 3.5 Flash features a 1M-token context window, 65k max output tokens, 4 thinking levels, and "thought preservation" across turns, outperforming Gemini 3.1 Pro on multiple benchmarks and running up to 12x faster in Antigravity. Independent benchmarks show Gemini 3.5 Flash scoring 55 on the Intelligence Index, with higher costs than previous versions. Gemini Omni Flash supports text, image, video, and audio inputs for generative media tasks, available now for paid users.
not much happened today
codex deepseek-v4-pro gemini-3.5-flash gemini-3.1-pro gpt-5.5 claude-opus-4.7 openai claude deepseek gemini qwen model-performance cost-curves agent-products workflow-optimization product-differentiation benchmarking model-optimization gdb dzhng signulll teortaxestex ajambrosino reach_vb theo claudedevs _mohansolo artificialanlys scaling01 yuchenj_uw kimmonismus officiallogank designarena alezander907 giffmana jeremyphoward hamelhusain
AI News for 5/4/2026-5/5/2026 highlights a shift in AI product development emphasizing model + harness + workflow + UI + memory + economics over model quality alone, with notable updates from OpenAI Codex and Claude including new features like Appshots, auto mode, and Sonnet 4.6. DeepSeek made a significant market impact by permanently discounting DeepSeek-V4-Pro by 75%, drastically improving cost/performance ratios compared to Gemini 3.1 Pro, GPT-5.5, and Claude Opus 4.7. Meanwhile, Gemini 3.5 Flash showed benchmark improvements but received mixed feedback on practical utility. The competitive landscape continues to tighten with Qwen and other Chinese frontier models.