subscribe / issues / tags /

Person: "dejavucoder"

not much happened today

claude-fable-5 mythos-5 gpt-5.5 claude-code fable-5 codex opus-4.8 kimi-k2.7-code anthropic artificial-analysis datacurve moonshot model-sovereignty export-controls coding-agent-evaluation benchmarking benchmark-gaming harness-quality benchmark-saturation open-source-models natolambert theo cohere kunchenguid clementdelangue dejavucoder ofirpress ramplabs

Anthropic suspended access to Claude Fable 5 and Mythos 5 due to US export controls, sparking a debate on model sovereignty and geopolitical risks for frontier AI vendors. Artificial Analysis updated its coding agent benchmark, replacing SWE-Bench Pro with DeepSWE, reshuffling rankings with Claude Code + Fable 5 [max] leading. Discussions highlighted the importance of harness quality versus pure model capability and concerns over benchmark saturation and realism. Additionally, Moonshot released the open-source model Kimi K2.7-Code.

not much happened today

claude-opus-4.6 capybara glm-5.1 qwen-3.5-14b qwen-27b qwen3.5-35b anthropic google zhipu model-scaling coding academic-reasoning cybersecurity quantization local-inference model-benchmarking inference-optimization model-performance agent-products scaling01 yuchenj_uw kimmonismus m1astra dejavucoder iscienceluvr gaoj0017

Anthropic is reportedly introducing a new AI model tier called Capybara, which is larger and more intelligent than Claude Opus 4.6, showing improved performance in coding, academic reasoning, and cybersecurity. The model is speculated to be around 10 trillion parameters, with Google potentially funding Anthropic's data center expansion. Meanwhile, Zhipu released GLM-5.1, advancing open coding models and narrowing the gap with closed models. Local inference economics are improving, highlighted by efficient deployments of Qwen 3.5 14B, Qwen 27B, and Qwen3.5-35B models with quantization techniques like TurboQuant vLLM. However, TurboQuant's benchmarking claims face criticism from researchers. Overall, the AI landscape shows aggressive scaling, local model deployment, and agent products gaining traction.

GDPVal finding: Claude Opus 4.1 within 95% of AGI (human experts in top 44 white collar jobs)

claude-4.1-opus gpt-5-high gptnext gemini-2.5-flash gemini-2.5-flash-lite deepseek-v3.1-terminus google-chirp-2 qwen-2.5b openai anthropic google nvidia artificial-analysis deepseek benchmarking agentic-ai tool-use long-context speech-to-text model-evaluation reasoning pricing model-performance kevinweil gdb dejavucoder yuchenj_uw lhsummers

OpenAI's Evals team released GDPval, a comprehensive evaluation benchmark covering 1,320 tasks across 44 predominantly digital occupations, assessing AI models against human experts with 14 years average experience. Early results show Claude 4.1 Opus outperforming human experts in most categories and GPT-5 high trailing behind, with projections that GPTnext could match human performance by mid-2026. The benchmark is positioned as a key metric for policymakers and labor impact forecasting. Additionally, Artificial Analysis reported improvements in Gemini 2.5 Flash/Flash-Lite and DeepSeek V3.1 Terminus models, alongside new speech-to-text benchmarks (AA-WER) highlighting leaders like Google Chirp 2 and NVIDIA Canary Qwen2.5B. Agentic AI advances include Kimi OK Computer, an OS-like agent with extended tool capabilities and new vendor verification tools.

© 2026 • AINews

You can also subscribe by rss .

Press Esc or click anywhere to close