Person: "dejavucoder"
not much happened today
claude-opus-4.6 capybara glm-5.1 qwen-3.5-14b qwen-27b qwen3.5-35b anthropic google zhipu model-scaling coding academic-reasoning cybersecurity quantization local-inference model-benchmarking inference-optimization model-performance agent-products scaling01 yuchenj_uw kimmonismus m1astra dejavucoder iscienceluvr gaoj0017
Anthropic is reportedly introducing a new AI model tier called Capybara, which is larger and more intelligent than Claude Opus 4.6 and shows improved performance in coding, academic reasoning, and cybersecurity. The model is speculated to have around 10 trillion parameters, and Google may fund Anthropic's data center expansion. Meanwhile, Zhipu released GLM-5.1, advancing open coding models and narrowing the gap with closed models. Local inference economics are improving, highlighted by efficient deployments of Qwen 3.5 14B, Qwen 27B, and Qwen3.5-35B models with quantization techniques such as TurboQuant in vLLM. However, researchers have criticized TurboQuant's benchmarking claims. Overall, the AI landscape shows aggressive scaling, wider local model deployment, and agent products gaining traction.
GDPVal finding: Claude Opus 4.1 within 95% of AGI (human experts in top 44 white collar jobs)
claude-4.1-opus gpt-5-high gptnext gemini-2.5-flash gemini-2.5-flash-lite deepseek-v3.1-terminus google-chirp-2 qwen-2.5b openai anthropic google nvidia artificial-analysis deepseek benchmarking agentic-ai tool-use long-context speech-to-text model-evaluation reasoning pricing model-performance kevinweil gdb dejavucoder yuchenj_uw lhsummers
OpenAI's Evals team released GDPval, a comprehensive evaluation benchmark covering 1,320 tasks across 44 predominantly digital occupations, assessing AI models against human experts who average 14 years of experience. Early results show Claude 4.1 Opus outperforming human experts in most categories, with GPT-5 high trailing behind, and projections suggest that GPTnext could match human performance by mid-2026. The benchmark is positioned as a key metric for policymakers and labor impact forecasting. Additionally, Artificial Analysis reported improvements in the Gemini 2.5 Flash/Flash-Lite and DeepSeek V3.1 Terminus models, alongside a new speech-to-text benchmark (AA-WER) highlighting leaders such as Google Chirp 2 and NVIDIA Canary Qwen 2.5B. Agentic AI advances include Kimi's OK Computer, an OS-like agent with extended tool capabilities, and new vendor verification tools.