GDPval finding: Claude Opus 4.1 within 95% of AGI (human experts in top 44 white-collar jobs)
claude-4.1-opus gpt-5-high gptnext gemini-2.5-flash gemini-2.5-flash-lite deepseek-v3.1-terminus google-chirp-2 qwen-2.5b openai anthropic google nvidia artificial-analysis deepseek benchmarking agentic-ai tool-use long-context speech-to-text model-evaluation reasoning pricing model-performance kevinweil gdb dejavucoder yuchenj_uw lhsummers
OpenAI's Evals team released GDPval, an evaluation benchmark spanning 1,320 tasks across 44 predominantly digital occupations that pits AI models against human experts with an average of 14 years of experience. Early results show Claude 4.1 Opus leading the field, at or above human-expert level in most categories, with GPT-5 high trailing; projections suggest GPTnext could match human performance by mid-2026. The benchmark is positioned as a key metric for policymakers and labor-impact forecasting. Separately, Artificial Analysis reported improvements to the Gemini 2.5 Flash/Flash-Lite and DeepSeek V3.1 Terminus models, and launched a new speech-to-text benchmark (AA-WER) whose leaders include Google Chirp 2 and NVIDIA Canary Qwen2.5B. On the agentic-AI front, Kimi's OK Computer, an OS-like agent, adds extended tool capabilities and new vendor verification tools.
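The AA-WER benchmark ranks speech-to-text models by word error rate (WER): the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal sketch of that computation (illustrative only; the function name and structure are mine, not Artificial Analysis's code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # match or substitution
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # deletion, insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word ("on") out of five reference words -> WER of 0.2.
print(wer("the cat sat on mat", "the cat sat mat"))  # 0.2
```

Lower is better; a WER of 0.0 means a verbatim transcript, and values above 1.0 are possible when the hypothesis inserts many spurious words.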