All tags
Model: "gpt-5-high"
DeepSeek V3.2 & 3.2-Speciale: GPT5-High Open Weights, Context Management, Plans for Compute Scaling
deepseek-v3.2 deepseek-v3.2-speciale gpt-5-high sonnet-4.5 gemini-3-pro deepseek_ai lm-arena agentic-ai reinforcement-learning large-context-windows model-benchmarking model-performance multi-agent-systems model-training model-deployment suchenzang teortaxestex
DeepSeek launched the DeepSeek V3.2 family including Standard, Thinking, and Speciale variants with up to 131K context window and competitive benchmarks against GPT-5-High, Sonnet 4.5, and Gemini 3 Pro. The release features a novel Large Scale Agentic Task Synthesis Pipeline focusing on agentic behaviors and improvements in reinforcement learning post-training algorithms. The models are available on platforms like LM Arena with pricing around $0.28/$0.42 per million tokens. Community feedback is mixed, praising the frontier reasoning capabilities but critiquing the chat UI experience. Key figures include Susan Zhang and Teortaxes who provided commentary on the release.
Sora 2: new video+audio model and OpenAI's first Social Network
sora-2 claude-4.5-sonnet gpt-5-high openai anthropic video-generation character-consistency social-networks agentic-ai token-efficiency benchmarking model-performance context-management coding math sama
Sora 2 released with improvements on physical world video modeling and a new "character consistency" feature allowing real-world element injection from a single video. The model powers a new Sora social network app with profiles, DMs, and viral videos, emphasizing user control over likeness use. OpenAI employees are actively experimenting with the model. Meanwhile, Anthropic launched Claude 4.5 Sonnet with enhanced intelligence, token efficiency, and agentic tool use, outperforming some competitors and closely tracking GPT-5-high on benchmarks. Ecosystem support includes LangSmith integration and strong coding/math benchmark results.
GDPVal finding: Claude Opus 4.1 within 95% of AGI (human experts in top 44 white collar jobs)
claude-4.1-opus gpt-5-high gptnext gemini-2.5-flash gemini-2.5-flash-lite deepseek-v3.1-terminus google-chirp-2 qwen-2.5b openai anthropic google nvidia artificial-analysis deepseek benchmarking agentic-ai tool-use long-context speech-to-text model-evaluation reasoning pricing model-performance kevinweil gdb dejavucoder yuchenj_uw lhsummers
OpenAI's Evals team released GDPval, a comprehensive evaluation benchmark covering 1,320 tasks across 44 predominantly digital occupations, assessing AI models against human experts with 14 years average experience. Early results show Claude 4.1 Opus outperforming human experts in most categories and GPT-5 high trailing behind, with projections that GPTnext could match human performance by mid-2026. The benchmark is positioned as a key metric for policymakers and labor impact forecasting. Additionally, Artificial Analysis reported improvements in Gemini 2.5 Flash/Flash-Lite and DeepSeek V3.1 Terminus models, alongside new speech-to-text benchmarks (AA-WER) highlighting leaders like Google Chirp 2 and NVIDIA Canary Qwen2.5B. Agentic AI advances include Kimi OK Computer, an OS-like agent with extended tool capabilities and new vendor verification tools.
not much happened today
gpt-5 gpt-5-high gpt-5-mini-high gpt-5-nano-high imagen-4 gemma-3-270m openai google lmsys model-releases model-performance prompt-engineering developer-tools image-generation model-optimization transformers tokenization model-scaling sama aidan_mclau kevinweil lmarena_ai edwinarbus gdb omarsar0 philschmid m4rkmc
OpenAI rolled out GPT-5 as the default in ChatGPT with new modes and a "warmer" personality, plus expanded message limits for Plus/Team users and Enterprise/Edu access. Performance rankings show gpt-5-high leading, with smaller variants also ranked, though critiques note some underperformance versus Chinese models and sensitivity to sycophancy. OpenAI enhanced developer tools with a "Quick eval" feature, coding tips, and an improved Playground. Google released Imagen 4 generally available with faster generation and higher resolution, plus the ultra-small Gemma 3 270M model with a large vocabulary and ecosystem support. Podcasts featured OpenAI leaders discussing GPT-5 systems, routing, and efficiency.