Frozen AI News archive

GDPval finding: Claude Opus 4.1 within 95% of AGI (human experts in top 44 white-collar jobs)

**OpenAI**'s Evals team released **GDPval**, an evaluation benchmark covering 1,320 tasks across 44 predominantly digital occupations, comparing AI models against human experts who average 14 years of experience. Early results show **Claude Opus 4.1** coming closest to human expert output, with **GPT-5 high** trailing behind and the trendline suggesting that **GPTnext** could match human performance by mid-2026. The benchmark is positioned as a key metric for policymakers and labor impact forecasting. Additionally, **Artificial Analysis** reported improvements in **Gemini 2.5 Flash/Flash-Lite** and **DeepSeek V3.1 Terminus** models, alongside new speech-to-text benchmarks (AA-WER) highlighting leaders like **Google Chirp 2** and **NVIDIA Canary Qwen 2.5B**. Agentic AI advances include **Kimi OK Computer**, an OS-like agent with extended tool capabilities and new vendor verification tools.


we are so close!

AI News for 9/24/2025-9/25/2025. We checked 12 subreddits, 544 Twitters and 23 Discords (194 channels, and 5737 messages) for you. Estimated reading time saved (at 200wpm): 472 minutes. Our new website is now up with full metadata search and beautiful vibe coded presentation of all past issues. See https://news.smol.ai/ for the full news breakdowns and give us feedback on @smol_ai!

OpenAI's Evals team is back for a third time this year with GDPval, which they are framing as a logical next step in model evals: the breadth of MMLU, but with the depth of agentic benchmarks like SWE-Bench and their own SWE-Lancer. GDPval (full paper here) takes its name from a top-down selection of major (>5%) sectors of GDP, filtered for "predominantly digital" knowledge work:

This resulted in 1,320 tasks across 44 occupations, on which models were then compared against human experts averaging 14 years of experience in those fields:

The two primary results charts are hugely validating: first, that OpenAI doesn't bias towards itself, and second, that Opus is within spitting distance of industry expert output:

and the model trendlines over time have GPTnext matching human performance roughly by mid-2026:

The word AGI isn't mentioned at all in the paper, but the original 2018 OpenAI charter defined AGI as "highly autonomous systems that outperform humans at most economically valuable work". If we were to wake up in Sept 2026 and find that GPT6-high-ultrathink-final-for-realsies was winning more than 50% of GDPval pairwise comparisons, with the confidence interval clear of 50%, then we could truly say that we have achieved AGI by 2018 standards.


AI Twitter Recap

OpenAI’s GDPval and the state of real‑world evals

Agentic coding and productized agents

Video reasoning and robotics

Model and method releases

Systems, serving, and infra

Industry moves and platform updates

Top tweets (by engagement)


AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. China AI Model Launches: Alibaba Qwen Extreme-Scaling Roadmap & Tencent Hunyuan Image 3.0

2. Local AI Alternatives: Fenghua No.3 CUDA/DirectX GPU + Post-Abliteration Uncensored LLM Finetunes

Less Technical AI Subreddit Recap

/r/Singularity, /r/Oobabooga, /r/MachineLearning, /r/OpenAI, /r/ClaudeAI, /r/StableDiffusion, /r/ChatGPT, /r/ChatGPTCoding, /r/aivideo

1. Gemini Robotics 1.5 and Veo 3 Zero‑Shot Video Reasoning

2. LLM Reasoning Reliability: Apple vs Anthropic and GPT‑5 Regression Reports

3. AI Industry Shifts: Anthropic’s New‑Grad Hiring Stance and China’s Fenghua No.3 GPU


AI Discord Recap

A summary of Summaries of Summaries by gpt-5

1. Agent Tooling: Chrome DevTools MCP and Perplexity Search API

2. Code World Models & Agent Execution Infra

3. GPU Systems & Diffusion Scale-Ups

4. Evaluations and Proactive Assistants

5. Training Tricks: Losses, Merges, and Data

gpt-5-mini

1. Model launches & leaderboard shakeups

2. Image‑generation arms race & inference tooling

3. Training, fine‑tuning, and experiment tooling

4. APIs, infra, and remote execution

5. Agent‑first products & one‑click deployers


Discord: High level Discord summaries

LMArena Discord


Unsloth AI (Daniel Han) Discord


Perplexity AI Discord


OpenRouter Discord


Cursor Community Discord


LM Studio Discord


OpenAI Discord


HuggingFace Discord


Moonshot AI (Kimi K-2) Discord


GPU MODE Discord


Yannick Kilcher Discord


Latent Space Discord


Eleuther Discord


Nous Research AI Discord


DSPy Discord


aider (Paul Gauthier) Discord


Manus.im Discord


MCP Contributors (Official) Discord


tinygrad (George Hotz) Discord


MLOps @Chipro Discord


Windsurf Discord


The LLM Agents (Berkeley MOOC) Discord has no new messages. If this guild has been quiet for too long, let us know and we will remove it.




Discord: Detailed by-Channel summaries and links

LMArena ▷ #general (1051 messages🔥🔥🔥):

GPT-5 Codex arrival, Qwen image editor, Gemini Flash vs nano banana, DeepSeek models


LMArena ▷ #announcements (4 messages):

Qwen3 models, GPT-5 Codex, Seedream-4-2k, Gemini 2.5 Flash


Unsloth AI (Daniel Han) ▷ #general (450 messages🔥🔥🔥):

GPT OSS 120B finetuning errors, MuonClip in Unsloth, Overabundance of information, AI safety research and Unsloth, P100 GPUs for finetuning


Unsloth AI (Daniel Han) ▷ #off-topic (342 messages🔥🔥):

Eval dataset size, eval/loss graph, GPU Recommendations, Gemini Pro degradation, TrainerCallback functions


Unsloth AI (Daniel Han) ▷ #help (110 messages🔥🔥):

Runpod Access, Llama 3 vs Gemma, Qwen 2.5 VL finetuning, Saving 120b models, Multi GPU Training


Unsloth AI (Daniel Han) ▷ #research (14 messages🔥):

Neurocognitive Modeling, Tversky Implementation, Vibe Check Network, XOR Function Verification, Tversky Parameters vs. Traditional NN


Perplexity AI ▷ #announcements (1 message):

Perplexity Search API, LLMs, Sonar, SDK Integration


Perplexity AI ▷ #general (832 messages🔥🔥🔥):

Airtel free premium, Qwen vs. Gemini, Perplexity image generation quota, Comet Stuttering, DeepSeek Terminus


Perplexity AI ▷ #sharing (7 messages):

Carl Sagan, 3I/ATLAS, Perplexity's Myspace Page, Arc Browser, Grogu


Perplexity AI ▷ #pplx-api (14 messages🔥):

Python SDK broken for streaming, Perplexity new Search API playground, Sonar vs Search API, Sonar charges


OpenRouter ▷ #announcements (1 message):

Accidental price change, Refunds issued, Additional validations implemented


OpenRouter ▷ #general (567 messages🔥🔥🔥):

Horizon Alpha, Dirty Talk Models, Zenith Sigma, Grok's Storywriting, Distilled Models


OpenRouter ▷ #new-models (1 message):

Readybot.io: OpenRouter - New Models


OpenRouter ▷ #discussion (58 messages🔥🔥):

Volume Discounts for OpenRouter, Microsoft 365 Copilot & Claude, Gemini-cli, Discussion and Helper Roles, Meta's CWM Model


Cursor Community ▷ #general (509 messages🔥🔥🔥):

MCP Server, Exa-ai, Context7, Generate Commit Message Language, Scroll to Bottom in the Chat Window


LM Studio ▷ #announcements (1 message):

LM Studio 0.3.27 Release, Chat Search Functionality, Chat Sorting Options, Dry Run Load Resource Estimate


LM Studio ▷ #general (175 messages🔥🔥):

Linux plugins, Ollama Fine Tuning, Training vs RAG, LM Studio token count, LM Studio update failing


LM Studio ▷ #hardware-discussion (161 messages🔥🔥):

Budget GPUs for local models, Tesla K80s for AI, Intel Arc A770 for multi-GPU setups, Strix Halo vs Mac for AI, Nvidia 5090 pricing


OpenAI ▷ #annnouncements (2 messages):

GDPval, ChatGPT Pulse


OpenAI ▷ #ai-discussions (188 messages🔥🔥):

GPT-5-Mini Common Sense, Censored GPT-OSS-20B, Suno V5 vs Napster, AI Rocket League bot, Google Gemini 2.5 Flash release


OpenAI ▷ #gpt-4-discussions (2 messages):

ChatGPT Default State, ChatGPT Mode-Locking, ChatGPT Reset Command, ChatGPT performance degradation


OpenAI ▷ #prompt-engineering (29 messages🔥):

Chain of Thought Prompting, Model Translation Performance, Essay Generation from a Surfer's POV, Interactive Prompting Infographic, Model Self-Evaluation Techniques


OpenAI ▷ #api-discussions (29 messages🔥):

Chain-of-Thought Prompting, Quality Translation, Model Performance, React component (Tailwind + shadcn/ui + Recharts + lucide)


HuggingFace ▷ #general (135 messages🔥🔥):

Duolingo deletion, LinkedIn posting strategies, HF Blog post on AI, HF Discuss forum issues, LAION-2B-en dataset reading


HuggingFace ▷ #today-im-learning (1 message):

GPU, Monitor, Drivers, Windows, Linux


HuggingFace ▷ #cool-finds (2 messages):

UIUC Finance Project, Trade Bench Insights


HuggingFace ▷ #i-made-this (1 message):

Vendor lock-in in AI Chatbots, AI Chatbot Supporting Multiple Providers, Marketing Tools for Small Studios and Solo Devs


HuggingFace ▷ #reading-group (2 messages):

Diffusion Models, Generative AI, ELBO-based models, VAEs, Variational Diffusion Models (VDMs)


HuggingFace ▷ #core-announcements (1 message):

Context-Parallelism, Diffusion Inference, Distributed Attention, Ring & Ulysses


HuggingFace ▷ #computer-vision (2 messages):

Topological Data Analysis, Persistent Images, Loss Functions


HuggingFace ▷ #smol-course (30 messages🔥):

Certificate issues and quiz completion, License and usage of the fine-tuning course, Smoltalk2 dataset size warning, HF Jobs permissions and authentication, Colab compatibility for the course


HuggingFace ▷ #agents-course (1 message):

0xobito404: Hello from Thailand, starting the course rn


Moonshot AI (Kimi K-2) ▷ #announcements (1 message):

OK Computer, Agent Mode, Multimedia Generation, Team-Level Polish, One-Link Deploy


Moonshot AI (Kimi K-2) ▷ #general-chat (147 messages🔥🔥):

Kimi Mini version, Moonshot team goals, Qwen model distillation, Kimi Computer agent, OpenAI compute


GPU MODE ▷ #general (15 messages🔥):

Hopper TMA, Modal carrying code agent rollouts, MI300 support on Modal, Llama3.3 70B Prefill vs Decode


GPU MODE ▷ #triton (1 message):

Triton pyproject.toml, uv add pip command


GPU MODE ▷ #cuda (13 messages🔥):

NCU profiling for SMEM bank conflicts, CUDA headers not being automatically included, WMMA kernel throwing unspecified launch failure, TMA minimum matmul kernel, Learning CUDA with limited hardware


GPU MODE ▷ #torch (16 messages🔥):

torchrun API, HF transformers static cache, CUDA streams in HF transformers, GraphMend for PyTorch 2


GPU MODE ▷ #cool-links (12 messages🔥):

CUDA, Triton, PTX memory consistency, Formal languages, GPU programming


GPU MODE ▷ #jobs (2 messages):

zml github


GPU MODE ▷ #beginner (8 messages🔥):

Inter-warp and intra-warp ops in NVIDIA GPUs, Independent thread scheduling, Multi-CTA matmul, GPGPU architecture, PMPP reading group


GPU MODE ▷ #pmpp-book (1 message):

PTX, Triton, NCCL, NCU profiling, PTX memory fencing


GPU MODE ▷ #triton-puzzles (1 message):

puzzle difficulty


GPU MODE ▷ #rocm (3 messages):

pytorch rocm, NPU, iGPU


GPU MODE ▷ #self-promotion (3 messages):

LLM Profit Margins, GPU Stand Design


GPU MODE ▷ #🍿 (6 messages):

Code generation, Two-stage approach, CWM paper citation


GPU MODE ▷ #thunderkittens (17 messages🔥):

H100 matmul kernel runtime error, nvshmem usage rationale, RDMA implementation, PyTorch support for rocm symmetric memory


GPU MODE ▷ #submissions (24 messages🔥):

MI300x8, amd-all2all leaderboard, amd-gemm-rs leaderboard


GPU MODE ▷ #hardware (4 messages):

Voltage Park H100 donation, Nebius Exclusive Sponsorship, Future Hackathon Event


GPU MODE ▷ #factorio-learning-env (2 messages):

FLE Eval System Prompt, Agent0 System Prompt, PR Submission


GPU MODE ▷ #amd-competition (11 messages🔥):

gemm-rs optimizations, atomic operations, GPU rentals for debugging


GPU MODE ▷ #cutlass (4 messages):

TmemAllocator vs cute.arch.alloc_tmem, TMEM load/stores in cutedsl, SMEM -> TMEM copy, TMEM -> SMEM copy, Blackwell dense blockscaled GEMM example


GPU MODE ▷ #mojo (2 messages):

Metal GPU target, custom bitcode writer, mojo assembly


GPU MODE ▷ #low-bit-training (1 message):

Modern QAT Papers, FP8 Training, MXFP4/NVFP4


Yannick Kilcher ▷ #general (105 messages🔥🔥):

Sinusoidal Positional Embeddings, Sine vs Cosine in Positional Encodings, Distillation performance estimates


Yannick Kilcher ▷ #paper-discussion (5 messages):

Applied math, arxiv paper, preprint paper


Yannick Kilcher ▷ #ml-news (6 messages):

SWE-bench verified, AlphaEvolve, Sakana AI, Yann LeCun, RL TTS


Latent Space ▷ #ai-general-chat (49 messages🔥):

Chrome DevTools MCP, Cursor CPU Usage, Meta Code World Model, Windsurf tab completion, ChatGPT Pulse


Latent Space ▷ #genmedia-creative-ai (1 message):

swyxio: https://x.com/1littlecoder/status/1970624850386661766


Eleuther ▷ #general (11 messages🔥):

AI Psychology Project, Positional Embeddings, AI Future Predictions, Subtle Psychological Manipulation


Eleuther ▷ #research (20 messages🔥):

CFG on Style Transfer, Knowledge Graph Completion, Evolutionary Algorithms for Kids, Super-Bias: Mask-Aware Nonlinear Combiner, LoRAs and Super Bias Combiner


Eleuther ▷ #lm-thunderdome (7 messages):

GSM8k evaluation results, flexible vs strict matching, merged models issue, reproducibility of errors


Eleuther ▷ #multimodal-general (1 message):

VLM, Mech Interp, Sparse Autoencoder (SAE)


Nous Research AI ▷ #general (26 messages🔥):

Meta code generation with world models, Training AI with arXiv data, Granite 4 model, RMS_NORM and NORM implementations


Nous Research AI ▷ #research-papers (3 messages):

AlphaXiv, Paper frustrations


Nous Research AI ▷ #interesting-links (3 messages):

Manifest AI, Open Source Integration


Nous Research AI ▷ #research-papers (3 messages):

AlphaXiv, Paper accessibility


DSPy ▷ #general (19 messages🔥):

PDF Processing with LLMs, OCR vs VLM for Layout Understanding, Qwen for OCR, Gemini 2.5 Pro for PDF/Image Understanding, DSPy and Attachments for PDF processing


DSPy ▷ #colbert (1 message):

Context Length Limitations, Repeating CLS Token, Performance Issues


aider (Paul Gauthier) ▷ #questions-and-tips (8 messages🔥):

Aider context clearing, Internet access for Aider, LLM benchmarks for coding, Polyglot benchmark rerun


Manus.im Discord ▷ #general (7 messages):

Manus PDF download issues, Beta Pro Access


MCP Contributors (Official) ▷ #general-wg (4 messages):

ModelContextProtocol issues, ReadResourceResult contents array, Web Resource html


tinygrad (George Hotz) ▷ #general (1 message):

Python Bindings, Direct Pip Installation


MLOps @Chipro ▷ #events (1 message):

Diffusion Models, Generative Models, Paper Reading Group