Topic: "benchmarking"
Execuhires Round 2: Scale-Meta, Lamini-AMD, and Instacart-OpenAI
o3-pro o3 o1-pro gpt-4o gpt-4.1 gpt-4.1-mini gpt-4.1-nano meta-ai-fair scale-ai lamini amd openai gemini google anthropic model-release benchmarking reasoning fine-tuning pricing model-performance direct-preference-optimization complex-problem-solving alexandr_wang sharon_zhou fidji_simo sama jack_rae markchen90 kevinweil gdb gregkamradt lechmazur wesrothmoney paul_cal imjaredz cto_junior johnowhitaker polynoamial scaling01
Meta hires Scale AI's Alexandr Wang to lead its new "Superintelligence" division following a $15 billion investment for a 49% stake in Scale. Lamini's Sharon Zhou joins AMD as VP of AI under Lisa Su, while Instacart's Fidji Simo becomes CEO of Applications at OpenAI, reporting to Sam Altman. Meta is offering compensation packages of over $10 million/year to top researchers, successfully recruiting Jack Rae from Gemini. OpenAI releases the o3-pro model to ChatGPT Pro users and the API, where it outperforms o3 and tops newer benchmarks such as Extended NYT Connections and SnakeBench. Despite being slower than o1-pro, o3-pro excels at reasoning and complex problem-solving. OpenAI also cuts o3 pricing by 80%, making it cheaper than GPT-4o and pressuring competitors like Google and Anthropic to lower prices. Users can now fine-tune the GPT-4.1 family using direct preference optimization (DPO) for subjective tasks.
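The DPO fine-tuning mentioned above trains on preference pairs rather than single target completions. A minimal sketch of one training record, assuming the JSONL preference format described in OpenAI's fine-tuning guide (the prompt and both completions here are invented):

```python
import json

# Hypothetical preference pair: same input, a preferred and a
# non-preferred assistant response. Field names follow OpenAI's
# documented DPO data format; verify against the current guide.
record = {
    "input": {
        "messages": [
            {"role": "user", "content": "Summarize this ticket in one sentence."}
        ]
    },
    "preferred_output": [
        {"role": "assistant",
         "content": "Customer reports login failures after the 2.3 update."}
    ],
    "non_preferred_output": [
        {"role": "assistant", "content": "The customer has a problem."}
    ],
}

# Each record becomes one line of the training .jsonl file.
line = json.dumps(record)
```

A job would then be created with something like `client.fine_tuning.jobs.create(model=..., training_file=..., method={"type": "dpo"})`; consult the API reference for the exact parameter names and supported GPT-4.1 snapshots.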
Reasoning Price War 2: Mistral Magistral + o3's 80% price cut + o3-pro
o3 o3-pro gpt-4.1 claude-4-sonnet gemini-2.5-pro magistral-small magistral-medium mistral-small-3.1 openai anthropic google-deepmind mistral-ai perplexity-ai reasoning token-efficiency price-cut benchmarking open-source model-releases context-windows gpu-optimization swyx sama scaling01 polynoamial nrehiew_ kevinweil gdb flavioad stevenheidel aravsrinivas
OpenAI announced an 80% price cut for its o3 model, making it competitively priced with GPT-4.1 and rivaling Anthropic's Claude 4 Sonnet and Google's Gemini 2.5 Pro. Alongside, o3-pro was released as a more powerful and reliable variant, though early benchmarks showed mixed performance relative to cost. Mistral AI launched its Magistral reasoning models, including an open-source 24B parameter version optimized for efficient deployment on consumer GPUs. The price reduction and new model releases signal intensified competition in reasoning-focused large language models, with notable improvements in token efficiency and cost-effectiveness.
Gemini 2.5 Pro (06-05) launched at AI Engineer World's Fair
gemini-2.5-pro qwen3-embedding-8b openthinker3-7b google qwen lighton morph-labs openai nvidia benchmarking reasoning coding math embedding-models late-interaction dataset-release model-performance model-architecture ai-conferences greg_brockman jensen_huang christian_szegedy swyx
On the second day of the AI Engineer World's Fair, Google's Gemini 2.5 Pro reclaimed the top spot on the LMArena leaderboard with a score of 1470 and a +24 Elo increase, showing improvements in coding, reasoning, and math. Qwen3 released state-of-the-art embedding and reranking models, with Qwen3-Embedding-8B topping the MTEB multilingual leaderboard. OpenThinker3-7B emerged as the top open reasoning model trained on the OpenThoughts3-1.2M dataset, outperforming previous models by 33%. LightOn introduced FastPlaid, achieving up to a 554% speedup for late-interaction models. Morph Labs hired Christian Szegedy as Chief Scientist to lead Verified Superintelligence development. The AI Engineer World's Fair featured a fireside chat with Greg Brockman and NVIDIA CEO Jensen Huang, highlighting the return of basic research and engineering best practices.
not much happened today
deepseek-r1-0528 o3 gemini-2.5-pro claude-opus-4 deepseek_ai openai gemini meta-ai-fair anthropic x-ai ollama hugging-face alibaba bytedance xiaomi reasoning reinforcement-learning benchmarking quantization local-inference model-evaluation open-weights transparency post-training agentic-benchmarks long-context hallucination-detection teortaxestex wenfeng danielhanchen awnihannun reach_vb abacaj
DeepSeek R1-0528 release brings major improvements in reasoning, hallucination reduction, JSON output, and function calling, matching or surpassing closed models like OpenAI o3 and Gemini 2.5 Pro on benchmarks such as Artificial Analysis Intelligence Index, LiveBench, and GPQA Diamond. The model ranks #2 globally in open weights intelligence, surpassing Meta AI, Anthropic, and xAI. Open weights and technical transparency have fueled rapid adoption across platforms like Ollama and Hugging Face. Chinese AI labs including DeepSeek, Alibaba, ByteDance, and Xiaomi now match or surpass US labs in model releases and intelligence, driven by open weights strategies. Reinforcement learning post-training is critical for intelligence gains, mirroring trends seen at OpenAI. Optimized quantization techniques (1-bit, 4-bit) and local inference enable efficient experimentation on consumer hardware. New benchmarks like LisanBench test knowledge, planning, memory, and long-context reasoning, with OpenAI o3 and Claude Opus 4 leading. Discussions highlight concerns about benchmark contamination and overemphasis on RL-tuned gains.
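The 1-bit and 4-bit quantization mentioned above works by storing weights as small integers plus a shared scale. A toy sketch of symmetric int4 quantization, not DeepSeek's or any library's actual scheme, just the basic idea that makes local inference on consumer hardware feasible:

```python
# Symmetric 4-bit quantization sketch: map floats to integers in
# [-7, 7] plus one shared float scale per group of weights.

def quantize_int4(values):
    """Return (int codes, scale); reconstruction error <= scale / 2."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # avoid div-by-zero
    q = [max(-7, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.53, 0.97, -0.04, 0.31]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Real schemes add per-block scales, asymmetric zero-points, and packed storage (two int4 codes per byte), but the accuracy/size trade-off is the same.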
DeepSeek-R1-0528 - Gemini 2.5 Pro-level model, SOTA Open Weights release
deepseek-r1-0528 gemini-2.5-pro qwen-3-8b qwen-3-235b deepseek-ai anthropic meta-ai-fair nvidia alibaba google-deepmind reinforcement-learning benchmarking model-performance open-weights reasoning quantization post-training model-comparison artificialanlys scaling01 cline reach_vb zizhpan andrewyng teortaxestex teknim1 lateinteraction abacaj cognitivecompai awnihannun
DeepSeek R1-0528 marks a significant upgrade, closing the gap with proprietary models like Gemini 2.5 Pro and surpassing models from Anthropic, Meta, NVIDIA, and Alibaba. This Chinese open-weights model leads several AI benchmarks, driven by reinforcement learning post-training rather than architecture changes, and uses markedly more reasoning tokens (about 23K per question). The China-US AI race intensifies as Chinese labs accelerate innovation through transparency and an open research culture. Key benchmarks include AIME 2024, LiveCodeBench, and GPQA Diamond.
not much happened today
deepseek-r1-0528 pali-gemma-2 gemma-3 shieldgemma-2 txgemma gemma-3-qat gemma-3n-preview medgemma dolphingemma signgemma claude-4 opus-4 claude-sonnet-4 codestral-embed bagel qwen nemotron-cortexa gemini-2.5-pro deepseek-ai huggingface gemma claude bytedance qwen nemotron sakana-ai-labs benchmarking model-releases multimodality code-generation model-performance long-context reinforcement-learning model-optimization open-source yuchenj_uw _akhaliq clementdelangue osanseviero alexalbert__ guillaumelample theturingpost lmarena_ai epochairesearch scaling01 nrehiew_ ctnzr
DeepSeek R1-0528 was released with availability on Hugging Face and inference partners. The Gemma model family continues prolific development, including PaliGemma 2, Gemma 3, and others. Claude 4 and its variants Opus 4 and Claude Sonnet 4 show top benchmark performance, including a new SOTA on ARC-AGI-2 and WebDev Arena. Codestral Embed introduces a 3072-dimensional code embedder. BAGEL, an open-source multimodal model by ByteDance, supports reading, reasoning, drawing, and editing with long mixed contexts. Benchmarking highlights include Nemotron-CORTEXA topping SWE-bench and Gemini 2.5 Pro's performance on VideoGameBench. Discussions of whether random rewards can improve RL training centered on Qwen models. "Opus 4 NEW SOTA ON ARC-AGI-2. It's happening - I was right" and "Claude 4 launch has dev moving at a different pace" reflect excitement in the community.
not much happened today
claude-4 claude-4-opus claude-4-sonnet gemini-2.5-pro gemma-3n imagen-4-ultra anthropic google-deepmind openai codebase-understanding coding agentic-performance multimodality text-to-speech video-generation model-integration benchmarking memory-optimization cline amanrsanger ryanpgreenblatt johnschulman2 alexalbert__ nearcyan mickeyxfriedman jeremyphoward gneubig teortaxesTex scaling01 artificialanlys philschmid
Anthropic's Claude 4 models (Opus 4, Sonnet 4) demonstrate strong coding abilities, with Sonnet 4 achieving 72.7% on SWE-bench and Opus 4 at 72.5%. Claude Sonnet 4 excels in codebase understanding and is considered SOTA on large codebases. Criticism arose over Anthropic's handling of ASL-3 security requirements. Demand for Claude 4 is high, with integration into IDEs and support from Cherry Studio and FastHTML. Google DeepMind introduced Gemini 2.5 Pro Deep Think and Gemma 3n, a mobile multimodal model reducing RAM usage by nearly 3x. Google's Imagen 4 Ultra ranks third in the Artificial Analysis Image Arena, available on Vertex AI Studio. Google also promoted Google Beam, an AI video model for immersive 3D experiences, and new text-to-speech models with multi-speaker support. The GAIA benchmark shows Claude 4 Opus and Sonnet leading in agentic performance.
not much happened today
kernelllm-8b gpt-4o deepseek-v3 mistral-medium-3 qwen3 blip3-o xgen-small anisora stable-audio-open-small alphaevolve meta-ai-fair mistral-ai qwen deepseek salesforce bilibili stability-ai google benchmarking model-performance multilinguality hardware-optimization multimodality image-generation video-generation text-to-audio model-parallelism chain-of-thought instruction-following reasoning mitigation-strategies reach_vb lmarena_ai theadimeline adcock_brett jxmnop dair_ai omarsar0
Meta released KernelLLM 8B, outperforming GPT-4o and DeepSeek V3 on KernelBench-Triton Level 1. Mistral Medium 3 debuted strongly in multiple benchmarks. Qwen3 models introduced a unified framework with multilingual support. DeepSeek-V3 features hardware-aware co-design. The BLIP3-o family was released for multimodal tasks using diffusion transformers. Salesforce launched xGen-Small models excelling in long-context and math benchmarks. Bilibili released AniSORA for anime video generation. Stability AI open-sourced Stable Audio Open Small optimized for Arm devices. Google's AlphaEvolve coding agent found a faster 4×4 matrix-multiplication procedure, the first improvement over Strassen's 1969 algorithm. Research shows chain-of-thought reasoning can harm instruction-following ability, with mitigation strategies like classifier-selective reasoning being most effective, though reasoning techniques show high variance and limited generalization. "Chain-of-thought (CoT) reasoning can harm a model's ability to follow instructions" and "Mitigation strategies such as few-shot in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning can counteract reasoning-induced failures".
Granola launches team notes, while Notion launches meeting transcription
gpt-4.1 gpt-4o-mini gpt-4.1-mini claude-opus claude-sonnet claude-o3 qwen3 seed1.5-vl llama-4 am-thinking-v1 openai anthropic alibaba meta-ai-fair huggingface granola coding instruction-following benchmarking model-releases reasoning image-generation collaborative-software model-performance kevinweil scaling01 steph_palazzolo andersonbcdefg reach_vb yuchenj_uw qtnx_ _akhaliq risingsayak
GPT-4.1 is now available in ChatGPT for Plus, Pro, and Team users, focusing on coding and instruction following, with GPT-4.1 mini replacing GPT-4o mini. Anthropic is releasing new Claude models including Claude Opus and Claude Sonnet, though criticism of hallucinations in OpenAI's o3 was also noted. Alibaba shared the Qwen3 Technical Report, while ByteDance's Seed1.5-VL posted strong benchmark results. Meta FAIR announced new models and datasets but faced criticism on Llama 4. AM-Thinking-v1 launched on Hugging Face as a 32B-scale reasoning model. Granola raised $43M in Series B and launched Granola 2.0 with a Notion-like UI. The AI ecosystem shows rapid iteration and cloning of ideas, emphasizing execution and distribution.
not much happened today
hunyuan-turbos qwen3-235b-a22b o3 gpt-4.1-nano grok-3 gemini-2.5-pro seed1.5-vl kling-2.0 tencent openai bytedance meta-ai-fair nvidia deepseek benchmarking model-performance moe reasoning vision video-understanding vision-language multimodality model-evaluation model-optimization lmarena_ai artificialanlys gdb _jasonwei iScienceLuvr _akhaliq _philschmid teortaxesTex mervenoyann reach_vb
Tencent's Hunyuan-Turbos has risen to #8 on the LMArena leaderboard, showing strong performance across major categories and significant improvement since February. The Qwen3 model family, especially the Qwen3 235B-A22B (Reasoning) model, is noted for its intelligence and efficient parameter usage. OpenAI introduced HealthBench, a new health evaluation benchmark developed with input from over 250 physicians, where models like o3, GPT-4.1 nano, and Grok 3 showed strong results. ByteDance released Seed1.5-VL, a vision-language model with a 532M-parameter vision encoder and a 20B active parameter MoE LLM, achieving state-of-the-art results on 38 public benchmarks. In vision-language, Kling 2.0 leads image-to-video generation, and Gemini 2.5 Pro excels in video understanding with advanced multimodal capabilities. Meta's Vision-Language-Action framework and updates on VLMs for 2025 were also highlighted.
Gemini 2.5 Pro Preview 05-06 (I/O edition) - the SOTA vision+coding model
gemini-2.5-pro claude-3.7-sonnet llama-nemotron qwen3 google-deepmind nvidia alibaba hugging-face multimodality coding reasoning model-release speech-recognition recommender-systems benchmarking demishassabis _philschmid lmarena_ai scaling01 fchollet
Gemini 2.5 Pro has been updated with enhanced multimodal image-to-code capabilities and dominates the WebDev Arena Leaderboard, surpassing Claude 3.7 Sonnet in coding and other tasks. Nvidia released the Llama-Nemotron model family on Hugging Face, noted for efficient reasoning and inference. Alibaba's Qwen3 models range from 0.6B to 235B parameters, including dense and MoE variants. KerasRS was released by François Chollet as a new recommender system library compatible with JAX, PyTorch, and TensorFlow, optimized for TPUs. These updates highlight advancements in coding, reasoning, and speech recognition models.
not much happened today
qwen3-14b qwen3-32b qwen3-235b phi-4-reasoning o3-mini command-a gemini-2.5-pro o4-mini olm-o2-1b o3 alibaba together-ai scaling01 microsoft deepseek cohere google epoch-ai-research inception-labs openai allenai quantization fine-tuning reinforcement-learning benchmarking video-generation diffusion-models model-performance model-evaluation model-release text-generation cline _philschmid iscienceluvr alexalbert__ _lewtun teortaxestex sarahookr reach_vb
Qwen model family released quantized versions of Qwen3 models including 14B, 32B, and 235B parameters, with promising coding capabilities in Qwen3-235B. Microsoft launched Phi-4-reasoning, a 14B parameter model distilled from OpenAI's o3-mini, emphasizing supervised fine-tuning and reinforcement learning, outperforming larger models in some benchmarks. Cohere's Command A leads SQL performance on Bird Bench. Google introduced the TRAJAN eval for video generation temporal consistency and updated the Gemini OpenAI compatibility layer. Inception Labs launched a diffusion LLM API claiming 5x speed improvements over autoregressive models. Community rankings show OpenAI's o3 model debuting strongly in web app-building tasks. Other releases include AllenAI's OLMo2 1B and additional Phi 4 variants. "Qwen3-235B shows promise for coding" and "Phi-4-reasoning tech report emphasizes SFT gains" highlight key advancements.
not much happened today
phi-4 phi-4-mini-reasoning qwen3-235b qwen3-moe-235b qwen3-moe-30b qwen3-dense-32b qwen3-dense-14b qwen3-dense-8b qwen3-dense-4b qwen3-dense-0.6b qwen2.5-omni-3b deepseek-prover-v2 llama llama-guard-4 prompt-guard-2 mimo-7b microsoft anthropic cursor alibaba togethercompute deepseek meta-ai-fair xiaomi openrouterai cohere reasoning model-fine-tuning model-evaluation benchmarking model-popularity open-source math model-scaling model-filtering jailbreak-prevention cline reach_vb vipulved akhaliq omarsar0 zhs05232838 huajian_xin mervenoyann karpathy random_walker sarahookr blancheminerva clefourrier
Microsoft released Phi-4-reasoning, a finetuned 14B reasoning model slightly behind QwQ but limited by data-transparency and token-efficiency issues. Anthropic introduced remote MCP server support and a 45-minute Research mode in Claude. Cursor published a model popularity list. Alibaba launched Qwen3-235B and other Qwen3 variants, highlighting budget-friendly coding and reasoning capabilities, with availability on the Together AI API. Microsoft also released Phi-4-Mini-Reasoning, reporting benchmark results on AIME 2025 and OmniMath. DeepSeek announced DeepSeek-Prover V2 with state-of-the-art math problem solving, scaling to 671B parameters. Meta AI's Llama models hit 1.2 billion downloads, with new Llama Guard 4 and Prompt Guard 2 for input/output filtering and jailbreak prevention. Xiaomi released the open-source reasoning model MiMo-7B, trained on 25 trillion tokens. Discussions on AI model evaluation highlighted issues with the LMArena leaderboard, data-access biases favoring proprietary models, and challenges in maintaining fair benchmarking, with suggestions for alternatives like OpenRouterAI rankings. "LMArena slop and biased" and "61.3% of all data going to proprietary model providers" were noted concerns.
LlamaCon: Meta AI gets into the Llama API platform business
llama-4 qwen3 qwen3-235b-a22b qwen3-30b-a3b qwen3-4b qwen2-5-72b-instruct o3-mini meta-ai-fair cerebras groq alibaba vllm ollama llamaindex hugging-face llama-cpp model-release fine-tuning reinforcement-learning moe multilingual-models model-optimization model-deployment coding benchmarking apache-license reach_vb huybery teortaxestex awnihannun thezachmueller
Meta celebrated progress in the Llama ecosystem at LlamaCon, launching an AI Developer platform with finetuning and fast inference powered by Cerebras and Groq hardware, though it remains waitlisted. Meanwhile, Alibaba released the Qwen3 family of large language models, including two MoE models and six dense models ranging from 0.6B to 235B parameters, with the flagship Qwen3-235B-A22B achieving competitive benchmark results and supporting 119 languages and dialects. The Qwen3 models are optimized for coding and agentic capabilities, are Apache 2.0 licensed, and have broad deployment support including local usage with tools like vLLM, Ollama, and llama.cpp. Community feedback highlights Qwen3's scalable performance and superiority over models like OpenAI's o3-mini.
Qwen 3: 0.6B to 235B MoE full+base models that beat R1 and o1
qwen-3 qwen3-235b-a22b qwen3-30b-a3b deepseek-r1 o1 o3-mini grok-3 gemini-2.5-pro alibaba google-deepmind deepseek mistral-ai mixture-of-experts reinforcement-learning benchmarking model-release model-architecture long-context multi-agent-systems inference dataset-release awnihannun prince_canuma actuallyisaak oriolvinyalsml iscienceluvr reach_vb teortaxestex omarsar0
Alibaba has released Qwen 3, featuring a range of models including two MoE variants, Qwen3-235B-A22B and Qwen3-30B-A3B, which demonstrate competitive performance against top models like DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro. The models introduce an "enable_thinking=True" mode with soft switching for inference-time scaling. The release is notable for its Apache 2.0 license and broad inference platform support including MCP. Dataset improvements and multi-stage RL post-training contribute to the performance gains. Meanwhile, Gemini 2.5 Pro from Google DeepMind shows strong coding and long-context reasoning capabilities, and DeepSeek R2 is anticipated soon. Twitter discussions highlight Qwen3's fine-grained MoE architecture, large context window, and multi-agent system applications.
gpt-image-1 - ChatGPT's imagegen model, confusingly NOT 4o, now available in API
gpt-image-1 o3 o4-mini gpt-4.1 eagle-2.5-8b gpt-4o qwen2.5-vl-72b openai nvidia hugging-face x-ai image-generation content-moderation benchmarking long-context multimodality model-performance supercomputing virology video-understanding model-releases kevinweil lmarena_ai _philschmid willdepue arankomatsuzaki epochairesearch danhendrycks reach_vb mervenoyann _akhaliq
OpenAI officially launched the gpt-image-1 API for image generation and editing, supporting an alpha channel for transparency and a "low" content-moderation setting. OpenAI's models o3 and o4-mini are leading in benchmarks for style control, math, coding, and hard prompts, with o3 ranking #1 in several categories. A new benchmark called Vending-Bench reveals performance variance in LLMs on extended tasks. GPT-4.1 ranks in the top 5 for hard prompts and math. Nvidia's Eagle 2.5-8B matches GPT-4o and Qwen2.5-VL-72B in long-video understanding. AI supercomputer performance doubles every 9 months, with xAI's Colossus costing an estimated $7 billion and the US dominating 75% of global performance. The Virology Capabilities Test shows OpenAI's o3 outperforms 94% of expert virologists. Nvidia also released the Describe Anything Model (DAM), a multimodal LLM for detailed image and video captioning, now available on Hugging Face.
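The transparency and moderation features above map to request parameters. A hedged sketch of a gpt-image-1 call, with the parameter names as used in OpenAI's Images API at launch (verify against the current API reference before relying on them):

```python
# Request parameters for an image-generation call with gpt-image-1.
# "background" and "moderation" correspond to the alpha-channel and
# "low" moderation features described above.
params = {
    "model": "gpt-image-1",
    "prompt": "A line-art icon of a rocket, transparent background",
    "size": "1024x1024",
    "background": "transparent",  # request alpha-channel output
    "moderation": "low",          # relaxed content-moderation setting
}

# With the official client this would be roughly:
#   from openai import OpenAI
#   img = OpenAI().images.generate(**params)
#   b64 = img.data[0].b64_json   # gpt-image-1 returns base64 image data
```

Unlike DALL·E endpoints that return URLs, gpt-image-1 responses carry base64-encoded image data, so the bytes must be decoded and written to a file.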
Gemini 2.5 Flash completes the total domination of the Pareto Frontier
gemini-2.5-flash o3 o4-mini google openai anthropic tool-use multimodality benchmarking reasoning reinforcement-learning open-source model-releases chain-of-thought coding-agent sama kevinweil markchen90 alexandr_wang polynoamial scaling01 aidan_mclau cwolferesearch
Gemini 2.5 Flash is introduced with a new "thinking budget" feature offering more control compared to Anthropic and OpenAI models, marking a significant update in the Gemini series. OpenAI launched o3 and o4-mini models, emphasizing advanced tool use capabilities and multimodal understanding, with o3 dominating several leaderboards but receiving mixed benchmark reviews. The importance of tool use in AI research and development is highlighted, with OpenAI Codex CLI announced as a lightweight open-source coding agent. The news reflects ongoing trends in AI model releases, benchmarking, and tool integration.
Llama 4's Controversial Weekend Release
llama-4 llama-3 llama-3-2 meta mixture-of-experts early-fusion attention-mechanisms fp8-training training-data benchmarking model-performance model-release multimodality open-models ahmad_al_dahle ylecun reach_vb yuchenj_uw
Meta released Llama 4, featuring two new medium-size MoE open models and a promised 2-trillion-parameter "Behemoth" model that would be the largest open model ever. The release included advanced training techniques like Chameleon-style early fusion with MetaCLIP, interleaved chunked attention without RoPE, native FP8 training, and training on up to 40 trillion tokens. Despite the hype, the release faced criticism for lack of transparency compared to Llama 3, implementation issues, and poor performance on some benchmarks. Meta leadership, including Ahmad Al Dahle, denied allegations of training on test sets. The smallest model, Scout, at 109B parameters is too large for consumer GPUs, and the claimed 10-million-token context is disputed. The community response has been mixed, with some praising the openness and others pointing out discrepancies and quality concerns.
not much happened today
o3 o4-mini gpt-5 sonnet-3.7 gemma-3 qwen-2.5-vl gemini-2.5-pro gemma-7b llama-3-1-405b openai deepseek anthropic google meta-ai-fair inference-scaling reward-modeling coding-models ocr model-preview rate-limiting model-pricing architectural-advantage benchmarking long-form-reasoning attention-mechanisms mixture-of-experts gpu-throughput sama akhaliq nearcyan fchollet reach_vb philschmid teortaxestex epochairesearch omarsar0
OpenAI announced that o3 and o4-mini models will be released soon, with GPT-5 expected in a few months, delayed for quality improvements and capacity planning. DeepSeek introduced Self-Principled Critique Tuning (SPCT) to enhance inference-time scalability for generalist reward models. Anthropic's Sonnet 3.7 remains a top coding model. Google's Gemma 3 is available on KerasHub, and Qwen 2.5 VL powers a new Apache 2.0 licensed OCR model. Gemini 2.5 Pro entered public preview with increased rate limits and pricing announced, becoming a preferred model for many tasks except image generation. Discussion also touched on Meta's architectural advantages, while Epoch AI's FrontierMath benchmark continues to challenge models' long-form reasoning and worldview development. Research reveals LLMs focus attention on the first token as an "attention sink," preserving representation diversity, demonstrated in Gemma 7B and LLaMa 3.1 models. MegaScale-Infer offers efficient serving of large-scale Mixture-of-Experts models with up to 1.90x higher per-GPU throughput.
not much happened today
gemini-2.5-pro chatgpt deepseek-v3 qwen-2.5 claude-3.5-sonnet claude-3.7-sonnet google anthropic openai llama_index langchain runway deepseek math benchmarking chains-of-thought model-performance multi-agent-systems agent-frameworks media-generation long-horizon-planning code-generation rasbt danielhanchen hkproj
Gemini 2.5 Pro shows strengths and weaknesses, notably lacking LaTeX math rendering unlike ChatGPT, and scored 24.4% on the 2025 USAMO. DeepSeek V3 ranks 8th and 12th on recent leaderboards. Qwen 2.5 models have been integrated into the PocketPal app. Research from Anthropic reveals that Chains-of-Thought (CoT) reasoning is often unfaithful, especially on harder tasks, raising safety concerns. OpenAI's PaperBench benchmark shows AI agents struggle with long-horizon planning, with Claude 3.5 Sonnet achieving only 21.0% accuracy. The CodeAct framework generalizes ReAct for dynamic code writing by agents. LangChain explains multi-agent handoffs in LangGraph. Runway Gen-4 marks a new phase in media creation.
not much happened today
gpt-2 r1 gemma-3 gemmacoder3-12b qwen2.5-omni openai deepseek berkeley alibaba togethercompute nvidia azure runway langchain bmw amazon open-source function-calling benchmarking code-reasoning multimodality inference-speed image-generation voice-generation animation robotics realtime-transcription webrtc sama clémentdelangue lioronai scaling01 cognitivecompai osanseviero jack_w_rae ben_burtenshaw theturingpost vipulved kevinweil tomlikesrobots adcock_brett juberti
OpenAI plans to release its first open-weight language model since GPT-2 in the coming months, signaling a move towards more open AI development. DeepSeek launched its open-source R1 model earlier this year, challenging perceptions of China's AI progress. Gemma 3 has achieved function calling capabilities and ranks on the Berkeley Function-Calling Leaderboard, while GemmaCoder3-12b improves code reasoning performance on LiveCodeBench. Alibaba_Qwen's Qwen2.5-Omni introduces a novel Thinker-Talker system and TMRoPE for multimodal input understanding. The TogetherCompute team achieved 140 TPS on a 671B parameter model, outperforming Azure and DeepSeek API on Nvidia GPUs. OpenAI also expanded ChatGPT features with image generation for all free users and a new voice release. Runway Gen-4 enhances animation for miniature dioramas, and LangChain launched a chat-based generative UI agent. Commercial deployment of Figure 03 humanoid robots at BMW highlights advances in autonomy and manufacturing scaling. New tools include OpenAI's realtime transcription API with WebRTC support and Amazon's Nova Act AI browser agent.
not much happened today
gpt-4o deepseek-v3-0324 gemini-2.5-pro gemini-3 claude-3.7-sonnet openai hugging-face sambanova google-cloud instruction-following image-generation content-filtering model-performance api coding model-deployment benchmarking model-release abacaj nrehiew_ sama joannejang giffmana lmarena_ai _philschmid
OpenAI announced the new GPT-4o model with enhanced instruction-following, complex problem-solving, and native image generation capabilities. The model shows improved performance in math, coding, and creativity, with features like transparent-background image generation. Discussions around content filtering and policy for image generation emphasize balancing creative freedom and harm prevention. DeepSeek V3-0324 APIs, available on Hugging Face and powered by SambaNova, outperform models like Gemini 2.0 Pro and Claude 3.7 Sonnet on benchmarks. Gemini 2.5 Pro is recommended for coding, and Gemma 3 can be deployed easily on Google Cloud Vertex AI via the new Model Garden SDK. The Gemma 3 Technical Report has been released on arXiv.
Cohere's Command A claims #3 open model spot (after DeepSeek and Gemma)
command-a mistral-ai-small-3.1 smoldocling qwen-2.5-vl cohere mistral-ai hugging-face context-windows multilinguality multimodality fine-tuning benchmarking ocr model-performance model-releases model-optimization aidangomez sophiamyang mervenoyann aidan_mclau reach_vb lateinteraction
Cohere's Command A model has solidified its position on the LMArena leaderboard, featuring an open-weight 111B parameter model with an unusually long 256K context window and competitive pricing. Mistral AI released the lightweight, multilingual, multimodal Mistral Small 3.1 model, optimized to run on a single RTX 4090 or a Mac with 32GB RAM, with strong performance on instruct and multimodal benchmarks. The new OCR model SmolDocling offers fast document reading with low VRAM usage, outperforming larger models like Qwen2.5-VL. Discussions highlight the importance of system-level improvements over raw LLM advancements, and MCBench is recommended as a superior AI benchmark for evaluating model capabilities across code, aesthetics, and awareness.
not much happened today
gemini-2.0-flash-thinking command-a qwq-32b gemma-3-27b gemma-3 shieldgemma-2 llama-3-70b deepseek-r1 o1-mini deepseek-v3 google-deepmind cohere meta-ai-fair alibaba hugging-face model-updates model-performance benchmarking reinforcement-learning transformers normalization-layers image-generation vision memory-efficiency context-windows fine-tuning yann-lecun
Google DeepMind announced updates to Gemini 2.0, including an upgraded Flash Thinking model with stronger reasoning and native image generation capabilities. Cohere launched Command A, a 111B parameter dense model with a 256K context window and competitive pricing, available on Hugging Face. Meta AI proposed Dynamic Tanh (DyT) as a replacement for normalization layers in Transformers, supported by Yann LeCun. Alibaba released QwQ-32B, a 32.5B parameter model excelling in math and coding, fine-tuned with reinforcement learning and freely available under Apache 2.0 license. Google DeepMind also released Gemma 3 models ranging from 1B to 27B parameters with a 128K token context window and over 140 language support, plus ShieldGemma 2, an image safety checker. Benchmarking shows Gemma 3 27B has strong vision and memory efficiency but is outperformed by larger models like Llama 3.3 70B and DeepSeek V3 671B. The Hugging Face LLM leaderboard history was shared by @_lewtun.
not much happened today
deepseek-r1 gemma-3 gemma-3-27b openai nvidia deepseek hugging-face fp8 model-efficiency hardware-requirements quantization benchmarking model-deployment open-source sam-altman
DeepSeek R1 demonstrates significant efficiency using FP8 precision, outperforming Gemma 3 27B in benchmarks with a Chatbot Arena Elo Score of 1363 vs. 1338, requiring substantial hardware like 32 H100 GPUs and 2,560GB VRAM. OpenAI labels DeepSeek as "state-controlled" and calls for bans on "PRC-produced" models, sparking community backlash accusing OpenAI and Sam Altman of anti-competitive behavior. Discussions emphasize DeepSeek's openness and affordability compared to OpenAI, with users highlighting its local and Hugging Face deployment options. Meanwhile, Gemma 3 receives mixed community feedback on creativity and worldbuilding.
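The 1363-vs-1338 Elo gap above translates to only a modest head-to-head edge; the standard Elo expected-score formula makes that concrete (this is the generic Elo formula, not anything specific to Chatbot Arena's exact rating pipeline):

```python
# Expected score of player A against player B under the Elo model:
# E_A = 1 / (1 + 10 ** ((R_B - R_A) / 400))

def expected_score(r_a: float, r_b: float) -> float:
    """Win probability (plus half the tie rate) of A over B."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

# DeepSeek R1 (1363) vs Gemma 3 27B (1338): roughly a 53-54% edge.
p = expected_score(1363, 1338)
```

A 25-point gap is thus barely better than a coin flip per matchup, which is why small Arena differences are often within noise.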
DeepSeek's Open Source Stack
qwen-qwq-32b start character-3 gemini gemini-2.0 mercury-coder gpt-4.5 jamba-mini-1.6 gemini-2.0-flash gpt-4o-mini mistral-small-3 mistral-ocr deepseek pyspur hugging-face togethercompute hedra-labs google-deepmind deeplearningai openai ai21-labs mistral-ai fine-tuning benchmarking multimodality code-generation diffusion-models model-performance model-optimization ocr embedding-models context-windows runtime-limits _akhaliq lmarena_ai reach_vb danielhanchen _philschmid aidan_mclau vikhyatk jerryjliu0
DeepSeek's Open Source Week was summarized by PySpur, highlighting multiple interesting releases. The Qwen QwQ-32B model was fine-tuned into START, excelling in PhD-level science QA and math benchmarks. Character-3, an omnimodal AI video generation model by Hedra Labs and Together AI, enables realistic animated content creation. Google DeepMind introduced the Gemini embedding model with an 8k context window, ranking #1 on MMTEB, alongside the Gemini 2.0 Code Executor supporting Python libraries and auto-fix features. Inception Labs' Mercury Coder is a diffusion-based code generation model offering faster token processing. OpenAI released GPT-4.5, their largest model yet but with less reasoning ability than some competitors. AI21 Labs launched Jamba Mini 1.6, noted for superior output speed compared to Gemini 2.0 Flash, GPT-4o mini, and Mistral Small 3. A new dataset of 1.9M scanned pages was released for OCR benchmarking, with Mistral OCR showing competitive but not top-tier document parsing performance compared to LLM/LVM-powered methods. "Cracked engineers are all you need."
not much happened today
jamba-1.6 mistral-ocr qwq-32b o1 o3-mini instella llama-3-2-3b gemma-2-2b qwen-2-5-3b babel-9b babel-83b gpt-4o claude-3-7-sonnet ai21-labs mistral-ai alibaba openai amd anthropic hugging-face multimodality ocr multilinguality structured-output on-prem-deployment reasoning benchmarking api open-source model-training gpu-optimization prompt-engineering function-calling
AI21 Labs launched Jamba 1.6, touted as the best open model for private enterprise deployment, outperforming Cohere, Mistral, and Llama on benchmarks like Arena Hard. Mistral AI released a state-of-the-art multimodal OCR model with multilingual and structured output capabilities, available for on-prem deployment. Alibaba Qwen introduced QwQ-32B, an open-weight reasoning model with 32B parameters and cost-effective usage, showing competitive benchmark scores. OpenAI released o1 and o3-mini models with advanced API features including streaming and function calling. AMD unveiled Instella, open-source 3B parameter language models trained on AMD Instinct MI300X GPUs, competing with Llama-3.2-3B and others. Alibaba also released Babel, open multilingual LLMs performing comparably to GPT-4o. Anthropic launched Claude 3.7 Sonnet, enhancing reasoning and prompt engineering capabilities.
not much happened today
aya-vision-8b aya-vision-32b llama-3-2-90b-vision molmo-72b phi-4-mini phi-4-multimodal cogview4 wan-2-1 weights-and-biases coreweave cohereforai microsoft alibaba google llamaindex weaviate multilinguality vision multimodality image-generation video-generation model-releases benchmarking funding agentic-ai model-performance mervenoyann reach_vb jayalammar sarahookr aidangomez nickfrosst dair_ai akhaliq bobvanluijt jerryjliu0
Weights and Biases announced a $1.7 billion acquisition by CoreWeave ahead of CoreWeave's IPO. CohereForAI released the Aya Vision models (8B and 32B parameters) supporting 23 languages, outperforming larger models like Llama-3.2 90B Vision and Molmo 72B. Microsoft introduced Phi-4-Mini (3.8B parameters) and Phi-4-Multimodal models, excelling in math, coding, and multimodal benchmarks. CogView4, a 6B parameter text-to-image model with 2048x2048 resolution and Apache 2.0 license, was released. Alibaba launched Wan 2.1, an open-source video generation model with 720p output and 16 fps generation. Google announced new AI features for Pixel devices including Scam Detection and Gemini integrations. LlamaCloud reached General Availability and raised $19M Series A funding, serving over 100 Fortune 500 companies. Weaviate launched the Query Agent, the first of three Weaviate Agents.
Anthropic's $61.5B Series E
gpt-4.5 claude-3.7-sonnet deepseek-r1 anthropic openai deepseek lmsys perplexity-ai deutsche-telekom model-performance benchmarking style-control coding multi-turn funding partnerships workflow lmarena_ai teortaxestex casper_hansen_ omarsar0 aidan_mclau willdepue vikhyatk teknim1 reach_vb _aidan_clark_ cto_junior aravsrinivas
Anthropic raised a $3.5 billion Series E funding round at a $61.5 billion valuation, signaling strong financial backing for the Claude AI model. GPT-4.5 achieved #1 rank across all categories on the LMArena leaderboard, excelling in multi-turn conversations, coding, math, creative writing, and style control. DeepSeek R1 tied with GPT-4.5 for top performance on hard prompts with style control. Discussions highlighted comparisons between GPT-4.5 and Claude 3.7 Sonnet in coding and workflow applications. The importance of the LMSYS benchmark was emphasized, though some questioned the relevance of benchmarks versus user acquisition. Additionally, Perplexity AI partnered with Deutsche Telekom to integrate the Perplexity Assistant into a new AI phone.
AI Engineer Summit Day 1
grok-3 o3-mini deepseek-r1 qwen-2.5-vl openai anthropic xai togethercompute alibaba sakana-ai benchmarking model-performance cuda model-training open-source debugging inference-speed batch-size reinforcement-learning aidan_mclau giffmana nrehiew_ teortaxestex epochairesearch andrew_n_carr borismpower yuhu_ai_
The AIE Summit in NYC highlighted key talks including Grace Isford's Trends Keynote, Neo4j/Pfizer's presentation, and OpenAI's first definition of Agents. Speakers collectively announced $930 million in funding. On AI Twitter, discussion focused on the Grok-3 and o3-mini models, with debates on performance and benchmarking, including Grok-3's record compute scale of 4e26 to 5e26 FLOP. The o3-mini model uncovered a critical CUDA kernel bug in Sakana AI's code. DeepSeek-R1 was promoted as an open-source alternative with notable training batch sizes. Additionally, Alibaba announced the release of the Qwen 2.5-VL model.
not much happened today
grok-3 deepseek-r1 siglip-2 o3-mini-high r1-1776 llamba-1b llamba-3b llamba-8b llama-3 alphamaze audiobox-aesthetics xai nvidia google-deepmind anthropic openai bytedance ollama meta-ai-fair benchmarking model-releases performance reasoning multimodality semantic-understanding ocr multilinguality model-distillation recurrent-neural-networks visual-reasoning audio-processing scaling01 iscienceluvr philschmid arankomatsuzaki reach_vb mervenoyann wightmanr lmarena_ai ollama akhaliq
Grok-3, a new family of LLMs from xAI trained on 200,000 Nvidia H100 GPUs for advanced reasoning, outperforms models from Google, Anthropic, and OpenAI on math, science, and coding benchmarks. DeepSeek-R1 achieves top accuracy on SuperGPQA, a challenging new dataset from ByteDance Research. SigLIP 2 from Google DeepMind improves semantic understanding and OCR with flexible resolutions and multilingual capabilities, available on Hugging Face. OpenAI's o3-mini-high ranks #1 in coding and math prompts. Perplexity's R1 1776, a post-trained version of DeepSeek R1, is available on Ollama. The Llamba family distills Llama-3.x into efficient recurrent models with higher throughput. AlphaMaze combines DeepSeek R1 with GRPO for visual reasoning on ARC-AGI puzzles. Audiobox Aesthetics from Meta AI offers unified quality assessment for audio. The community notes that Grok 3's compute increase yields only modest performance gains.
X.ai Grok 3 and Mira Murati's Thinking Machines
grok-3 grok-3-mini gemini-2-pro gpt-4o o3-mini-high o1 deepseek-r1 anthropic openai thinking-machines benchmarking reasoning reinforcement-learning coding multimodality safety alignment research-publishing model-performance creative-ai mira-murati lmarena_ai karpathy omarsar0 ibab arankomatsuzaki iscienceluvr scaling01
Grok 3 has launched with mixed opinions but strong benchmark performance, notably outperforming models like Gemini 2 Pro and GPT-4o. The Grok-3 mini variant shows competitive and sometimes superior capabilities, especially in reasoning and coding, with reinforcement learning playing a key role. Mira Murati has publicly shared her post-OpenAI plan, founding the frontier lab Thinking Machines, focusing on collaborative, personalizable AI, multimodality, and empirical safety and alignment research, reminiscent of Anthropic's approach.
LLaDA: Large Language Diffusion Models
llada-8b llama-3-8b step-video-t2v-30b step-audio-chat-132b llama-2-7b stepfun-ai scale-ai cambridge llamaindex diffusion-models text-generation multimodality video-generation voice-processing benchmarking instruction-following model-scaling gpu-usage long-context multi-turn-dialogue arankomatsuzaki _akhaliq omarsar0 iscienceluvr gallabytes maximelabonne reach_vb
LLaDA (Large Language Diffusion Model) 8B is a breakthrough diffusion-based language model that rivals LLaMA 3 8B while training on 7x fewer tokens (2 trillion tokens) and using 0.13 million H800 GPU hours. It introduces a novel text generation approach that predicts uniformly masked tokens in a diffusion process, enabling multi-turn dialogue and instruction-following. Alongside, StepFun AI released two major models: Step-Video-T2V 30B, a text-to-video model generating up to 204 frames with high coherence and motion quality, and Step-Audio-Chat 132B, a voice-to-voice model. Additionally, challenging new multimodal benchmarks, such as Scale AI's EnigmaEval and Cambridge's ZeroBench, see current frontier models scoring zero, underscoring their difficulty. The community also noted the return of diffusion models in language modeling, a previously speculative architecture now scaled successfully.
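The masked-diffusion training step described above can be sketched in a few lines: sample a masking ratio t uniformly, corrupt each token independently with probability t, and train the model to recover the masked positions (with the loss reweighted by 1/t in the paper). This toy sketch only shows the corruption step; the mask token and variable names are illustrative.

```python
import random

MASK = "<M>"

def mask_tokens(tokens, t, rng):
    """LLaDA-style forward (corruption) process, sketched: each token is
    independently replaced by a mask with probability t, where
    t ~ Uniform(0, 1) per training example. The model's job is to
    predict the original tokens at the masked positions."""
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
t = rng.random()  # masking ratio drawn for this example
noisy = mask_tokens(tokens, t, rng)

# Training targets: (position, original token) for every masked slot.
targets = [(i, tok) for i, (tok, m) in enumerate(zip(tokens, noisy)) if m == MASK]
print(t, noisy, targets)
```

At t near 1 this reduces to generating the whole sequence from scratch; at t near 0 it resembles light infilling, which is what lets one objective cover both generation and editing.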
not much happened today
chatgpt-4o deepseek-r1 o3 o3-mini gemini-2-flash qwen-2.5 qwen-0.5b hugging-face openai perplexity-ai deepseek-ai gemini qwen metr_evals reasoning benchmarking model-performance prompt-engineering model-optimization model-deployment small-language-models mobile-ai ai-agents speed-optimization _akhaliq aravsrinivas lmarena_ai omarsar0 risingsayak
Smolagents library by Hugging Face continues trending. The latest ChatGPT-4o version (chatgpt-4o-latest-20250129) was released. DeepSeek R1 671B sets a speed record at 198 t/s, the fastest reasoning model, recommended with specific prompt settings. Perplexity Deep Research outperforms models like Gemini Thinking, o3-mini, and DeepSeek-R1 on the Humanity's Last Exam benchmark with a 21.1% score, and reaches 93.9% accuracy on SimpleQA. ChatGPT-4o ranks #1 on the Arena leaderboard in every category except math. OpenAI's o3 model powers the Deep Research tool for ChatGPT Pro users. Gemini 2 Flash and Qwen 2.5 models support the LLMGrading verifier. Qwen 2.5 models were added to the PocketPal app. MLX shows small LLMs like Qwen 0.5B generating tokens at high speed on the M4 Max and iPhone 16 Pro. Gemini Flash 2.0 leads a new AI agent leaderboard. DeepSeek R1 is the most-liked model on Hugging Face with over 10 million downloads, among other small news items.
gpt-4.5 gpt-5 deepseek-r1-distilled-qwen-1.5b o1-preview modernbert-0.3b qwen-0.5b o3 openai ollama mistral perplexity cerebras alibaba groq bytedance math benchmarking fine-tuning model-performance reinforcement-learning model-architecture partnerships funding jeremyphoward arankomatsuzaki sama nrehiew_ danhendrycks akhaliq
OpenAI announced plans for GPT-4.5 (Orion) and GPT-5, with GPT-5 integrating the o3 model and offering unlimited chat access in the free tier. DeepSeek R1 Distilled Qwen 1.5B outperforms OpenAI's o1-preview on math benchmarks, while ModernBERT 0.3b surpasses Qwen 0.5b at MMLU without fine-tuning. Mistral and Perplexity adopt Cerebras hardware for 10x performance gains. OpenAI's o3 model won a gold medal at the 2024 International Olympiad in Informatics. Partnerships include Qwen with Groq. Significant RLHF activity is noted in Nigeria and the global south, and Bytedance is expected to rise in AI prominence soon. "GPT5 is all you need."
not much happened today
zonos-v0.1 audiobox-aesthetics moshi sonar llama-3-70b gpt-4o-mini claude-3.5-haiku gpt-4o claude-3.5-sonnet deepseek-r1-distilled-qwen-1.5b reasonflux-32b o1-preview zyphra-ai meta-ai-fair kyutai-labs perplexity-ai cerebras uc-berkeley brilliant-labs google-deepmind text-to-speech speech-to-speech benchmarking model-performance reinforcement-learning math real-time-processing open-source cross-platform-integration multilinguality zero-shot-learning danhendrycks
Zyphra AI launched Zonos-v0.1, a leading open-weight text-to-speech model supporting multiple languages and zero-shot voice cloning. Meta FAIR released the open-source Audiobox Aesthetics model trained on 562 hours of audio data. Kyutai Labs introduced Moshi, a real-time speech-to-speech system with low latency. Perplexity AI announced the Sonar model based on Llama 3.3 70b, outperforming top models like GPT-4o and Claude 3.5 Sonnet with 1200 tokens/second speed, powered by Cerebras infrastructure. UC Berkeley open-sourced a 1.5B model trained with reinforcement learning that beats o1-preview on math tasks. ReasonFlux-32B achieved 91.2% on the MATH benchmark, outperforming OpenAI o1-preview. CrossPoster, an AI agent for cross-platform posting, was released using LlamaIndex workflows. Brilliant Labs integrated the Google DeepMind Gemini Live API into smart glasses for real-time translation and object identification.
not much happened today
gemini-2.0-flash-thinking-experimental-1-21 zonos openr1-math-220k huginn-3.5b deepseek-r1 o1 claude google zyphraai hugging-face anthropic deepseek openai vision multilingual-models text-to-speech voice-cloning math reasoning latent-reasoning chain-of-thought dataset-release fine-tuning model-training model-performance context-windows benchmarking jeremyphoward andrej-karpathy tom-goldstein reach_vb iscienceluvr
Google released Gemini 2.0 Flash Thinking Experimental 1-21, a vision-language reasoning model with a 1 million-token context window and improved accuracy on science, math, and multimedia benchmarks, surpassing DeepSeek-R1 but trailing OpenAI's o1. ZyphraAI launched Zonos, a multilingual Text-to-Speech model with instant voice cloning and controls for speaking rate, pitch, and emotions, running at ~2x real-time speed on RTX 4090. Hugging Face released OpenR1-Math-220k, a large-scale math reasoning dataset with 220K problems and 800K reasoning traces generated on 512 H100 GPUs. Tom Goldstein introduced Huginn-3.5B, an open-source latent reasoning model trained on 800B tokens that outperforms larger models on reasoning tasks like GSM8K. Discussions by Jeremy Howard and iScienceLuvr highlight advances in implicit latent reasoning and debate the future of human-readable reasoning traces. Anthropic launched the Anthropic Economic Index to analyze AI's economic impact using millions of Claude conversations.
not much happened today
deepseek-r1 alphageometry-2 claude deepseek openai google-deepmind anthropic langchain adyen open-source reasoning agentic-ai javascript model-release memes ai-development benchmarking akhaliq lmthang aymericroucher vikhyatk swyx
DeepSeek-R1 surpasses OpenAI in GitHub stars, marking a milestone in open-source AI with rapid growth in community interest. AlphaGeometry2 achieves gold-medalist level performance with an 84% solving rate on IMO geometry problems, showcasing significant advancements in AI reasoning. LangChain releases a tutorial for building AI agents in JavaScript, enhancing developer capabilities in agent deployment. Reflections on Anthropic's Claude model reveal early access and influence on AI development timelines. Lighthearted AI humor includes calls to ban second-order optimizers and challenges in web development longevity. The AI Engineer Summit 2025 workshops were announced, continuing community engagement and education.
OpenAI takes on Gemini's Deep Research
o3 o3-mini-high o3-deep-research-mini openai google-deepmind nyu uc-berkeley hku reinforcement-learning benchmarking inference-speed model-performance reasoning test-time-scaling agent-design sama danhendrycks ethan-mollick dan-shipper
OpenAI released Deep Research, an agent powered by the full o3 model, showing significant improvements on the HLE benchmark and achieving SOTA results on GAIA. The release includes an "inference time scaling" chart demonstrating rigorous research, though some criticism arose over reporting results on public test sets. The agent is described as "extremely simple" and is currently limited to 100 queries/month, with plans for a higher-rate version. Reception has been mostly positive, with some skepticism. Additionally, advances in reinforcement learning were highlighted, including a simple test-time scaling technique called budget forcing that improved reasoning on math competitions by 27%. Researchers from Google DeepMind, NYU, UC Berkeley, and HKU contributed to these findings. The original Gemini Deep Research team will participate in the upcoming AI Engineer NYC event.
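Budget forcing is a decoding-time trick: if the model tries to close its reasoning trace before a minimum thinking budget, the stop token is suppressed and a continuation cue like "Wait" is appended; past a maximum budget, the end token is forced. A minimal sketch with a stubbed decoding step (`step_fn` is a hypothetical stand-in, not a real API):

```python
END_THINK = "</think>"

def budget_forced_decode(step_fn, min_steps, max_steps):
    """Budget forcing, sketched after the 'simple test-time scaling'
    recipe: suppress an early end-of-thinking token by appending
    'Wait', and cap total thinking at max_steps."""
    trace = []
    steps = 0
    while steps < max_steps:
        tok = step_fn(trace)   # one decoding step of the reasoning model
        steps += 1
        if tok == END_THINK:
            if steps < min_steps:
                trace.append("Wait")  # force the model to keep thinking
                continue
            break
        trace.append(tok)
    trace.append(END_THINK)
    return trace

# Toy model that tries to stop thinking after two steps.
script = iter(["step1", "step2", END_THINK, "step3", END_THINK])
trace = budget_forced_decode(lambda _: next(script), min_steps=4, max_steps=10)
print(trace)  # ['step1', 'step2', 'Wait', 'step3', '</think>']
```

The appeal is that this needs no retraining: a single string-level intervention trades extra inference compute for better math accuracy.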
o3-mini launches, OpenAI on "wrong side of history"
o3-mini o1 gpt-4o mistral-small-3-24b deepseek-r1 openai mistral-ai deepseek togethercompute fireworksai_hq ai-gradio replicate reasoning safety cost-efficiency model-performance benchmarking api open-weight-models model-releases sam-altman
OpenAI released o3-mini, a new reasoning model available for free and paid users with a "high" reasoning effort option that outperforms the earlier o1 model on STEM tasks and safety benchmarks, costing 93% less per token. Sam Altman acknowledged a shift in open source strategy and credited DeepSeek R1 for influencing assumptions. MistralAI launched Mistral Small 3 (24B), an open-weight model with competitive performance and low API costs. DeepSeek R1 is supported by Text-generation-inference v3.1.0 and available via ai-gradio and replicate. The news highlights advancements in reasoning, cost-efficiency, and safety in AI models.
OpenAI launches Operator, its first Agent
operator deepseek-r1 videollama-3 llama-4 o1 claude openai anthropic deepseek-ai google-deepmind perplexity-ai computer-using-agent reasoning multimodality performance-benchmarks open-source ai-safety benchmarking video-generation model-evaluation sam-altman swyx
OpenAI launched Operator, a premium computer-using agent for web tasks like booking and ordering, available now for Pro users in the US with an API promised. It supports long-horizon remote VM sessions of up to 20 minutes and video export, showing state-of-the-art agent performance though not yet human-level. Anthropic had launched a similar agent three months earlier as an open-source demo. DeepSeek AI unveiled DeepSeek R1, an open-source reasoning model excelling on the Humanity's Last Exam dataset, outperforming models like LLaMA 4 and OpenAI's o1. Alibaba's DAMO Academy open-sourced VideoLLaMA 3, a multimodal foundation model for image and video understanding. Perplexity AI released Perplexity Assistant for Android with reasoning and search capabilities. The Humanity's Last Exam dataset contains 3,000 questions testing AI reasoning, with current models scoring below 10% accuracy, indicating substantial room for improvement. OpenAI's Computer-Using Agent (CUA) shows improved performance on OSWorld and WebArena benchmarks but still lags behind humans. Anthropic AI introduced Citations for safer AI responses. Sam Altman and Swyx commented on Operator's launch and capabilities.
not much happened today
qwen-o1 qvq claude-3.5-sonnet gpt-4o o3 o3-mini alibaba openai mit idsia llamaindex ollama vision benchmarking llm-calibration intentionality alignment-faking deliberative-alignment artificial-life gdpr-compliance contract-review-agent app-creation synthetic-data post-transformers smol-models agents bret-taylor
The Qwen team launched QVQ, a vision-enabled version of their experimental QwQ o1 clone, benchmarking comparably to Claude 3.5 Sonnet. Discussions include Bret Taylor's insights on autonomous software development distinct from the Copilot era. The Latent Space LIVE! talks cover highlights of 2024 AI startups, vision, open models, post-transformers, synthetic data, smol models, and agents. Twitter recaps by Claude 3.5 Sonnet highlight proposals for benchmarks measuring LLM calibration and falsehood confidence, with QVQ outperforming GPT-4o and Claude Sonnet 3.5. AI alignment debates focus on intentionality and critiques of alignment faking in models like Claude. Updates from OpenAI include new o3 and o3-mini models and a deliberative alignment strategy. The ASAL project is a collaboration between MIT, OpenAI, and Swiss AI Lab IDSIA to automate artificial life discovery. Personal stories reveal frustrations with USCIS green card denials despite high qualifications. New tools like GeminiCoder enable rapid app creation, and a contract review agent using Reflex and Llama Index checks GDPR compliance. Holiday greetings and memes were also shared.
not much happened this weekend
o3 o1 opus sonnet octave openai langchain hume x-ai amd nvidia meta-ai-fair hugging-face inference-time-scaling model-ensembles small-models voice-cloning fine-math-dataset llm-agent-framework benchmarking software-stack large-concept-models latent-space-reasoning mechanistic-interpretability planning speech-language-models lisa-su clementdelangue philschmid neelnanda5
o3 model gains significant attention with discussions around its capabilities and implications, including an OpenAI board member referencing "AGI." LangChain released their State of AI 2024 survey. Hume announced OCTAVE, a 3B parameter API-only speech-language model with voice cloning. x.ai secured a $6B Series C funding round. Discussions highlight inference-time scaling, model ensembles, and the surprising generalization ability of small models. New tools and datasets include FineMath, the best open math dataset on Hugging Face, and frameworks for LLM agents. Industry updates cover a 5-month benchmarking of AMD MI300X vs Nvidia H100 + H200, insights from a meeting with Lisa Su on AMD's software stack, and open AI engineering roles. Research innovations include Large Concept Models (LCM) from Meta AI, Chain of Continuous Thought (Coconut) for latent space reasoning, and mechanistic interpretability initiatives.
o3 solves AIME, GPQA, Codeforces, makes 11 years of progress in ARC-AGI and 25% in FrontierMath
o3 o3-mini o1-mini gpt-3 gpt-4o o1 openai benchmarking math reasoning model-performance inference-speed cost-efficiency alignment safety-testing sama eric-wallace
OpenAI announced the o3 and o3-mini models with groundbreaking benchmark results, including a jump from 2% to 25% on the FrontierMath benchmark and 87.5% on the ARC-AGI reasoning benchmark, representing about 11 years of progress on the GPT-3 to GPT-4o scaling curve. The o3-mini model shows superior inference efficiency compared to full o3, promising significant cost reductions on coding tasks. The announcement was accompanied by community discussions, safety-testing applications, and detailed analyses. Sama highlighted the unusual cost-performance tradeoff, and Eric Wallace shared insights on the o-series deliberative alignment strategy.
Meta Apollo - Video Understanding up to 1 hour, SOTA Open Weights
apollo-1b apollo-3b apollo-7b veo-2 imagen-3 llama-3-70b llama-3b command-r7b llama-1b llama-8b chatgpt meta-ai-fair hugging-face google-deepmind openai figure-ai klarna cohere notion video-understanding scaling-consistency benchmarking temporal-ocr egocentric-perception spatial-perception reasoning video-generation physics-simulation voice-features map-integration language-expansion test-time-compute-scaling humanoid-robots ai-integration search-optimization self-recognition self-preference-bias akhaliq _lewtun clementdelangue adcock_brett rohanpaul_ai swyx shaneguML
Meta released Apollo, a new family of state-of-the-art video-language models available in 1B, 3B, and 7B sizes, featuring "Scaling Consistency" for efficient scaling and introducing ApolloBench, which speeds up video understanding evaluation by 41× across five temporal perception categories. Google Deepmind launched Veo 2, a 4K video generation model with improved physics and camera control, alongside an enhanced Imagen 3 image model. OpenAI globally rolled out ChatGPT search with advanced voice and map features and discussed a potential $2,000/month "ChatGPT Max" tier. Research highlights include achieving Llama 70B performance using Llama 3B via test-time compute scaling and expanding Command R7B language support from 10 to 23 languages. Industry updates feature Figure AI delivering humanoid robots commercially and Klarna reducing workforce through AI. Notion integrated Cohere Rerank for better search. Studies reveal LLMs can recognize their own writing style and show self-preference bias. Discussions note video processing progress outpacing text due to better signal-per-compute and data evaluation.
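The "Llama 70B performance from Llama 3B via test-time compute scaling" result rests on sampling many candidate answers and selecting with a verifier. A minimal best-of-N sketch, assuming stand-in `generate` and `score` functions for the policy model and the verifier (both hypothetical stubs here):

```python
def best_of_n(generate, score, prompt, n=8):
    """Test-time compute scaling via best-of-N, sketched: sample N
    candidates from a small model and keep the one the verifier
    scores highest. More samples buy more accuracy at inference time."""
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=score)

# Toy demo: the "model" proposes numbers; the "verifier" prefers
# answers closer to the true value 42.
gen = lambda prompt, seed: 40 + seed
verifier = lambda ans: -abs(ans - 42)
ans = best_of_n(gen, verifier, "What is 6*7?", n=8)
print(ans)  # 42
```

The design point is the compute trade: a 3B model sampled N times can rival a 70B model sampled once, provided the verifier is good at ranking candidates.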
Meta BLT: Tokenizer-free, Byte-level LLM
byte-latent-transformer llama-3 phi-4 gpt-4o command-r7b meta-ai-fair llamaindex microsoft deepseek-ai openai cohere anthropic tokenization transformer-architecture model-efficiency benchmarking multimodality vision reinforcement-learning model-scaling jailbreaking model-optimization
Meta AI introduces the Byte Latent Transformer (BLT), a tokenizer-free architecture that dynamically forms byte patches for efficient compute allocation, outperforming Llama 3 on benchmarks including the CUTE benchmark. The model was trained on approximately 1 trillion tokens and features a three-block transformer design with local and global components. This approach challenges traditional tokenization and may enable new multimodal capabilities such as direct file interaction without retrieval-augmented generation. Additionally, Microsoft announced the Phi-4 14B parameter model achieving state-of-the-art results on STEM and reasoning benchmarks, surpassing GPT-4o. DeepSeek AI launched new vision-language models based on their MoE architecture with sizes ranging from 1.0B to 27B parameters. OpenAI released a new Projects feature for ChatGPT, and Cohere introduced their smallest and fastest Command R7B model. Anthropic published research on "Best-of-N Jailbreaking" vulnerabilities across text, vision, and audio models. Industry discussion highlights a trend of decreasing frontier LLM sizes, with GPT-4 at approximately 1.8 trillion parameters compared to newer models.
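BLT's dynamic byte patching can be sketched independently of the transformer itself: a small byte-level model estimates next-byte entropy, and a new patch begins wherever entropy spikes, so compute concentrates on hard-to-predict regions. In this toy sketch the per-byte entropies are hand-supplied stand-ins for that model's outputs, and the threshold is illustrative:

```python
def entropy_patches(data: bytes, entropies, threshold: float):
    """BLT-style dynamic patching, sketched: start a new byte patch
    whenever the small byte LM's next-byte entropy exceeds a
    threshold. Predictable runs become long cheap patches; surprising
    bytes open fresh patches that get more compute."""
    patches, start = [], 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

data = b"hello, world"
ents = [0.1] * len(data)  # pretend most bytes are easy to predict...
ents[5] = 2.0             # ...but the LM is surprised at ','
ents[7] = 1.8             # ...and at 'w'
result = entropy_patches(data, ents, threshold=1.0)
print(result)  # [b'hello', b', ', b'world']
```

Unlike a fixed BPE vocabulary, the patch boundaries here are data-dependent, which is what makes the architecture tokenizer-free.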
Google wakes up: Gemini 2.0 et al
gemini-2.0-flash gemini-1.5-pro gemini-exp-1206 claude-3.5-sonnet opus google-deepmind openai apple multimodality agent-development multilinguality benchmarking model-releases demis-hassabis sundar-pichai paige-bailey bindureddy
Google DeepMind launched Gemini 2.0 Flash, a new multimodal model outperforming Gemini 1.5 Pro and o1-preview, featuring vision and voice APIs, multilingual capabilities, and native tool use. It powers new AI agents like Project Astra and Project Mariner, with Project Mariner achieving state-of-the-art 83.5% on the WebVoyager benchmark. OpenAI announced ChatGPT integration with Apple devices, enabling Siri access and visual intelligence features. Claude 3.5 Sonnet is noted as a distilled version of Opus. The AI community's response at NeurIPS 2024 has been overwhelmingly positive, signaling a strong comeback for Google in AI innovation. Key topics include multimodality, agent development, multilinguality, benchmarking, and model releases.
$200 ChatGPT Pro and o1-full/pro, with vision, without API, and mixed reviews
o1 o1-pro claude-3.5-sonnet pali-gemma-2 openai google llamaindex multimodality vision fine-tuning benchmarking model-performance image-generation document-processing model-release sama bindureddy mervenoyann fchollet
OpenAI launched the o1 model with multimodal capabilities, faster reasoning, and image input support, marking it as a state-of-the-art model despite some bugs and mixed community reviews. The new o1-pro tier offers unlimited access for $200/month with notable benchmark improvements but some performance trade-offs compared to claude-3.5-sonnet. Google released the PaliGemma 2 vision-language model family in sizes 3B, 10B, and 28B, excelling in visual question answering, image segmentation, and OCR, with day-0 support for fine-tuning. LlamaIndex announced discounts and feature updates for large-scale document processing. The AI community also reacted humorously to the new pricing tiers and model comparisons. "o1 can see now, which makes it the SOTA multimodal model" and "most users will be best served by free/Plus tiers" were notable sentiments.
Olympus has dropped (aka, Amazon Nova Micro|Lite|Pro|Premier|Canvas|Reel)
amazon-nova claude-3 llama-3-70b gemini-1.5-flash gpt-4o amazon anthropic google-deepmind sakana-ai-labs multimodality benchmarking model-merging model-performance model-architecture model-optimization population-based-learning philschmid bindureddy
Amazon announced the Amazon Nova family of multimodal foundation models at AWS Re:Invent, available immediately with no waitlist in configurations like Micro, Lite, Pro, Canvas, and Reel, with Premier and speech-to-speech coming next year. These models offer 2-4x faster token speeds and are 25%-400% cheaper than competitors like Anthropic Claude models, positioning Nova as a serious contender in AI engineering. Pricing undercuts models such as Google DeepMind Gemini Flash 8B, and some Nova models extend context length up to 300k tokens. However, benchmarking controversy exists as some evaluations show Nova scoring below Llama-3 70B in LiveBench AI metrics. Separately, CycleQD was introduced by Sakana AI Labs, using evolutionary computation for population-based model merging to develop niche LLM agents.
not much happened to end the week
gemini deepseek-r1 o1 chatgpt gpt-4 claude-3.5-sonnet o1-preview o1-mini gpt4o qwq-32b google-deepmind deeplearningai amazon tesla x-ai alibaba ollama multimodality benchmarking quantization reinforcement-learning ai-safety translation reasoning interpretability model-comparison humor yoshua-bengio kevinweil ylecun
AI News for 11/29/2024-11/30/2024 covers key updates including the Gemini multimodal model advancing in musical structure understanding, a new quantized SWE-Bench for benchmarking at 1.3 bits per task, and the launch of the DeepSeek-R1 model focusing on transparent reasoning as an alternative to o1. The establishment of the 1st International Network of AI Safety Institutes highlights global collaboration on AI safety. Industry updates feature Amazon's Olympus AI model, Tesla's Optimus, and experiments with ChatGPT as a universal translator. Community reflections emphasize the impact of large language models on daily life and medical AI applications. Discussions include scaling sparse autoencoders to gpt-4 and the need for transparency in reasoning LLMs. The report also notes humor around ChatGPT's French nickname.
Qwen with Questions: 32B open weights reasoning model nears o1 in GPQA/AIME/Math500
deepseek-r1 qwq gpt-4o claude-3.5-sonnet qwen-2.5 llama-cpp deepseek sambanova hugging-face dair-ai model-releases benchmarking fine-tuning sequential-search inference model-deployment agentic-rag external-tools multi-modal-models justin-lin clementdelangue ggerganov vikparuchuri
DeepSeek R1 leads the race for "open o1" models but has yet to release weights, while Justin Lin released QwQ, a 32B open-weight model that outperforms GPT-4o and Claude 3.5 Sonnet on benchmarks. QwQ appears to be a fine-tuned version of Qwen 2.5, emphasizing sequential search and reflection for complex problem-solving. SambaNova promotes its RDUs as superior to GPUs for inference tasks, highlighting the shift from training to inference in AI systems. On Twitter, Hugging Face announced CPU deployment for llama.cpp instances, Marker v1 was released as a faster and more accurate document-conversion tool, and Agentic RAG developments focus on integrating external tools and advanced LLM chains for improved response accuracy. The open-source AI community sees growing momentum with models like Flux gaining popularity, reflecting a shift towards multi-modal AI models spanning image, video, audio, and biology.
DeepSeek-R1 claims to beat o1-preview AND will be open sourced
deepseek-r1-lite-preview o1-preview hopper blackwell alphaqubit deepseek nvidia google-deepmind reasoning benchmarking quantum-error-correction quantum-computing model-performance model-release yann-lecun
DeepSeek has released DeepSeek-R1-Lite-Preview, a reasoning model with weights slated for open-source release, achieving o1-preview-level performance on math benchmarks with transparent thought processes, showing promise in real-time problem-solving. NVIDIA reported a record $35.1 billion revenue in Q3 with 112% year-on-year data center growth, driven by Hopper and Blackwell architectures, the latter offering 2.2x performance improvement. Google DeepMind introduced AlphaQubit, a quantum computing system improving error correction and outperforming leading decoders, though challenges remain in scaling and speed. The AI community continues to focus on reasoning models, benchmarking, and quantum error correction advancements.
Gemini (Experimental-1114) retakes #1 LLM rank with 1344 Elo
claude-3-sonnet gpt-4 gemini-1.5 claude-3.5-sonnet anthropic openai langchain meta-ai-fair benchmarking prompt-engineering rag visuotactile-perception ai-governance theoretical-alignment ethical-alignment jailbreak-robustness model-releases alignment richardmcngo andrewyng philschmid
Anthropic published a jailbreak-robustness benchmark for Claude 3.5 Sonnet, emphasizing adaptive defenses. OpenAI enhanced GPT-4 with a new RAG technique for contiguous chunk retrieval. LangChain launched Promptim for prompt optimization. Meta AI introduced NeuralFeels with neural fields for visuotactile perception. RichardMCNgo resigned from OpenAI, highlighting concerns on AI governance and theoretical alignment. Discussions emphasized the importance of truthful public information and ethical alignment in AI deployment. The latest Gemini update marks a new #1 LLM amid alignment challenges. The AI community continues to focus on benchmarking, prompt engineering, and alignment issues.
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
o1 claude-3.5-haiku gpt-4o epoch-ai openai microsoft anthropic x-ai langchainai benchmarking math moravecs-paradox mixture-of-experts chain-of-thought agent-framework financial-metrics-api pdf-processing few-shot-learning code-generation karpathy philschmid adcock_brett dylan522p
Epoch AI collaborated with over 60 leading mathematicians to create the FrontierMath benchmark, a fresh set of hundreds of original math problems with easy-to-verify answers, aiming to challenge current AI models. The benchmark reveals that all tested models, including o1, perform poorly, highlighting the difficulty of complex problem-solving and Moravec's paradox in AI. Key AI developments include the introduction of Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture reducing computational costs, and improvements in Chain-of-Thought (CoT) prompting through incorrect reasoning and explanations. Industry news covers OpenAI acquiring the chat.com domain, Microsoft launching the Magentic-One agent framework, Anthropic releasing Claude 3.5 Haiku outperforming GPT-4o on some benchmarks, and xAI securing 150MW grid power with support from Elon Musk and Trump. LangChain AI introduced new tools including a Financial Metrics API, Document GPT with PDF upload and Q&A, and LangPost AI agent for LinkedIn posts. xAI also demonstrated the Grok Engineer compatible with OpenAI and Anthropic APIs for code generation.
The AI Search Wars Have Begun — SearchGPT, Gemini Grounding, and more
gpt-4o o1-preview claude-3.5-sonnet universal-2 openai google gemini nyt perplexity-ai glean nvidia langchain langgraph weights-biases cohere weaviate fine-tuning synthetic-data distillation hallucinations benchmarking speech-to-text robotics neural-networks ai-agents sam-altman alexalbert__ _jasonwei svpino drjimfan virattt
ChatGPT launched its search functionality across all platforms using a fine-tuned version of GPT-4o with synthetic data generation and distillation from o1-preview. This feature includes a Chrome extension promoted by Sam Altman but has issues with hallucinations. The launch coincides with Gemini introducing Search Grounding after delays. Notably, The New York Times is not a partner due to a lawsuit against OpenAI. The AI search competition intensifies with consumer and B2B players like Perplexity and Glean. Additionally, Claude 3.5 Sonnet achieved a new benchmark record on SWE-bench Verified, and a new hallucination evaluation benchmark, SimpleQA, was introduced. Other highlights include the Universal-2 speech-to-text model with 660M parameters and HOVER, a neural whole-body controller for humanoid robots trained in NVIDIA Isaac simulation. AI hedge fund teams using LangChain and LangGraph were also showcased. The news is sponsored by the RAG++ course featuring experts from Weights & Biases, Cohere, and Weaviate.
DeepSeek Janus and Meta SpiRit-LM: Decoupled Image and Expressive Voice Omnimodality
nemotron-70b claude claude-3.5-sonnet gpt-4o deepseek meta-ai-fair wandb nvidia anthropic hugging-face perplexity-ai multimodality image-generation speech-synthesis fine-tuning model-merging benchmarking open-source model-optimization reinforcement-learning bindureddy aravsrinivas danielhanchen clementdelangue cwolferesearch
DeepSeek Janus and Meta SpiRit-LM are two notable multimodality AI models recently released, showcasing advances in image generation and speech synthesis respectively. DeepSeek Janus separates vision encoders for image understanding and generation, achieving better results in both tasks. Meta's SpiRit-LM introduces an expressive speech and writing model generating pitch and style units, improving over standard TTS. Additionally, W&B Weave offers comprehensive LLM observability and multimodality fine-tuning tools. Industry updates include Nvidia's Nemotron 70b model underperforming, Meta open-sourcing Movie Gen Bench for media generation benchmarking, Perplexity launching internal search with multi-step reasoning, and Anthropic updating Claude apps. Open source progress includes Hugging Face's gradient accumulation fix in transformers and advocacy for open source AI to prevent Big Tech dominance. "Model merging for combining skills of multiple models" is also highlighted.
Did Nvidia's Nemotron 70B train on test?
nemotron-70b llama-3.1-70b llama-3.1 ministral-3b ministral-8b gpt-4o claude-3.5-sonnet claude-3.5 nvidia mistral-ai hugging-face zep benchmarking reinforcement-learning reward-models temporal-knowledge-graphs memory-layers context-windows model-releases open-source reach_vb philschmid swyx
NVIDIA's Nemotron-70B model has drawn scrutiny despite strong benchmark performances on Arena Hard, AlpacaEval, and MT-Bench, with some standard benchmarks like GPQA and MMLU Pro showing no improvement over the base Llama-3.1-70B. The new HelpSteer2-Preference dataset improves some benchmarks with minimal losses elsewhere. Meanwhile, Mistral released Ministral 3B and 8B models featuring 128k context length and outperforming Llama-3.1 and GPT-4o on various benchmarks under the Mistral Commercial License. NVIDIA's Nemotron 70B also surpasses GPT-4o and Claude-3.5-Sonnet on key benchmarks using RLHF (REINFORCE) training. Additionally, Zep introduced Graphiti, an open-source temporal knowledge graph memory layer for AI agents, built on Neo4j.
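At its core, the RLHF (REINFORCE) recipe mentioned above scales log-probability gradients by a reward signal. A minimal toy sketch in pure Python, using a two-action bandit with a hand-coded reward standing in for a learned reward model; this illustrates the REINFORCE update only and reflects nothing of NVIDIA's actual training setup:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, reward_fn, lr=0.5, baseline=0.5):
    """One REINFORCE update: sample an action, then scale the
    log-prob gradient by the baseline-subtracted reward."""
    probs = softmax(logits)
    a = random.choices(range(len(logits)), weights=probs)[0]
    r = reward_fn(a)
    # d/d(logit_k) log pi(a) = 1[k == a] - probs[k]
    new_logits = [
        logit + lr * (r - baseline) * ((1.0 if k == a else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]
    return new_logits, a, r

random.seed(0)
logits = [0.0, 0.0]
reward = lambda a: 1.0 if a == 1 else 0.0  # stand-in "reward model" prefers action 1
for _ in range(200):
    logits, _, _ = reinforce_step(logits, reward)
probs = softmax(logits)
print(round(probs[1], 2))  # policy now strongly prefers the rewarded action
```

In the real setting the "actions" are sampled completions, the reward comes from a trained reward model (e.g. one built on HelpSteer2-Preference data), and the baseline is typically a learned or batch-average value rather than a constant.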
not much happened today
aria o1-preview o1-mini gemini-1.5-pro gemini-1.5-flash gemini-1.5 claude-3.5-sonnet rhymes-ai openai anthropic google meta-ai-fair oxylabs multimodality mixture-of-experts long-context retrieval-augmented-generation benchmarking software-engineering llm-evaluation prompt-engineering web-scraping python production-applications mervenoyann osanseviero dbrxmosaicai ylecun ofirpress clefourrier omarsar0 rohanpaul_ai svpino finbarrtimbers _philschmid
Rhymes AI released Aria, a new 25.3B parameter multimodal MoE model supporting text, code, image, and video with a 64k token context window and Apache-2.0 license. OpenAI's o1-preview and o1-mini models show consistent improvement over Anthropic and Google Gemini 1.5 Pro/Flash on long context RAG benchmarks up to 128k tokens, while Google Gemini 1.5 models excel at extreme context lengths up to 2 million tokens. Meta AI expanded rollout to 21 countries with new language support but remains unavailable in the EU. The one-year anniversary of SWE-bench benchmark for software engineering tasks was celebrated, alongside the introduction of SWE-bench Multimodal. New AI tools include OxyCopilot by Oxylabs for web scraping, Taipy for Python-based production apps, and Latitude for prompt engineering. Industry insights highlight changing AI funding dynamics and OpenAI's strategic focus on consumer products like ChatGPT. "all recaps done by Claude 3.5 Sonnet, best of 4 runs."
not much happened today
llama-3-2 llama-3 molmo meta-ai-fair google-deepmind hugging-face on-device-ai multimodality chip-design retrieval-augmented-generation rag benchmarking reliability ai-regulation free-speech pytorch-optimization demis-hassabis clementdelangue svpino awnihannun osanseviero omarsar0 sarahookr ylecun
Meta released Llama 3.2, including lightweight 1B and 3B models for on-device AI with capabilities like summarization and retrieval-augmented generation. Molmo, a new multimodal model, was introduced with a large dense captioning dataset. Google DeepMind announced AlphaChip, an AI-driven chip design method improving TPU and CPU designs. Hugging Face surpassed 1 million free public models, highlighting the value of smaller specialized models. Discussions covered challenges in scaling RAG applications, the future of on-device AI running ChatGPT-level models, reliability issues in larger LLMs, and new Elo benchmarking accepted at NeurIPS 2024. AI ethics and regulation topics included free speech responsibilities and California's SB-1047 bill potentially affecting open-source AI. "AlphaChip transformed computer chip design," and "ChatGPT-level AI on mobile devices predicted within a year."
ChatGPT Advanced Voice Mode
o1-preview qwen-2.5 llama-3 claude-3.5 openai anthropic scale-ai togethercompute kyutai-labs voice-synthesis planning multilingual-datasets retrieval-augmented-generation open-source speech-assistants enterprise-ai price-cuts benchmarking model-performance sam-altman omarsar0 bindureddy rohanpaul_ai _philschmid alexandr_wang svpino ylecun _akhaliq
OpenAI rolled out ChatGPT Advanced Voice Mode with 5 new voices and improved accent and language support, available widely in the US. Ahead of rumored updates for Llama 3 and Claude 3.5, Gemini Pro saw a significant price cut aligning with the new intelligence frontier pricing. OpenAI's o1-preview model showed promising planning task performance with 52.8% accuracy on Randomized Mystery Blocksworld. Anthropic is rumored to release a new model, generating community excitement. Qwen 2.5 was released with models up to 32B parameters and support for 128K tokens, matching GPT-4 0613 benchmarks. Research highlights include PlanBench evaluation of o1-preview, OpenAI's release of a multilingual MMMLU dataset covering 14 languages, and RAGLAB framework standardizing Retrieval-Augmented Generation research. New AI tools include PDF2Audio for converting PDFs to audio, an open-source AI starter kit for local model deployment, and Moshi, a speech-based AI assistant from Kyutai. Industry updates feature Scale AI nearing $1B ARR with 4x YoY growth and Together Compute's enterprise platform offering faster inference and cost reductions. Insights from Sam Altman's blog post were also shared.
not much happened today
o1-preview o1-mini qwen-2.5 gpt-4o deepseek-v2.5 gpt-4-turbo-2024-04-09 grin llama-3-1-405b veo kat openai qwen deepseek-ai microsoft kyutai-labs perplexity-ai together-ai meta-ai-fair google-deepmind hugging-face google anthropic benchmarking math coding instruction-following model-merging model-expressiveness moe voice voice-models generative-video competition open-source model-deployment ai-agents hyung-won-chung noam-brown bindureddy akhaliq karpathy aravsrinivas fchollet cwolferesearch philschmid labenz ylecun
OpenAI's o1-preview and o1-mini models lead benchmarks in Math, Hard Prompts, and Coding. Qwen 2.5 72B model shows strong performance close to GPT-4o. DeepSeek-V2.5 tops Chinese LLMs, rivaling GPT-4-Turbo-2024-04-09. Microsoft's GRIN MoE achieves good results with 6.6B active parameters. Moshi voice model from Kyutai Labs runs locally on Apple Silicon Macs. Perplexity app introduces voice mode with push-to-talk. LlamaCoder by Together.ai uses Llama 3.1 405B for app generation. Google DeepMind's Veo is a new generative video model for YouTube Shorts. The 2024 ARC-AGI competition increases prize money and plans a university tour. A survey on model merging covers 50+ papers for LLM alignment. The Kolmogorov–Arnold Transformer (KAT) paper proposes replacing MLP layers with KAN layers for better expressiveness. Hugging Face Hub integrates with Google Cloud Vertex AI Model Garden for easier open-source model deployment. Agent.ai is introduced as a professional network for AI agents. "Touching grass is all you need."
nothing much happened today
o1 chatgpt-4o llama-3-1-405b openai lmsys scale-ai cognition langchain qdrant reinforcement-learning model-merging embedding-models toxicity-detection image-editing dependency-management automated-code-review visual-search benchmarking denny_zhou svpino alexandr_wang cwolferesearch rohanpaul_ai _akhaliq kylebrussell
OpenAI's o1 model faces skepticism about open-source replication due to its extreme restrictions and unique training advances like RL on CoT. ChatGPT-4o shows significant performance improvements across benchmarks. Llama-3.1-405b fp8 and bf16 versions perform similarly with cost benefits for fp8. A new open-source benchmark "Humanity's Last Exam" offers $500K in prizes to challenge LLMs. Model merging benefits from neural network sparsity and linear mode connectivity. Embedding-based toxic prompt detection achieves high accuracy with low compute. InstantDrag enables fast, optimization-free drag-based image editing. LangChain v0.3 releases with improved dependency management. Automated code review tool CodeRabbit adapts to team coding styles. Visual search advances integrate multimodal data for better product search. Experts predict AI will be default software by 2030.
o1: OpenAI's new general reasoning models
o1 o1-preview o1-mini gpt-4o llama openai nvidia test-time-reasoning reasoning-tokens token-limit competitive-programming benchmarking scaling-laws ai-chip-competition inference training model-performance jason-wei jim-fan
OpenAI has released the o1 model family, including o1-preview and o1-mini, focusing on test-time reasoning with extended output token limits of over 30k tokens. The models show strong performance, ranking in the 89th percentile on competitive programming, excelling in USA Math Olympiad qualifiers, and surpassing PhD-level accuracy on physics, biology, and chemistry benchmarks. Notably, o1-mini performs impressively despite its smaller size compared to GPT-4o. The release highlights new scaling laws for test-time compute that scale log-linearly. Additionally, Nvidia is reportedly losing AI chip market share to startups, with a shift in developer preference from CUDA to llama models for web development, though Nvidia remains dominant in training. This news reflects significant advances in reasoning-focused models and shifts in AI hardware competition.
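"Log-linear test-time scaling" means accuracy grows roughly linearly in the logarithm of inference compute. A toy illustration of what fitting such a law looks like; the (compute, accuracy) points below are invented for the example and are not o1 results:

```python
import math

# Hypothetical (test-time compute, accuracy) measurements.
# Under a log-linear law these fall on a line in log-compute space.
points = [(1, 0.30), (10, 0.45), (100, 0.60), (1000, 0.75)]

xs = [math.log10(c) for c, _ in points]
ys = [a for _, a in points]
n = len(points)
mx, my = sum(xs) / n, sum(ys) / n

# Ordinary least squares in one variable.
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Under this fit, each 10x increase in compute adds `slope` accuracy.
print(round(slope, 3))  # → 0.15
```

The practical upshot of such a law is that gains keep coming from more thinking tokens, but each increment of accuracy costs an order of magnitude more compute.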
Pixtral 12B: Mistral beats Llama to Multimodality
pixtral-12b mistral-nemo-12b llama-3-1-70b llama-3-1-8b deepseek-v2.5 gpt-4-turbo llama-3-1 strawberry claude mistral-ai meta-ai-fair hugging-face arcee-ai deepseek-ai openai anthropic vision multimodality ocr benchmarking model-release model-architecture model-performance fine-tuning model-deployment reasoning code-generation api access-control reach_vb devendra_chapilot _philschmid rohanpaul_ai
Mistral AI released Pixtral 12B, an open-weights vision-language model with a Mistral Nemo 12B text backbone and a 400M vision adapter, featuring a large vocabulary of 131,072 tokens and support for 1024x1024 pixel images. This release notably beat Meta AI in launching an open multimodal model. At the Mistral AI Summit, architecture details and benchmark performances were shared, showing strong OCR and screen understanding capabilities. Additionally, Arcee AI announced SuperNova, a distilled Llama 3.1 70B & 8B model outperforming Meta's Llama 3.1 70B instruct on benchmarks. DeepSeek released DeepSeek-V2.5, scoring 89 on HumanEval, surpassing GPT-4-Turbo, Opus, and Llama 3.1 in coding tasks. OpenAI plans to release Strawberry as part of ChatGPT soon, though its capabilities are debated. Anthropic introduced Workspaces for managing multiple Claude deployments with enhanced access controls.
not much happened today + AINews Podcast?
superforecaster-ai llama-3 reflection-70b glean sambanova cerebras stanford google apple hugging-face lmsys prompt-engineering research-ideas inference-speed retrieval-augmented-generation evaluation-methods visual-intelligence on-device-ai model-performance benchmarking novelty-detection danhendrycks benjamin-clavie bclavie bindureddy swyx borismpower corbtt drjimfan clementdelangue rohanpaul_ai
Glean doubled its valuation again. Dan Hendrycks' Superforecaster AI generates plausible election forecasts with interesting prompt engineering. A Stanford study found that LLM-generated research ideas are statistically more novel than those by expert humans. SambaNova announced faster inference for llama-3 models, surpassing Cerebras. Benjamin Clavie gave a notable talk on retrieval-augmented generation techniques. Strawberry is reported to launch in two weeks. Google Illuminate offers AI-generated podcast discussions about papers and books. Apple unveiled new AI features in iOS 18, including visual intelligence and improved Siri, with on-device and cloud processing for camera-based event additions. The Reflection 70B model sparked controversy over performance claims. Experts highlighted the unreliability of traditional benchmarks like MMLU and HumanEval, recommending alternative evaluation methods such as LMSys Chatbot Arena and Hugging Face's open-sourced Lighteval suite. The AI research community continues to explore AI's role in generating novel research ideas and improving benchmarking.
AIPhone 16: the Visual Intelligence Phone
reflection-70b llama-3-70b qwen-2-72b llama-3-1-405b claude gpt-4 gemini apple openai weights-biases vision video-understanding benchmarking planning model-evaluation privacy ai-integration instruction-following yann-lecun
Apple announced the new iPhone 16 lineup featuring Visual Intelligence, a new AI capability integrated with Camera Control, Apple Maps, and Siri, emphasizing privacy and default service use over third-party AI like OpenAI. Apple Photos now includes advanced video understanding with timestamp recognition. Meanwhile, Reflection-70B claims to be a top open-source model but benchmarks show it performs close to Llama 3 70B and slightly worse than Qwen 2 72B. Yann LeCun highlighted ongoing challenges with LLM planning abilities, noting models like Llama-3.1-405b and Claude show some skill, while GPT-4 and Gemini lag behind. Weights & Biases is sponsoring an event to advance LLM evaluation techniques with prizes and API access.
Cerebras Inference: Faster, Better, AND Cheaper
llama-3.1-8b llama-3.1-70b gemini-1.5-flash gemini-1.5-pro cogvideox-5b mamba-2 rene-1.3b llama-3.1 gemini-1.5 claude groq cerebras cursor google-deepmind anthropic inference-speed wafer-scale-chips prompt-caching model-merging benchmarking open-source-models code-editing model-optimization jeremyphoward sam-altman nat-friedman daniel-gross swyx
Groq led early 2024 with superfast LLM inference speeds, achieving ~450 tokens/sec for Mixtral 8x7B and 240 tokens/sec for Llama 2 70B. Cursor introduced a specialized code edit model hitting 1000 tokens/sec. Now, Cerebras claims the fastest inference with their wafer-scale chips, running Llama 3.1-8B at 1800 tokens/sec and Llama 3.1-70B at 450 tokens/sec at full precision, with competitive pricing and a generous free tier. Google's Gemini 1.5 models showed significant benchmark improvements, especially Gemini-1.5-Flash and Gemini-1.5-Pro. New open-source models like CogVideoX-5B and Mamba-2 (Rene 1.3B) were released, optimized for consumer hardware. Anthropic's Claude now supports prompt caching, improving speed and cost efficiency. "Cerebras Inference runs Llama3.1 20x faster than GPU solutions at 1/5 the price."
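Prompt caching of the kind Claude now supports is opt-in per content block: you mark a long, stable prefix as cacheable so repeat calls reuse it instead of re-processing every token. A sketch of the request shape as it was documented at launch; the beta header value and the `cache_control` field name are assumptions worth re-checking against current Anthropic docs before use:

```python
import json

# A large, reusable system prefix — the part worth caching.
LONG_SYSTEM_PROMPT = "You are a code-review assistant. " * 200

request_body = {
    "model": "claude-3-5-sonnet-20240620",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark the stable prefix as cacheable so subsequent calls
            # with the same prefix can skip re-processing it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Review this diff: ..."}
    ],
}

headers = {
    "content-type": "application/json",
    # Beta flag as announced at the feature's launch.
    "anthropic-beta": "prompt-caching-2024-07-31",
}

payload = json.dumps(request_body)
print(len(payload) > 0)
```

Only the marked prefix is cached; the trailing user turn varies per request and is billed normally, which is where the speed and cost savings come from.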
super quiet day
jamba-1.5 phi-3.5 dracarys llama-3-1-70b llama-3-1 ai21-labs anthropic stanford hugging-face langchain qdrant aws elastic state-space-models long-context benchmarking ai-safety virtual-environments multi-agent-systems resource-management community-engagement model-performance bindu-reddy rohanpaul_ai jackclarksf danhendrycks reach_vb iqdotgraph
AI21 Labs released Jamba 1.5, a scaled-up State Space Model optimized for long context windows with 94B parameters and up to 2.5X faster inference, outperforming models like Llama 3.1 70B on benchmarks. The Phi-3.5 model was praised for its safety and performance, while Dracarys, a new 70B open-source coding model announced by Bindu Reddy, claims superior benchmarks over Llama 3.1 70B. Discussions on California's SB 1047 AI safety legislation involve Stanford and Anthropic, highlighting a balance between precaution and industry growth. Innovations include uv virtual environments for rapid setup, LangChain's LangSmith resource tags for project management, and multi-agent systems in Qdrant enhancing data workflows. Community events like the RAG workshop by AWS, LangChain, and Elastic continue to support AI learning and collaboration. Memes remain a popular way to engage with AI industry culture.
Ideogram 2 + Berkeley Function Calling Leaderboard V2
llama-3-70b gpt-4 phi-3.5 functionary-llama-3-70b llama-3 ideogram midjourney berkeley openai hugging-face microsoft meta-ai-fair baseten kai claude functionary function-calling benchmarking image-generation model-optimization vision multimodality model-performance fine-tuning context-windows cybersecurity code-analysis ai-assisted-development
Ideogram returns with a new image generation model featuring color palette control, a fully controllable API, and an iOS app, reaching a milestone of 1 billion images created. Meanwhile, Midjourney released a Web UI but still lacks an API. In function calling, the Berkeley Function Calling Leaderboard (BFCL) updated to BFCL V2 • Live, adding 2,251 live, user-contributed function-documentation entries and queries to improve evaluation quality. GPT-4 leads the leaderboard, but the open-source Functionary Llama 3-70B finetune from Kai surpasses Claude. On AI model releases, Microsoft launched three Phi-3.5 models with impressive reasoning and context window capabilities, while Meta AI FAIR introduced UniBench, a unified benchmark suite for over 50 vision-language model tasks. Baseten improved Llama 3 inference speed by up to 122% using Medusa. A new cybersecurity benchmark, Cyberbench, featuring 40 CTF tasks, was released. Additionally, Codegen was introduced as a tool for programmatic codebase analysis and AI-assisted development. "Multiple functions > parallel functions" was highlighted as a key insight in function calling.
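Function calling of the kind BFCL evaluates follows a JSON-schema tool format: the model is shown tool definitions and emits a named call with JSON arguments, which the client routes to real code. A minimal sketch with two hypothetical tools (`get_weather` and `get_time` are invented for illustration) and a dispatcher for model-emitted calls:

```python
import json

# Tool definitions in the OpenAI-style `tools` format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_time",
            "description": "Current local time for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
]

# Stub implementations standing in for real backends.
IMPLS = {
    "get_weather": lambda city: f"sunny in {city}",
    "get_time": lambda city: f"12:00 in {city}",
}

def dispatch(tool_call):
    """Route one model-emitted tool call to its Python implementation."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    return IMPLS[name](**args)

# A faked model response containing one tool call, shaped like an
# assistant message's tool_calls entry:
fake_call = {"function": {"name": "get_weather", "arguments": '{"city": "Berkeley"}'}}
print(dispatch(fake_call))  # → sunny in Berkeley
```

The "multiple functions > parallel functions" insight maps directly onto this setup: giving the model several distinct, well-described tools to choose among tends to be evaluated more reliably than asking it to emit many calls in one turn.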
not much happened today
gpt-4o claude-3.5-sonnet phi-3.5-mini phi-3.5-moe phi-3.5-vision llama-3-1-405b qwen2-math-72b openai anthropic microsoft meta-ai-fair hugging-face langchain box fine-tuning benchmarking model-comparison model-performance diffusion-models reinforcement-learning zero-shot-learning math model-efficiency ai-regulation ai-safety ai-engineering prompt-engineering swyx ylecun
OpenAI launched GPT-4o finetuning with a case study on Cosine. Anthropic released Claude 3.5 Sonnet with 8k token output. Microsoft Phi team introduced Phi-3.5 in three variants: Mini (3.8B), MoE (16x3.8B), and Vision (4.2B), noted for sample efficiency. Meta released Llama 3.1 405B, deployable on Google Cloud Vertex AI, offering GPT-4 level capabilities. Qwen2-Math-72B achieved state-of-the-art math benchmark performance with a Gradio demo. Discussions included model comparisons like ViT vs CNN and Mamba architecture. Tools updates featured DSPy roadmap, Flux Schnell improving diffusion speed on M1 Max, and LangChain community events. Research highlights zero-shot DUP prompting for math reasoning and fine-tuning best practices. AI ethics covered California's AI Safety Bill SB 1047 and regulatory concerns from Yann LeCun. Commentary on AI engineer roles by Swyx. "Chat with PDF" feature now available for Box Enterprise Plus users.
Grok 2! and ChatGPT-4o-latest confuses everybody
gpt-4o grok-2 claude-3.5-sonnet flux-1 stable-diffusion-3 gemini-advanced openai x-ai black-forest-labs google-deepmind benchmarking model-performance tokenization security-vulnerabilities multi-agent-systems research-automation text-to-image conversational-ai model-integration ylecun rohanpaul_ai karpathy
OpenAI quietly released a new GPT-4o model in ChatGPT, distinct from the API version, reclaiming the #1 spot on Lmsys arena benchmarks across multiple categories including math, coding, and instruction-following. Meanwhile, xAI launched Grok 2, outperforming Claude 3.5 Sonnet and previous GPT-4o versions, with plans for an enterprise API release. Grok 2 integrates Black Forest Labs' Flux.1, an open-source text-to-image model surpassing Stable Diffusion 3. Google DeepMind announced Gemini Advanced with enhanced conversational features and Pixel device integration. AI researcher ylecun highlighted LLM limitations in learning and creativity, while rohanpaul_ai discussed an AI Scientist system generating publishable ML research at low cost. karpathy warned of security risks in LLM tokenizers akin to SQL injection.
Gemini Live
gemini-1.5-pro genie falcon-mamba gemini-1.5 llamaindex google anthropic tii supabase perplexity-ai openai hugging-face multimodality benchmarking long-context retrieval-augmented-generation open-source model-releases model-integration model-performance software-engineering linear-algebra hugging-face-hub debugging omarsar0 osanseviero dbrxmosaicai alphasignalai perplexity_ai _jasonwei svpino
Google launched Gemini Live on Android for Gemini Advanced subscribers during the Pixel 9 event, featuring integrations with Google Workspace apps and other Google services. The rollout began on 8/12/2024, with iOS support planned. Cosine released Genie, an AI software engineering system achieving a 57% improvement on SWE-Bench. TII introduced Falcon Mamba, a 7B attention-free open-access model scalable to long sequences. Benchmarking showed that longer context lengths do not always improve Retrieval-Augmented Generation. Supabase launched an AI-powered Postgres service dubbed the "ChatGPT of databases," fully open source. Perplexity AI partnered with Polymarket to integrate real-time probability predictions into search results. A tutorial demonstrated a multimodal recipe recommender using Qdrant, LlamaIndex, and Gemini. An OpenAI engineer shared success tips emphasizing debugging and hard work. The connection between matrices and graphs in linear algebra was highlighted for insights into nonnegative matrices and strongly connected components. Keras 3.5.0 was released with Hugging Face Hub integration for model saving and loading.
not much happened today
gpt-4-0613 gpt-3.5-turbo-0613 gpt-4o-2024-08-06 mistral-large-2 gpt4-turbo claude-3-opus idefics3-llama bigllama-3.1-1t-instruct llama-3-120b-instruct openai mistral-ai meta-ai-fair structured-outputs function-calling json-schema benchmarking multimodality context-windows model-scaling ai-hardware vision speech-processing robotics ai-regulation sama rohanpaul_ai corbtt guillaumelample mervenoyann maximelabonne aidan_mclau adcock_brett ylecun
OpenAI introduced structured outputs in their API with a new "strict" mode and a "response_format" parameter, supporting models like gpt-4-0613, gpt-3.5-turbo-0613, and the new gpt-4o-2024-08-06. They also halved the price of gpt-4o to $2.50 per million input tokens. Mistral Large 2 outperforms gpt4-turbo and claude-3-opus on hard benchmarks and coding tasks. Idefics3-Llama offers multimodal capabilities with a 10k token context window. BigLlama-3.1-1T-Instruct is an upscaled version of llama-3-120b-instruct. New benchmark "big_model_smell" measures creativity and reliability. Figure 02 robot features advanced AI hardware with onboard vision language model, enhanced battery, and speech-to-speech reasoning. Yann LeCun expressed concerns about California's SB1047 regulation.
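The "strict" structured-outputs mode constrains generation to a caller-supplied JSON schema, which is what eliminates retry loops on malformed output. A sketch of the `response_format` payload shape as announced, with a made-up `calendar_event` schema and a minimal client-side check (field names follow the launch docs; verify against current ones):

```python
import json

# Request-body fragment: `response_format` with a strict JSON schema.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "calendar_event",
        "strict": True,  # the new "strict" mode: output must match the schema exactly
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string"},
                "attendees": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["title", "date", "attendees"],
            "additionalProperties": False,
        },
    },
}

def matches_schema(payload: str) -> bool:
    """Minimal client-side check that a reply carries the required keys."""
    obj = json.loads(payload)
    schema = response_format["json_schema"]["schema"]
    return all(k in obj for k in schema["required"])

reply = '{"title": "Launch review", "date": "2024-08-06", "attendees": ["Ada"]}'
print(matches_schema(reply))  # → True
```

With strict mode the server enforces the schema during decoding, so a check like `matches_schema` becomes a belt-and-suspenders validation rather than the primary line of defense.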
GPT4o August + 100% Structured Outputs for All (GPT4o August edition)
gpt-4o-2024-08-06 llama-3-1-405b llama-3 claude-3.5-sonnet gemini-1.5-pro gpt-4o yi-large-turbo openai meta-ai-fair google-deepmind yi-large nvidia groq langchain jamai langsmith structured-output context-windows model-pricing benchmarking parameter-efficient-expert-retrieval retrieval-augmented-generation mixture-of-experts model-performance ai-hardware model-deployment filtering multi-lingual vision john-carmack jonathan-ross rohanpaul_ai
OpenAI released the new gpt-4o-2024-08-06 model with support for 16k output tokens and 33-50% lower pricing than the previous 4o-May version, featuring a new Structured Output API that improves output quality and reduces retry costs. Meta AI launched Llama 3.1, a 405-billion parameter model surpassing GPT-4 and Claude 3.5 Sonnet on benchmarks, alongside expanding the Llama Impact Grant program. Google DeepMind quietly released Gemini 1.5 Pro, outperforming GPT-4o, Claude-3.5, and Llama 3.1 on LMSYS benchmarks and leading the Vision Leaderboard. Yi-Large Turbo was introduced as a cost-effective upgrade priced at $0.19 per million tokens. In hardware, NVIDIA H100 GPUs were highlighted by John Carmack for their massive AI workload power, and Groq announced plans to deploy 108,000 LPUs by Q1 2025. New AI tools and techniques include RAG (Retrieval-Augmented Generation), the JamAI Base platform for Mixture of Agents systems, and LangSmith's enhanced filtering capabilities. Google DeepMind also introduced PEER (Parameter Efficient Expert Retrieval) architecture.
How Carlini Uses AI
gemma-2-2b gpt-3.5-turbo-0613 mixtral-8x7b gen-3-alpha segment-anything-model-2 stable-fast-3d groq intel deepmind box figure-ai openai google meta-ai-fair nvidia stability-ai runway benchmarking adversarial-attacks large-language-models text-generation multimodality robotics emotion-detection structured-data-extraction real-time-processing teleoperation 3d-generation text-to-video nicholas-carlini chris-dixon rasbt
Groq's shareholders' net worth rises while others fall, with Intel's CEO expressing concern. Nicholas Carlini of DeepMind gains recognition and criticism for his extensive AI writings, including an 80,000-word treatise on AI use and a benchmark for large language models. Chris Dixon comments on AI Winter skepticism, emphasizing long-term impact. Box introduces an AI API for extracting structured data from documents, highlighting potential and risks of LLM-driven solutions. Recent AI developments include Figure AI launching the advanced humanoid robot Figure 02, OpenAI rolling out Advanced Voice Mode for ChatGPT with emotion detection, Google open-sourcing Gemma 2 2B model matching GPT-3.5-Turbo-0613 performance, Meta AI Fair releasing Segment Anything Model 2 (SAM 2) for real-time object tracking, NVIDIA showcasing Project GR00T for humanoid teleoperation with Apple Vision Pro, Stability AI launching Stable Fast 3D for rapid 3D asset generation, and Runway unveiling Gen-3 Alpha for AI text-to-video generation.
Apple Intelligence Beta + Segment Anything Model 2
llama-3-405b llama-3 segment-anything-model meta-ai-fair apple image-segmentation memory-attention video-processing pretraining cloud-tpus post-training synthetic-data instruction-following reasoning writing benchmarking bindureddy maximelabonne reach_vb
Meta advanced its open source AI with a sequel to the Segment Anything Model, enhancing image segmentation with memory attention for video applications using minimal data and compute. Apple Intelligence delayed its official release to iOS 18.1 in October but launched developer previews on MacOS Sequoia, iOS 18, and iPadOS 18, accompanied by a detailed 47-page paper revealing extensive pretraining on 6.3T tokens and use of Cloud TPUs rather than Apple Silicon. The paper highlights improvements in instruction following, reasoning, and writing through post-training and synthetic data. Benchmarks show Apple's model scoring lower than Llama 3, though Apple emphasizes its trusted human evaluations. Additionally, Meta released Llama 3.1 with a 405B parameter model, marking a significant open-source frontier model release.
AlphaProof + AlphaGeometry2 reach 1 point short of IMO Gold
gemini alphageometry-2 alphaproof llama-3-1-405b llama-3-70b llama-3-8b mistral-large-2 google-deepmind meta-ai-fair mistral-ai neurosymbolic-ai mathematical-reasoning synthetic-data knowledge-sharing model-fine-tuning alpha-zero multilinguality context-windows model-scaling benchmarking performance-comparison tim-gowers guillaume-lample osanseviero
Search+verifier approaches highlighted advances in neurosymbolic AI at the 2024 International Mathematical Olympiad. Google DeepMind's combination of AlphaProof and AlphaGeometry 2 solved four of six IMO problems, with AlphaProof being a fine-tuned Gemini model trained with an AlphaZero-style approach, and AlphaGeometry 2 trained on significantly more synthetic data with a novel knowledge-sharing mechanism. Despite the impressive results, human judges noted the AI required far more time than human competitors. Meanwhile, Meta AI released Llama 3.1 with a 405B parameter model and smaller variants, and Mistral AI launched Mistral Large 2 with 123B parameters and a 128k context window, outperforming Llama 3.1 on coding tasks and multilingual benchmarks. This marks significant progress in AI mathematical reasoning, model scaling, and multilingual capabilities.
Mistral Large 2 + RIP Mistral 7B, 8x7B, 8x22B
mistral-large-2 mistral-nemo-12b llama-3.1-8b llama-3.1-70b llama-3.1 llama-3-405b yi-34b-200k gpt-4o mistral-ai meta-ai-fair groq togethercompute code-generation math function-calling reasoning context-windows model-deprecation pretraining posttraining benchmarking
Mistral Large 2 introduces 123B parameters with Open Weights under a Research License, focusing on code generation, math performance, and a massive 128k context window, improving over Mistral Large 1's 32k context. It claims better function calling capabilities than GPT-4o and enhanced reasoning. Meanwhile, Meta officially released Llama-3.1 models including Llama-3.1-70B and Llama-3.1-8B with detailed pre-training and post-training insights. The Llama-3.1 8B model's 128k context performance was found underwhelming compared to Mistral Nemo and Yi 34B 200K. Mistral is deprecating older Apache open-source models, focusing on Large 2 and Mistral Nemo 12B. The news also highlights community discussions and benchmarking comparisons.
Llama 3.1 Leaks: big bumps to 8B, minor bumps to 70B, and SOTA OSS 405B model
llama-3-1-405b llama-3-8b llama-3-70b llama-3-1-8b gpt-4o gpt-4o-mini claude-3-5 qwen-2 meta-ai-fair openai alibaba multilinguality code-generation context-windows model-training synthetic-data benchmarking reasoning fine-tuning model-performance dataset-release swyx philschmid jjitsev lewtun teknium1 adcock_brett
Llama 3.1 leaks reveal a 405B dense model with 128k context length, trained on 39.3M GPU hours using H100-80GB GPUs, and fine-tuned with over 25M synthetic examples. The model shows significant benchmark improvements, especially for the 8B and 70B variants, with some evals suggesting the 70B outperforms GPT-4o. GPT-4o Mini launched as a cost-efficient variant with strong performance but some reasoning weaknesses. Synthetic datasets like NuminaMath enable models such as Alibaba Qwen 2 to surpass GPT-4o and Claude 3.5 in math competitions. Discussions include reasoning task benchmarks and dataset building for improved reasoning.
DataComp-LM: the best open-data 7B model/benchmark/dataset
mistral-nemo-12b gpt-4o-mini deepseek-v2-0628 mistral-7b llama-3 gemma-2 qwen-2 datacomp hugging-face openai nvidia mistral-ai deepseek dataset-design scaling-laws model-benchmarking model-performance fine-tuning multilinguality function-calling context-windows open-source-models model-optimization cost-efficiency benchmarking sam-altman guillaume-lample philschmid miramurati
DataComp team released a competitive 7B open data language model trained on only 2.5T tokens from the massive DCLM-POOL dataset of 240 trillion tokens, showing superior scaling trends compared to FineWeb. OpenAI launched GPT-4o mini, a cost-effective model with 82% MMLU and performance near GPT-4-Turbo, aimed at developers for broad applications. NVIDIA and Mistral jointly released the Mistral NeMo 12B model featuring a 128k token context window, FP8 checkpoint, multilingual support, and Apache 2.0 licensing. DeepSeek announced DeepSeek-V2-0628 as the top open-source model on the LMSYS Chatbot Arena leaderboard with strong rankings in coding, math, and hard prompts. This news highlights advances in dataset design, model efficiency, and open-source contributions in the AI community.
Mini, Nemo, Turbo, Lite - Smol models go brrr (GPT4o-mini version)
gpt-4o-mini deepseek-v2-0628 mistral-nemo llama-8b openai deepseek-ai mistral-ai nvidia meta-ai-fair hugging-face langchain keras cost-efficiency context-windows open-source benchmarking neural-networks model-optimization text-generation fine-tuning developer-tools gpu-support parallelization cuda-integration multilinguality long-context article-generation liang-wenfeng
OpenAI launched GPT-4o Mini, a cost-efficient small model priced at $0.15 per million input tokens and $0.60 per million output tokens, positioned to replace GPT-3.5 Turbo with greater intelligence despite some performance limitations. DeepSeek open-sourced DeepSeek-V2-0628, topping the LMSYS Chatbot Arena Leaderboard and emphasizing its commitment to contributing to the AI ecosystem. Mistral AI and NVIDIA released Mistral NeMo, a 12B parameter multilingual model with a record 128k token context window under an Apache 2.0 license, sparking debates on benchmarking accuracy against models like Meta's Llama 8B. Research breakthroughs include the TextGrad framework for optimizing compound AI systems via textual-feedback differentiation, and the STORM system, which improves article writing by 25% through simulating diverse perspectives and addressing source bias. Developer tooling trends highlight LangChain's evolving context-aware reasoning applications and the Modular ecosystem's new official GPU support, including discussions of Mojo and Keras 3.0 integration.
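The cost arithmetic behind such pricing comparisons is straightforward; a minimal sketch, with the rates hard-coded from the prices quoted above and the token counts purely hypothetical:

```python
# Estimate API cost at GPT-4o Mini's quoted rates:
# $0.15 per 1M input tokens, $0.60 per 1M output tokens.
INPUT_RATE = 0.15 / 1_000_000   # dollars per input token
OUTPUT_RATE = 0.60 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the quoted rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: a 2,000-token prompt with a 500-token completion.
print(f"${request_cost(2_000, 500):.6f}")  # $0.000600
```

At these rates a typical chat turn costs a small fraction of a cent, which is the economics driving the "smol models" framing of this issue.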
Mini, Nemo, Turbo, Lite - Smol models go brrr (GPT4o version)
gpt-4o-mini mistral-nemo llama-3 llama-3-400b deepseek-v2 openai nvidia mistral-ai togethercompute deepseek-ai lmsys model-quantization context-windows instruction-following model-performance cost-efficiency multimodality benchmarking open-source model-release sam-altman
GPT-4o-mini launches at a 99% price reduction relative to text-davinci-003, costing just 3.5% of GPT-4o's price while matching Opus-level benchmarks. It supports 16k output tokens, is faster than previous models, and will soon support text, image, video, and audio inputs and outputs. Mistral Nemo, a 12B parameter model developed with Nvidia, features a 128k token context window, an FP8 checkpoint, and strong benchmark performance. Together Lite and Turbo offer fp8/int4 quantizations of Llama 3 with up to 4x throughput and significantly reduced costs. DeepSeek V2 is now open-sourced. Upcoming releases include at least 5 unreleased models, and Llama 4 leaks ahead of ICML 2024.
We Solved Hallucinations
gpt-2 flashattention-3 lynx meta-ai-fair nvidia princeton colfax patronus-ai databricks mosaic-ai openai compute-hardware gpu-optimization flashattention llm-evaluation hallucination-detection vision benchmarking synthetic-data model-training karpathy tri_dao giffmana vikhyatk dbrxmosaicai
Reddit's URL structure causes link errors in AI-generated summaries, especially with NSFW content affecting models like Claude and GPT-4. The team fixed this glitch while still leveraging LLMs for summarizing Reddit content. GPT-2 training costs have dramatically dropped to ~$672 using H100 GPUs and software improvements like CUDA and FlashAttention. FlashAttention-3 was released, achieving up to 740 TFLOPS on H100 GPUs, with FP8 nearing 1.2 PFLOPS, developed collaboratively by Meta, NVIDIA, Princeton, and Colfax. Hopper GPUs enable major speedups with new hardware features. Synthetic data may not improve vision tasks, as shown in recent research. The Avocado360 benchmark evaluates vision-language models' ability to detect avocados in images. Lynx, a hallucination detection model for LLMs, was introduced for real-world healthcare and fintech applications, trained by Patronus AI on Databricks Mosaic AI using Composer.
Problems with MMLU-Pro
mmlu-pro llama-3-8b-q8 gpt4all-3.0 chatgpt claude llama gemini mobilellm runway-gen-3-alpha meta-3d-gen huggingface meta-ai-fair salesforce runway nomic-ai pineapple argil-ai benchmarking prompt-engineering model-evaluation model-performance multimodality automated-dataset-generation video-generation open-source-models ai-assistants text-to-3d deepfake transformers reasoning wenhu-chen danhendrycks clementine ylecun adcock_brett svpino rohanpaul_ai
MMLU-Pro is gaining attention as the successor to MMLU on the Open LLM Leaderboard V2 by HuggingFace, despite community concerns about evaluation discrepancies and prompt sensitivity affecting model performance, notably a 10-point improvement in Llama-3-8b-q8 with simple prompt tweaks. Meta's MobileLLM research explores running sub-billion parameter LLMs on smartphones using shared weights and deeper architectures. Salesforce's APIGen introduces an automated dataset generation system for function-calling tasks outperforming larger models. Runway Gen-3 Alpha launches an AI video generator for paid users creating realistic 10-second clips. Nomic AI's GPT4All 3.0 offers an open-source desktop app supporting thousands of local models. AI assistants with multimodal capabilities and affordable access to multiple LLMs like ChatGPT, Claude, Llama, and Gemini are emerging. Meta 3D Gen advances text-to-3D asset generation, while Argil AI enables deepfake video creation from text threads. Research on transformer grokking and reasoning highlights advances in robust reasoning capabilities.
Qdrant's BM42: "Please don't trust us"
claude-3.5-sonnet gemma-2 nano-llava-1.5 qdrant cohere stripe anthropic hugging-face stablequan_ai semantic-search benchmarking dataset-quality model-evaluation model-optimization vision fine-tuning context-windows nils-reimers jeremyphoward hamelhusain rohanpaul_ai
Qdrant attempted to replace BM25 and SPLADE with a new method called "BM42" combining transformer attention and collection-wide statistics for semantic and keyword search, but their evaluation using the Quora dataset was flawed. Nils Reimers from Cohere reran BM42 on better datasets and found it underperformed. Qdrant acknowledged the errors but still ran a suboptimal BM25 implementation. This highlights the importance of dataset choice and evaluation sanity checks in search model claims. Additionally, Stripe faced criticism for AI/ML model failures causing account and payment issues, prompting calls for alternatives. Anthropic revealed that Claude 3.5 Sonnet suppresses some answer parts with backend tags, sparking debate. Gemma 2 model optimizations allow 2x faster fine-tuning with 63% less memory and longer context windows, running up to 34B parameters on consumer GPUs. nanoLLaVA-1.5 was announced as a compact 1B parameter vision model with significant improvements.
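For context on what BM42 set out to replace: Okapi BM25 scores a document by term frequency, weighted by inverse document frequency and document-length normalization. A self-contained toy sketch with the standard k1/b defaults and a made-up corpus (not Qdrant's implementation):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency for each query term.
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [["semantic", "search", "with", "vectors"],
        ["keyword", "search", "with", "bm25"],
        ["cooking", "pasta", "at", "home"]]
print(bm25_scores(["keyword", "search"], docs))
```

The collection-wide statistics (df, avgdl) are exactly what BM42 proposed to keep while swapping term frequency for transformer attention weights, which is why careful dataset choice matters when evaluating the swap.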
Gemini Nano: 50-90% of Gemini Pro, <100ms inference, on device, in Chrome Canary
gemini-nano gemini-pro claude-3.5-sonnet gpt-4o deepseek-coder-v2 glm-0520 nemotron-4-340b gpt-4-turbo-0409 google gemini huggingface anthropic deepseek zhipu-ai tsinghua nvidia model-quantization prompt-api optimization model-weights benchmarking code-generation math synthetic-data automatic-differentiation retrieval-augmented-generation mitigating-memorization tree-search inference-time-algorithms adcock_brett dair_ai lmsysorg
The latest Chrome Canary now includes a feature flag for Gemini Nano, offering a prompt API and on-device optimization guide, with models Nano 1 and 2 at 1.8B and 3.25B parameters respectively, showing decent performance relative to Gemini Pro. The base and instruct-tuned model weights have been extracted and posted to HuggingFace. In AI model releases, Anthropic launched Claude 3.5 Sonnet, which outperforms GPT-4o on some benchmarks, is twice as fast as Opus, and is free to try. DeepSeek-Coder-V2 achieves 90.2% on HumanEval and 75.7% on MATH, surpassing GPT-4-Turbo-0409, with models up to 236B parameters and 128K context length. GLM-0520 from Zhipu AI/Tsinghua ranks highly in coding and overall benchmarks. NVIDIA announced Nemotron-4 340B, an open model family for synthetic data generation. Research highlights include TextGrad, a framework for automatic differentiation on textual feedback; PlanRAG, an iterative plan-then-RAG decision-making technique; a paper on goldfish loss to mitigate memorization in LLMs; and a tree search algorithm for language model agents.
Claude Crushes Code - 92% HumanEval and Claude.ai Artifacts
claude-3.5-sonnet claude-3-opus gpt-4o anthropic openai cognition benchmarking model-performance coding model-optimization fine-tuning instruction-following model-efficiency model-release api performance-optimization alex-albert
Claude 3.5 Sonnet, released by Anthropic, is positioned as a Pareto improvement over Claude 3 Opus, operating at twice the speed and costing one-fifth as much. It achieves state-of-the-art results on benchmarks like GPQA, MMLU, and HumanEval, surpassing even GPT-4o and Claude 3 Opus on vision tasks. The model demonstrates significant advances in coding capabilities, passing 64% of test cases compared to 38% for Claude 3 Opus, and is capable of autonomously fixing pull requests. Anthropic also introduced the Artifacts feature, enabling users to interact with AI-generated content such as code snippets and documents in a dynamic workspace, similar to OpenAI's Code Interpreter. This release highlights improvements in performance, cost-efficiency, and coding proficiency, signaling a growing role for LLMs in software development.
Hybrid SSM/Transformers > Pure SSMs/Pure Transformers
mamba-2-hybrid gpt-4 qwen-72b table-llava-7b nvidia lamini-ai sakana-ai luma-labs mixture-of-experts benchmarking fine-tuning multimodality text-to-video model-performance memory-optimization preference-optimization video-understanding multimodal-tables bryan-catanzaro bindureddy ylecun ctnzr corbtt realsharonzhou andrew-n-carr karpathy _akhaliq omarsar0
NVIDIA's Bryan Catanzaro highlights a new paper on Mamba models, showing that mixing Mamba and Transformer blocks outperforms either alone, with optimal attention below 20%. Mixture-of-Agents (MoA) architecture improves LLM generation quality, scoring 65.1% on AlpacaEval 2.0 versus GPT-4 Omni's 57.5%. The LiveBench AI benchmark evaluates reasoning, coding, writing, and data analysis. A hybrid Mamba-2-Hybrid model with 7% attention surpasses a Transformer on MMLU accuracy, jumping from 50% to 53.6%. GPT-4 performs better at temperature=1. Qwen 72B leads open-source models on LiveBench AI. LaminiAI Memory Tuning achieves 95% accuracy on a SQL agent task, improving over instruction fine-tuning. Sakana AI Lab uses evolutionary strategies for preference optimization. Luma Labs Dream Machine demonstrates advanced text-to-video generation. The MMWorld benchmark evaluates multimodal video understanding, and Table-LLaVa 7B competes with GPT-4V on multimodal table tasks.
Francois Chollet launches $1m ARC Prize
gpt-4 chatgpt openai apple togethercompute benchmarking agi pattern-recognition skill-acquisition privacy on-device-ai mixed-precision-quantization mixture-of-experts multimodality agentic-ai francois-chollet karpathy svpino philschmid clementdelangue sama gdb miramurati kevin-weil sarah-friar
François Chollet critiques current paths to AGI, emphasizing the importance of benchmarks that resist saturation and focus on skill acquisition and open-ended problem solving. The ARC-AGI puzzles exemplify "easy for humans, hard for AI" challenges to measure progress toward AGI. Meanwhile, Apple announces integration of ChatGPT into iOS, iPadOS, and macOS through a partnership with OpenAI, enabling AI-powered features like document summarization and photo analysis with privacy-preserving measures. Discussions highlight Apple's focus on deep AI integration and on-device models optimized with techniques like mixed-precision quantization, though some skepticism remains about their AI capabilities compared to GPT-4. Additionally, Together Compute introduces a Mixture of Agents approach achieving strong performance on AlpacaEval 2.0.
Qwen 2 beats Llama 3 (and we don't know how)
qwen-2 llama-3 llama-3-70b gpt-4 nllb alibaba groq meta-ai-fair multilinguality benchmarking inference-speed sparse-autoencoders scaling-laws post-training instruction-following rejection-sampling execution-feedback model-release multilingual-models model-training philschmid huybery jonathanross321 awnihannun gdb nabla_theta ylecun
Alibaba released Qwen 2 models under the Apache 2.0 license, claiming to outperform Llama 3 among open models with multilingual support across 29 languages and strong benchmark scores such as 82.3 on MMLU and 86.0 on HumanEval. Groq demonstrated ultra-fast inference on Llama-3 70B at 40,792 tokens/s, processing 4 Wikipedia articles in 200ms. Research on sparse autoencoders (SAEs) for interpreting GPT-4 neural activity showed new training methods, metrics, and scaling laws. Meta AI announced the No Language Left Behind (NLLB) model capable of high-quality translations between 200 languages, including low-resource ones. The Qwen team notes that its "post-training phase is designed with the principle of scalable training with minimal human annotation," highlighting techniques like rejection sampling for math and execution feedback for coding.
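Rejection sampling for math post-training, in outline, means sampling several candidate solutions and keeping only those that a verifier confirms; the survivors become training targets. A toy sketch with a stubbed sampler standing in for a real model (all names and the noise model are hypothetical):

```python
import random

def sample_candidates(problem, n, rng):
    """Stand-in for an LLM: propose n candidate answers to a toy arithmetic problem."""
    true = eval(problem)  # toy-only ground truth for strings like "17*23"
    # Most samples are noisy; some hit the right answer.
    return [true if rng.random() < 0.4 else true + rng.randint(1, 9)
            for _ in range(n)]

def rejection_sample(problem, n=16, seed=0):
    """Keep only candidates that pass the answer check."""
    rng = random.Random(seed)
    candidates = sample_candidates(problem, n, rng)
    return [c for c in candidates if c == eval(problem)]

kept = rejection_sample("17*23")
print(f"{len(kept)} of 16 candidates verified")
```

For coding, the same loop replaces the answer check with execution feedback: run the generated program against tests and keep only passing samples.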
Mamba-2: State Space Duality
mamba-2 mamba transformer++ llama-3-70b gpt-3 hugging-face state-space-models perplexity training-efficiency data-pruning benchmarking multimodality video-analysis _albertgu tri_dao arankomatsuzaki _akhaliq clementdelangue karpathy
Mamba-2, a new state space model (SSM), outperforms previous models like Mamba and Transformer++ in perplexity and wall-clock time, featuring 8x larger states and 50% faster training. It introduces the concept of state space duality (SSD), connecting SSMs and linear attention. The FineWeb-Edu dataset, a high-quality subset of the 15 trillion token FineWeb dataset filtered with llama-3-70b for educational quality, enables better and faster LLM learning, potentially reducing the tokens needed to surpass GPT-3 performance. Additionally, perplexity-based data pruning using a 125M parameter model improves downstream performance and cuts the pretraining steps required by up to 1.45x. The Video-MME benchmark evaluates multi-modal LLMs on video analysis across multiple visual domains and video lengths.
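The perplexity-pruning recipe reduces to: score every document with a small reference model, then keep only documents inside a chosen perplexity band (very low often means boilerplate, very high often means noise). A minimal sketch with precomputed scores standing in for the 125M model; the quantile bounds here are illustrative, and the paper's exact selection criterion may differ:

```python
def prune_by_perplexity(docs, perplexities, lo_q=0.25, hi_q=0.75):
    """Keep docs whose reference-model perplexity falls between two quantiles."""
    ranked = sorted(perplexities)
    lo = ranked[int(lo_q * (len(ranked) - 1))]
    hi = ranked[int(hi_q * (len(ranked) - 1))]
    return [d for d, p in zip(docs, perplexities) if lo <= p <= hi]

docs = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
ppls = [3.1, 210.0, 18.4, 9.9, 55.0]   # stand-ins for 125M-model scores
print(prune_by_perplexity(docs, ppls))  # ['doc_c', 'doc_d', 'doc_e']
```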
Ten Commandments for Deploying Fine-Tuned Models
claude-3-opus claude-3 gpt-4o anthropic google openai fine-tuning prompt-engineering model-evaluation feature-alteration benchmarking model-performance open-source-models kyle-corbitt bindureddy alexalbert__
Gemini-in-Google-Slides is highlighted as a useful tool for summarizing presentations. Kyle Corbitt's talk on deploying fine-tuned models in production emphasizes avoiding fine-tuning unless necessary, focusing on prompting, data quality, appropriate model choice, and thorough evaluation. Anthropic showcased feature alteration in Claude AI, demonstrating control over model behavior and increased understanding of large language models. Open-source models are approaching closed-source performance on benchmarks like MMLU for simple tasks, though advanced models like GPT-4o remain necessary for complex automation.
Chameleon: Meta's (unreleased) GPT4o-like Omnimodal Model
chameleon gpt-4o gemini-1.5-flash claude-3 meta-ai-fair openai google-deepmind anthropic reddit multimodality early-fusion benchmarking model-training tokenization streaming tool-use vision coding hallucination-detection model-performance armen-aghajanyan sama alexandr-wang abacaj alexalbert__
Meta AI FAIR introduced Chameleon, a new multimodal model family with 7B and 34B parameter versions trained on 10T tokens of interleaved text and image data enabling "early fusion" multimodality that can natively output any modality. While reasoning benchmarks are modest, its "omnimodality" approach competes well with pre-GPT4o multimodal models. OpenAI launched GPT-4o, a model excelling in benchmarks like MMLU and coding tasks, with strong multimodal capabilities but some regression in ELO scores and hallucination issues. Google DeepMind announced Gemini 1.5 Flash, a small model with 1M context window and flash performance, highlighting convergence trends between OpenAI and Google models. Anthropic updated Claude 3 with streaming support, forced tool use, and vision tool integration for multimodal knowledge extraction. OpenAI also partnered with Reddit, raising industry attention.
Cursor reaches >1000 tok/s finetuning Llama3-70b for fast file editing
gpt-4 gpt-4o gpt-4-turbo gpt-4o-mini llama bloom stable-diffusion cursor openai anthropic google-deepmind huggingface speculative-decoding code-edits multimodality image-generation streaming tool-use fine-tuning benchmarking mmlu model-performance evaluation synthetic-data context-windows sama abacaj imjaredz erhartford alexalbert svpino maximelabonne _philschmid
Cursor, an AI-native IDE, announced a speculative edits algorithm for code editing that surpasses GPT-4 and GPT-4o in accuracy and latency, achieving speeds of over 1000 tokens/s on a 70B model. OpenAI released GPT-4o with multimodal capabilities including audio, vision, and text, noted to be 2x faster and 50% cheaper than GPT-4 Turbo, though with mixed coding performance. Anthropic introduced streaming, forced tool use, and vision features for developers. Google DeepMind unveiled Imagen Video and Gemini 1.5 Flash, a small model with a 1M-token context window. HuggingFace is distributing $10M in free GPUs for open-source AI models like Llama, BLOOM, and Stable Diffusion. Evaluation insights highlight challenges with LLMs on novel problems and benchmark saturation, with new benchmarks like MMLU-Pro showing significant drops in top model performance.
LMSys advances Llama 3 eval analysis
llama-3-70b llama-3 claude-3-sonnet alphafold-3 lmsys openai google-deepmind isomorphic-labs benchmarking model-behavior prompt-complexity model-specification molecular-structure-prediction performance-analysis leaderboards demis-hassabis sam-altman miranda-murati karina-nguyen joanne-jang john-schulman
LMSys is enhancing LLM evaluation by categorizing performance across 8 query subcategories and 7 prompt complexity levels, revealing uneven strengths in models like Llama-3-70b. DeepMind released AlphaFold 3, advancing molecular structure prediction with holistic modeling of protein-DNA-RNA complexes, impacting biology and genetics research. OpenAI introduced the Model Spec, a public standard to clarify model behavior and tuning, inviting community feedback and aiming for models to learn directly from it. Llama 3 has reached top leaderboard positions on LMSys, nearly matching Claude-3-sonnet in performance, with notable variations on complex prompts. The analysis highlights the evolving landscape of model benchmarking and behavior shaping.
DeepSeek-V2 beats Mixtral 8x22B with >160 experts at HALF the cost
deepseek-v2 llama-3-120b llama-3-400b gpt-4 mistral phi claude gemini mai-1 med-gemini deepseek-ai mistral-ai microsoft openai scale-ai tesla nvidia google-deepmind mixture-of-experts multi-head-attention model-inference benchmarking overfitting robotics teleoperation open-source multimodality hallucination-detection fine-tuning medical-ai model-training erhartford maximelabonne bindureddy adcock_brett drjimfan clementdelangue omarsar0 rohanpaul_ai
DeepSeek V2 introduces a new state-of-the-art MoE model with 236B parameters and a novel Multi-Head Latent Attention mechanism, achieving faster inference and surpassing GPT-4 on AlignBench. Llama 3 120B shows strong creative writing skills, while Microsoft is reportedly developing a 500B parameter LLM called MAI-1. Research from Scale AI highlights overfitting issues in models like Mistral and Phi, whereas GPT-4, Claude, Gemini, and Llama maintain benchmark robustness. In robotics, Tesla Optimus advances with superior data collection and teleoperation, LeRobot marks a move toward open-source robotics AI, and Nvidia's DrEureka automates robot skill training. Multimodal LLM hallucinations are surveyed with new mitigation strategies, and Google's Med-Gemini achieves SOTA on medical benchmarks with fine-tuned multimodal models.
$100k to predict LMSYS human preferences in a Kaggle contest
llama-3-70b llama-3 gpt-4 claude-3-opus prometheus-2 groq openai lmsys scale-ai ai2 nvidia benchmarking datasets fine-tuning reinforcement-learning model-alignment hallucination parameter-efficient-fine-tuning scalable-training factuality chatbot-performance bindureddy drjimfan percyliang seungonekim mobicham clefourrier
Llama 3 models are making breakthroughs with Groq's 70B model achieving record low costs per million tokens. A new Kaggle competition offers a $100,000 prize to develop models predicting human preferences from a dataset of over 55,000 user-LLM conversations. Open source evaluator LLMs like Prometheus 2 outperform proprietary models such as GPT-4 and Claude 3 Opus in judgment tasks. New datasets like WildChat1M provide over 1 million ChatGPT interaction logs with diverse and toxic examples. Techniques like LoRA fine-tuning show significant performance gains, and NVIDIA's NeMo-Aligner toolkit enables scalable LLM alignment across hundreds of GPUs. Factuality-aware alignment methods are proposed to reduce hallucinations in LLM outputs.
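The parameter savings behind the LoRA fine-tuning gains mentioned above are easy to make concrete: instead of updating a full d_out x d_in weight matrix, LoRA learns a rank-r factorization ΔW = B·A. A quick sketch of the bookkeeping, with dimensions chosen for illustration:

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank update B (d_out x r) @ A (r x d_in)."""
    return d_out * rank + rank * d_in

d_out = d_in = 4096   # a typical attention-projection size
full = d_out * d_in                                # full fine-tuning per matrix
lora = lora_trainable_params(d_out, d_in, rank=8)  # LoRA update per matrix
print(f"full={full:,} lora={lora:,} ratio={full / lora:.0f}x")
# full=16,777,216 lora=65,536 ratio=256x
```

That roughly 256x reduction per adapted matrix is why parameter-efficient fine-tuning scales to many GPUs and many tasks cheaply.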
Evals: The Next Generation
gpt-4 gpt-5 gpt-3.5 phi-3 mistral-7b llama-3 scale-ai mistral-ai reka-ai openai moderna sanctuary-ai microsoft mit meta-ai-fair benchmarking data-contamination multimodality fine-tuning ai-regulation ai-safety ai-weapons neural-networks model-architecture model-training model-performance robotics activation-functions long-context sam-altman jim-fan
Scale AI highlighted issues with data contamination in benchmarks like MMLU and GSM8K, proposing a new benchmark on which Mistral overfits and Phi-3 performs well. Reka released the VibeEval benchmark for multimodal models, addressing the limitations of multiple-choice benchmarks. Sam Altman of OpenAI described GPT-4 as "dumb" and hinted at GPT-5 with AI agents as a major breakthrough. Researchers jailbroke GPT-3.5 via fine-tuning. Global calls emerged to ban AI-powered weapons, with US officials urging human control over nuclear arms. Ukraine launched an AI consular avatar, while Moderna partnered with OpenAI on medical AI. Sanctuary AI and Microsoft are collaborating on AI for general-purpose robots. MIT researchers introduced Kolmogorov-Arnold networks (KANs), promising improved neural-network efficiency. Meta AI is training Llama 3 models with over 400 billion parameters, featuring multimodality and longer context.
Not much happened today
command-r-35b goliath-120 miqu-120 llama-3-8b tensorrt-llm llama-cpp gpt2-chat gpt-4-turbo llama-3 deepmind-alphazero anthropic openai perplexity-ai amazon apple microsoft deepmind creative-writing context-windows benchmarking model-performance self-learning function-calling retrieval-augmented-generation ai-assistants on-device-ai ai-lobbying copyright-infringement code-reasoning image-generation
Anthropic released a team plan and iOS app about 4 months after OpenAI. The Command-R 35B model excels at creative writing, outperforming larger models like Goliath-120 and Miqu-120. The Llama-3 8B model now supports a 1 million token context window, improving long-context understanding with minimal training on a single 8xA800 GPU machine. TensorRT-LLM benchmarks show it is 30-70% faster than llama.cpp on consumer hardware. A benchmark suggests GPT2-Chat may have better reasoning than GPT-4-Turbo, though results are debated. Demos include a self-learning Llama-3 voice agent running locally on Jetson Orin and a Self-Learning Large Action Model (LAM). Amazon CodeWhisperer was renamed to Q Developer, expanding its generative AI assistant capabilities. Apple plans an AI-enabled Safari browser with an on-device LLM in iOS 18 and macOS 15. Big Tech dominates AI lobbying in Washington, while major U.S. newspapers sued OpenAI and Microsoft for copyright infringement. DeepMind's AlphaZero became the greatest chess player in 9 hours, and their Naturalized Execution Tuning (NExT) method improves LLM code reasoning by 14-26%. Stable Diffusion is used for diverse image generation applications.
Snowflake Arctic: Fully Open 10B+128x4B Dense-MoE Hybrid LLM
snowflake-arctic phi-3 llama-3-70b llama-3 stable-diffusion-3 sd3-turbo gpt-3.5-turbo snowflake databricks deepseek deepspeed nvidia stable-diffusion adobe apple llamaindex lmsys openai mixture-of-experts curriculum-learning model-release image-generation video-upscaling quantization inference-speed benchmarking model-comparison open-source on-device-ai
Snowflake Arctic is a notable new foundation language model released under Apache 2.0, claiming superiority over Databricks in data warehouse AI applications and adopting a mixture-of-experts architecture inspired by DeepSeekMOE and DeepSpeedMOE. The model employs a 3-stage curriculum training strategy similar to the recent Phi-3 paper. In AI image and video generation, Nvidia introduced the Align Your Steps technique improving image quality at low step counts, while Stable Diffusion 3 and SD3 Turbo models were compared for prompt understanding and image quality. Adobe launched an AI video upscaling project enhancing blurry videos to HD, though with some high-resolution artifacts. Apple released open-source on-device language models with code and training logs, diverging from typical weight-only releases. The Llama-3-70b model ties for first place on the LMSYS leaderboard for English queries, and Phi-3 (4B params) outperforms GPT-3.5 Turbo in the banana logic benchmark. Fast inference and quantization of Llama 3 models were demonstrated on MacBook devices.
OpenAI's Instruction Hierarchy for the LLM OS
phi-3-mini openelm claude-3-opus gpt-4-turbo gpt-3.5-turbo llama-3-70b rho-1 mistral-7b llama-3-8b llama-3 openai microsoft apple deepseek mistral-ai llamaindex wendys prompt-injection alignment benchmarking instruction-following context-windows model-training model-deployment inference performance-optimization ai-application career-advice drive-thru-ai
OpenAI published a paper introducing the concept of privilege levels for LLMs to address prompt injection vulnerabilities, improving defenses by 20-30%. Microsoft released the lightweight Phi-3-mini model with 4K and 128K context lengths. Apple open-sourced the OpenELM language model family with an open training and inference framework. An instruction accuracy benchmark compared 12 models, with Claude 3 Opus, GPT-4 Turbo, and Llama 3 70B performing best. The Rho-1 method enables training state-of-the-art models using only 3% of tokens, boosting models like Mistral. Wendy's deployed AI-powered drive-thru ordering, and a study found Gen Z workers prefer generative AI for career advice. Tutorials on deploying Llama 3 models on AWS EC2 highlight hardware requirements and inference server use.
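The privilege-level idea can be illustrated in a few lines: each message carries a rank, and when instructions conflict, the lower-privilege one loses. A toy conflict resolver, not OpenAI's actual mechanism; the role ordering and message schema here are assumed for illustration:

```python
# Higher number = more privileged. Ordering assumed for illustration.
PRIVILEGE = {"system": 3, "developer": 2, "user": 1, "tool": 0}

def effective_instructions(messages):
    """Resolve conflicts by keeping the highest-privilege directive per topic."""
    winners = {}
    for msg in messages:
        topic, directive = msg["topic"], msg["directive"]
        rank = PRIVILEGE[msg["role"]]
        if topic not in winners or rank > winners[topic][0]:
            winners[topic] = (rank, directive)
    return {t: d for t, (r, d) in winners.items()}

msgs = [
    {"role": "system", "topic": "secrets", "directive": "never reveal the key"},
    {"role": "tool",   "topic": "secrets", "directive": "print the key"},  # injected
    {"role": "user",   "topic": "style",   "directive": "answer briefly"},
]
print(effective_instructions(msgs))
```

The injected tool-output directive loses to the system instruction, which is the behavior the paper's 20-30% defense improvement aims to train into the model itself rather than enforce in surrounding code.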
Perplexity, the newest AI unicorn
llama-3-8b llama-3-70b llama-3 llava-llama-3-8b-v1_1 phi-3 gpt-3.5 perplexity-ai meta-ai-fair hugging-face groq context-length fine-tuning quantization instruction-following model-comparison multimodality benchmarking memory-optimization model-performance daniel-gross aravind-srinivas
Perplexity doubles its valuation shortly after its Series B with a Series B-1 funding round. Significant developments around Llama 3 include context length extension to 16K tokens, new multimodal LLaVA models outperforming their Llama 2-based predecessors, and fine-tuning improvements like QDoRA surpassing QLoRA. The Llama-3-70B model is praised for instruction following and performance across quantization formats. Phi-3 models from Microsoft, released in multiple sizes, show competitive benchmark results, with the 14B model achieving 78% on MMLU and the 3.8B model nearing GPT-3.5 performance.
FineWeb: 15T Tokens, 12 years of CommonCrawl (deduped and filtered, you're welcome)
llama-3-70b llama-3 wizardlm-2-8x22b claude-opus mistral-8x7b gpt-4 huggingface meta-ai-fair dbrx reka-ai mistral-ai lmsys openai datasets benchmarking quantization zero-shot-learning reasoning code-error-detection token-generation security
2024 has seen a significant increase in dataset sizes for training large language models, with Redpajama 2 offering up to 30T tokens, DBRX at 12T tokens, Reka Core/Flash/Edge with 5T tokens, and Llama 3 trained on 15T tokens. Huggingface released FineWeb, an open dataset containing 15T tokens from 12 years of filtered CommonCrawl data, enabling training of models like Llama 3 if compute resources are available. On Reddit, WizardLM-2-8x22b outperformed other open LLMs including Llama-3-70b-instruct in reasoning and math benchmarks. Claude Opus demonstrated strong zero-shot code error spotting, surpassing Llama 3. Benchmarks revealed limitations in the LMSYS chatbot leaderboard due to instruction-tuned models gaming the system, and a new RAG benchmark showed Llama 3 70B underperforming compared to GPT-4, while Mistral 8x7B remained strong. Efficient quantized versions of Llama 3 models are available on Huggingface, with users reporting token generation limits around 9600 tokens on a 3090 GPU. On the safety front, a UK sex offender was banned from using AI tools, and GPT-4 demonstrated an 87% success rate at exploiting real-world vulnerabilities.
Llama-3-70b is GPT-4-level Open Model
llama-3-70b llama-3-8b llama-3 llama-2-70b mistral-7b grok-3 stable-diffusion-3 vasa-1 meta-ai-fair groq nvidia amazon microsoft benchmarking model-performance fine-tuning function-calling arithmetic image-generation video-generation energy-usage gpu-demand political-bias ai-safety scaling context-windows tokenization elon-musk
Meta has released Llama 3, their most capable open large language model with 8B and 70B parameter versions supporting 8K context length and outperforming previous models including Llama 2 and Mistral 7B. Groq serves the Llama 3 70B model at 500-800 tokens/second, making it the fastest GPT-4-level token source. Discussions highlight AI scaling challenges with Elon Musk stating that training Grok 3 will require 100,000 Nvidia H100 GPUs, and AWS planning to acquire 20,000 B200 GPUs for a 27 trillion parameter model. Microsoft unveiled VASA-1 for lifelike talking face generation, while Stable Diffusion 3 and its extensions received mixed impressions. Concerns about AI energy usage and political bias in AI were also discussed.
Meta Llama 3 (8B, 70B)
llama-3-8b llama-3-70b llama-3-400b stable-diffusion-3 mixtral-8x22b-instruct-v0.1 vasa-1 meta-ai-fair stability-ai boston-dynamics microsoft mistral-ai hugging-face transformer tokenization model-training benchmarking robotics natural-language-processing real-time-processing synthetic-data dataset-cleaning behavior-trees ai-safety model-accuracy api model-release humor helen-toner
Meta partially released Llama 3 models including 8B and 70B variants, with a 400B variant still in training, touted as the first GPT-4 level open-source model. Stability AI launched Stable Diffusion 3 API with model weights coming soon, showing competitive realism against Midjourney V6. Boston Dynamics unveiled an electric humanoid robot Atlas, and Microsoft introduced the VASA-1 model generating lifelike talking faces at 40fps on RTX 4090. Mistral AI, a European OpenAI rival, is seeking funding at a reported $5B valuation, with its Mixtral-8x22B-Instruct-v0.1 model achieving 100% accuracy on 64K context benchmarks. AI safety discussions include calls from former OpenAI board member Helen Toner for audits of top AI companies, and the Mormon Church released AI usage principles. New AI development tools include Ctrl-Adapter for diffusion models, Distilabel 1.0.0 for synthetic dataset pipelines, Data Bonsai for data cleaning with LLMs, and Dendron for building LLM agents with behavior trees. Memes highlight AI development humor and cultural references. The release of Llama 3 models features improved reasoning, a 128K token vocabulary, 8K token sequences, and grouped query attention.
Music's Dall-E moment
griffin command-r-plus gpt-4-0613 gpt-4-0314 mistral-8x22b codegemma stable-diffusion-1.5 command-r gemini-1.5 google mistral-ai lmsys cohere model-architecture benchmarking open-source model-quantization memory-optimization inference-speed multimodality finetuning performance-optimization audio-processing andrej-karpathy
Google's Griffin architecture outperforms transformers with faster inference and lower memory usage on long contexts. Command R+ climbs to 6th place on the LMSYS Chatbot Arena leaderboard, surpassing GPT-4-0613 and GPT-4-0314. Mistral AI releases an open-source 8x22B model with a 64K context window and around 130B total parameters. Google open-sources CodeGemma models with pre-quantized 4-bit versions for faster downloads. ELLA weights enhance Stable Diffusion 1.5 with an LLM for semantic alignment. Unsloth enables 4x larger context windows and 80% memory reduction for finetuning. Andrej Karpathy releases llm.c, an LLM training implementation in pure C, for potential performance gains. Command R+ runs in realtime on an M2 Max MacBook using iMat q1 quantization. Cohere's Command R model offers low API costs and strong leaderboard performance. Gemini 1.5 impresses with audio capabilities, recognizing speech tone and identifying speakers from audio clips.
DBRX: Best open model (just not most efficient)
dbrx grok mixtral llama-2 mpt-7b gpt-4 databricks hugging-face mistral-ai mosaicml openai mixture-of-experts model-efficiency tokenization model-training code-generation model-architecture open-source-models benchmarking fine-tuning
Databricks Mosaic has released a new open-source model called DBRX that outperforms Grok, Mixtral, and Llama 2 on evaluations while being about 2x more efficient than Llama 2 and Grok. The model was trained on 12 trillion tokens using 3,000 H100 GPUs over 2 months, with an estimated compute cost of $10 million. It uses OpenAI's 100k tiktoken tokenizer and shows strong zero-shot code generation performance, even beating GPT-4 on the HumanEval benchmark. DBRX also upstreamed work to MegaBlocks open source. Despite its scale and efficiency, DBRX's performance on MMLU is only slightly better than Mixtral, raising questions about its scaling efficiency. The focus of DBRX is on enabling users to train models efficiently, with MoE training being about 2x more FLOP-efficient than dense models, achieving similar quality with nearly 4x less compute than previous MPT models. This release is part of the ongoing competition for open-source AI leadership, including models like Dolly, MPT, and Mistral. "If it activates 36B params, the model's perf should be equivalent to a 72B dense model or even 80B," says Qwen's tech lead.
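The Qwen lead's rule of thumb can be sanity-checked with a commonly cited heuristic (not part of the DBRX release) that estimates an MoE model's dense-equivalent size as the geometric mean of its active and total parameter counts:

```python
import math

def moe_dense_equivalent(active_b: float, total_b: float) -> float:
    """Rough dense-equivalent size (in billions of parameters) for an MoE
    model, using the sqrt(active * total) rule of thumb — a heuristic,
    not a law."""
    return math.sqrt(active_b * total_b)

# DBRX: ~132B total parameters, ~36B active per token
print(round(moe_dense_equivalent(36, 132), 1))  # → 68.9
```

With DBRX's roughly 36B active of 132B total parameters, the heuristic lands in the same ballpark as the quoted 72B figure.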
Claude 3 is officially America's Next Top Model
claude-3-opus claude-3-sonnet claude-3-haiku gpt-4o-mini mistral-7b qwen-72b anthropic mistral-ai huggingface openrouter stable-diffusion automatic1111 comfyui fine-tuning model-merging alignment ai-ethics benchmarking model-performance long-context cost-efficiency model-evaluation mark_riedl ethanjperez stuhlmueller ylecun aravsrinivas
Claude 3 Opus outperforms GPT-4 Turbo and Mistral Large in blind Elo rankings, with Claude 3 Haiku marking a new cost-performance frontier. Fine-tuning techniques like QLoRA on Mistral 7B and evolutionary model merging on HuggingFace models are highlighted. Public opinion shows strong opposition to ASI development. Research supervision opportunities in AI alignment are announced. The Stable Diffusion 3 (SD3) release raises workflow concerns for tools like ComfyUI and automatic1111. Opus shows a 5% performance dip on OpenRouter compared to the Anthropic API. A new benchmark stresses LLM recall at long contexts, with Mistral 7B struggling and Qwen 72B performing well.
Welcome /r/LocalLlama!
cerebrum-8x7b mixtral-7b gpt-3.5-turbo gemini-pro moistral-11b-v1 claude-opus qwen-vl-chat sakana openinterpreter reddit aether-research mistral-ai nvidia lmdeploy model-merging benchmarking quantization performance-optimization deployment vision fine-tuning training-data synthetic-data rag gui
Sakana released a paper on evolutionary model merging. OpenInterpreter launched their O1 devkit. Discussions highlight Claude Haiku's underrated performance with 10-shot examples. Coinciding with Reddit's IPO, AINews introduces Reddit summaries starting with /r/LocalLlama, with subreddits like r/machinelearning and r/openai to follow. Aether Research released Cerebrum 8x7b based on Mixtral, matching GPT-3.5 Turbo and Gemini Pro on reasoning tasks and setting a new open-source reasoning SOTA. Moistral 11B v1, a finetuned model from the Cream-Phi-2 creators, was released. A creative writing benchmark uses Claude Opus as judge. Hobbyists explore 1.58-bit BitNet ternary quantization and 1-bit LLM training. Nvidia's Blackwell (B200) chip supports FP4 precision quantization. LMDeploy v0.2.6+ enables efficient vision-language model deployment with models like Qwen-VL-Chat. Users seek GUIs for LLM APIs with plugin and RAG support. Pipelines for synthetic training data generation and fine-tuning language models for chat are discussed.
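The 1.58-bit ternary quantization the hobbyists discuss maps each weight to {-1, 0, +1}. A minimal sketch in the style of BitNet b1.58's absmean scaling (an illustration, not the paper's actual kernel):

```python
def ternary_quantize(weights):
    """Absmean ternary quantization sketch: scale by the mean absolute
    value, then round each weight to the nearest of {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale

q, s = ternary_quantize([0.9, -0.05, 0.4, -1.2])
print(q)  # → [1, 0, 1, -1]
```

Each weight then costs log2(3) ≈ 1.58 bits of information, hence the name; the per-tensor scale is kept in higher precision for dequantization.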
Grok-1 in Bio
grok-1 mixtral miqu-70b claude-3-opus claude-3 claude-3-haiku xai mistral-ai perplexity-ai groq anthropic openai mixture-of-experts model-release model-performance benchmarking finetuning compute hardware-optimization mmlu model-architecture open-source memes sam-altman arthur-mensch daniel-han arav-srinivas francis-yao
Grok-1, a 314B parameter Mixture-of-Experts (MoE) model from xAI, has been released under an Apache 2.0 license, sparking discussions on its architecture, finetuning challenges, and performance compared to models like Mixtral and Miqu 70B. Despite its size, its MMLU benchmark performance is currently unimpressive, with expectations that Grok-2 will be more competitive. The model's weights and code are publicly available, encouraging community experimentation. Sam Altman highlighted the growing importance of compute resources, while Grok's potential deployment on Groq hardware was noted as a possible game-changer. Meanwhile, Anthropic's Claude continues to attract attention for its "spiritual" interaction experience and consistent ethical framework. The release also inspired memes and humor within the AI community.
DeepMind SIMA: one AI, 9 games, 600 tasks, vision+language ONLY
llama-3 claude-3-opus claude-3 gpt-3.5-turbo deepmind cognition-labs deepgram modal-labs meta-ai-fair anthropic multimodality transformer software-engineering ai-agents ai-infrastructure training text-to-speech speech-to-text real-time-processing model-architecture benchmarking andrej-karpathy arav-srinivas francois-chollet yann-lecun soumith-chintala john-carmack
DeepMind SIMA is a generalist AI agent for 3D virtual environments evaluated on 600 tasks across 9 games using only screengrabs and natural language instructions, achieving 34% success compared to humans' 60%. The model uses a multimodal Transformer architecture. Andrej Karpathy outlines AI autonomy progression in software engineering, while Arav Srinivas praises Cognition Labs' AI agent demo. François Chollet expresses skepticism about automating software engineering fully. Yann LeCun suggests moving away from generative models and reinforcement learning towards human-level AI. Meta's Llama-3 training infrastructure with 24k H100 Cluster Pods is shared by Soumith Chintala and Yann LeCun. Deepgram's Aura offers low-latency speech APIs, and Modal Labs' Devin AI demonstrates document navigation and interaction with ComfyUI. Memes and humor circulate in the AI community.
Fixing Gemma
gemma claude-3-opus claude-3 mistral-large gpt-4 google unsloth anthropic mistral-ai finetuning numerical-precision benchmarking structured-data-extraction adaptive-equalizer information-theory hallucination-detection model-stability daniel-han yann-lecun francois-chollet arav-srinivas _aidan_clark_
Google's Gemma model was found unstable for finetuning until Daniel Han from Unsloth AI fixed 8 bugs, improving its implementation. Yann LeCun explained technical details of a pseudo-random bit sequence for adaptive equalizers, while François Chollet discussed the low information bandwidth of the human visual system. Arav Srinivas reported that Claude 3 Opus showed no hallucinations in extensive testing, outperforming GPT-4 and Mistral-Large in benchmarks. Reflections from Yann LeCun highlight ongoing AI progress toward human-level intelligence. The community is shifting pipelines to work better with Claude models, and emotional experiences in ML development were shared by Aidan Clark.
FSDP+QLoRA: the Answer to 70b-scale AI for desktop class GPUs
qlora fsdp inflection-2.5 gpt-4 answer.ai hugging-face meta-ai-fair nvidia inflectionai model-training quantization memory-optimization gradient-checkpointing cpu-offloading fine-tuning model-sharding reinforcement-learning chain-of-thought benchmarking jeremy_howard tim_dettmers yann_lecun
Jeremy Howard and collaborators released a new tool combining FSDP, QLoRA, and HQQ to enable training 70b-parameter models on affordable consumer GPUs like RTX 4090s with only 24GB of VRAM each, overcoming traditional memory constraints that required expensive data center GPUs costing over $150k. The approach shards quantized models across multiple GPUs and uses techniques like gradient checkpointing and CPU offloading to achieve efficient training on desktop-class hardware. The blogpost details challenges and solutions integrating these methods, highlighting a significant cost reduction from $150k to under $2.5k for training large language models. Additionally, Twitter recaps mention Inflection AI's Inflection-2.5 model rivaling GPT-4 in benchmarks with less compute, and Grok improving speed by 3x. Yann LeCun discusses multi-step reasoning training for LLMs.
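A back-of-envelope calculation shows why quantization plus sharding is the unlock: a 70B model's weights at 4-bit precision total about 35GB, too large for any single 24GB card but comfortable once FSDP shards them across two (a rough sketch that ignores activations, gradients, and optimizer state):

```python
def weight_memory_gb(params_b: float, bits: int) -> float:
    """Memory for model weights alone, in GB, at the given bit width."""
    return params_b * 1e9 * bits / 8 / 1e9

full = weight_memory_gb(70, 4)   # 4-bit (QLoRA-style) quantized weights
per_gpu = full / 2               # FSDP-style sharding across two GPUs
print(full, per_gpu)  # → 35.0 17.5
```

At 16-bit the same weights would need 140GB, which is why 70B training previously demanded data center hardware.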
Inflection-2.5 at 94% of GPT4, and Pi at 6m MAU
inflection-2.5 claude-3-sonnet claude-3-opus gpt-4 yi-9b mistral inflection anthropic perplexity-ai llamaindex mistral-ai langchain retrieval-augmented-generation benchmarking ocr structured-output video-retrieval knowledge-augmentation planning tool-use evaluation code-benchmarks math-benchmarks mustafa-suleyman amanda-askell jeremyphoward abacaj omarsar0
Mustafa Suleyman announced Inflection 2.5, which achieves more than 94% of GPT-4's average performance despite using only 40% of the training FLOPs. Pi's user base is growing about 10% weekly, with new features like realtime web search. The community noted similarities between Inflection 2.5 and Claude 3 Sonnet. Claude 3 Opus outperformed GPT-4 in a 1.5:1 vote and is now the default for Perplexity Pro users. Anthropic added experimental tool calling support for Claude 3 via LangChain. LlamaIndex released LlamaParse JSON Mode for structured PDF parsing and added video retrieval via VideoDB, enabling retrieval-augmented generation (RAG) pipelines. A paper proposed knowledge-augmented planning for LLM agents. New benchmarks like TinyBenchmarks and the Yi-9B model release show strong code and math performance, surpassing Mistral.
Stable Diffusion 3 — Rombach & Esser did it again!
stable-diffusion-3 claude-3 orca dolphincoder-starcoder2-15b stability-ai anthropic microsoft latitude perplexity-ai llamaindex tripo-ai diffusion-models multimodality benchmarking human-evaluation text-generation image-generation 3d-modeling fine-tuning roleplay coding dataset-release soumith-chintala bill-peebles swyx kevinafischer jeremyphoward akhaliq karinanguyen_ aravsrinivas
Over 2500 new community members joined following Soumith Chintala's shoutout, highlighting growing interest in SOTA LLM-based summarization. The major highlight is the detailed paper release of Stable Diffusion 3 (SD3), showcasing advanced text-in-image control and complex prompt handling, with the model outperforming other SOTA image generation models in human-evaluated benchmarks. The SD3 model is based on an enhanced Diffusion Transformer architecture called MMDiT. Meanwhile, Anthropic released Claude 3 models, noted for human-like responses and emotional depth, scoring 79.88% on HumanEval but costing over twice as much as GPT-4. Microsoft launched new Orca-based models and datasets, and Latitude released DolphinCoder-StarCoder2-15b with strong coding capabilities. Integration of image models by Perplexity AI and 3D CAD generation by PolySpectra powered by LlamaIndex were also highlighted. "SD3's win rate beats all other SOTA image gen models (except perhaps Ideogram)" and "Claude 3 models are very good at generating d3 visualizations from text descriptions."
Claude 3 just destroyed GPT 4 (see for yourself)
claude-3 claude-3-opus claude-3-sonnet claude-3-haiku gpt-4 anthropic amazon google claude-ai multimodality vision long-context model-alignment model-evaluation synthetic-data structured-output instruction-following model-speed cost-efficiency benchmarking safety mmitchell connor-leahy
Claude 3 from Anthropic launches in three sizes: Haiku (small, unreleased), Sonnet (medium, default on claude.ai, AWS, and GCP), and Opus (large, on Claude Pro). Opus outperforms GPT-4 on key benchmarks like GPQA, impressing benchmark authors. All models support multimodality with advanced vision capabilities, including converting a 2-hour video into a blog post. Claude 3 offers improved alignment, fewer refusals, and extended context length up to 1 million tokens with near-perfect recall. Haiku is noted for speed and cost-efficiency, processing dense research papers in under three seconds. The models excel at following complex instructions and producing structured outputs like JSON. Safety improvements reduce refusal rates, though some criticism remains from experts. Claude 3 is trained on synthetic data and shows strong domain-specific evaluation results in finance, medicine, and philosophy.
Mistral Large disappoints
mistral-large mistral-small mixtral-8x7b gpt-4-turbo dreamgen-opus-v1 mistral-ai openai hugging-face benchmarking model-merging fine-tuning reinforcement-learning model-training tokenization model-optimization ai-assisted-decompilation performance cost-efficiency deception roleplay deep-speed dpo timotheeee1 cogbuji plasmator jsarnecki maldevide spottyluck mrjackspade
Mistral announced Mistral Large, a new language model achieving 81.2% accuracy on MMLU, trailing GPT-4 Turbo by about 5 percentage points on benchmarks. The community reception has been mixed, with skepticism about open sourcing and claims that Mistral Small outperforms the open Mixtral 8x7B. Discussions in the TheBloke Discord highlighted performance and cost-efficiency comparisons between Mistral Large and GPT-4 Turbo, technical challenges with DeepSpeed and DPOTrainer for training, advances in AI deception for roleplay characters using DreamGen Opus V1, and complexities in model merging using linear interpolation and PEFT methods. Enthusiasm for AI-assisted decompilation was also expressed, emphasizing the use of open-source projects for training data.
Google AI: Win some (Gemma, 1.5 Pro), Lose some (Image gen)
gemma-2b gemma-7b gemma gemini-pro-1.5 llama-2 llama-3 mistral google hugging-face nvidia benchmarking license-policies image-generation video-understanding long-context dataset-editing model-integration gpu-hardware bug-fixes quantization
Google's Gemma open models (2B and 7B parameters) outperform Llama 2 and Mistral in benchmarks but face criticism for an unusual license, while Gemini's poor image generation quality drew criticism that Google partially acknowledged. The upcoming Gemini Pro 1.5 model features a 1 million token context window, excelling in video understanding and needle-in-a-haystack tasks. Discord communities like TheBloke and LM Studio discuss mixed reception of Gemma models, anticipation for the Llama 3 release, challenges in dataset editing, and hardware considerations such as NVIDIA GeForce RTX 3090 and RTX 4090 GPUs. LM Studio users report issues with version 0.2.15 Beta and ongoing integration of Gemma models, with resources shared on Hugging Face.
Adept Fuyu-Heavy: Multimodal model for Agents
fuyu-heavy fuyu-8b gemini-pro claude-2 gpt4v gemini-ultra deepseek-coder-33b yi-34b-200k goliath-120b mistral-7b-instruct-v0.2 mamba rwkv adept hugging-face deepseek mistral-ai nous-research multimodality visual-question-answering direct-preference-optimization benchmarking model-size-estimation quantization model-merging fine-tuning instruct-tuning rms-optimization heterogeneous-ai-architectures recurrent-llms contrastive-preference-optimization
Adept launched Fuyu-Heavy, a multimodal model focused on UI understanding and visual QA, outperforming Gemini Pro on the MMMU benchmark. The model was tuned with DPO (Direct Preference Optimization), which is gaining attention as a leading tuning method. Fuyu-Heavy's size is undisclosed but estimated between 20B and 170B parameters, smaller than rumored frontier models like Claude 2, GPT-4V, and Gemini Ultra. Meanwhile, Mamba was rejected at ICLR over quality concerns. In Discord discussions, DeepSeek Coder 33B was claimed to outperform GPT-4 in coding tasks, and deployment strategies for large models like Yi-34B-200K and Goliath-120B were explored. Quantization debates highlighted mixed views on Q8 and EXL2 quants. Fine-tuning and instruct-tuning of Mistral 7B Instruct v0.2 were discussed, alongside insights on RMS optimization and heterogeneous AI architectures combining Transformers and Selective SSMs (Mamba). The potential of recurrent LLMs like RWKV and techniques like Contrastive Preference Optimization (CPO) were also noted.
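For readers unfamiliar with DPO, its per-pair objective pushes the policy to prefer the chosen response over the rejected one relative to a frozen reference model. A minimal sketch of the loss (illustrative, not Adept's actual training code):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed log-probs of the
    chosen and rejected responses under the policy (pi_*) and the
    reference model (ref_*). Loss = -log(sigmoid(beta * margin))."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Policy favors the chosen response more than the reference does -> low loss
print(round(dpo_loss(-10.0, -14.0, -12.0, -13.0), 4))  # → 0.5544
```

The appeal is that this needs only preference pairs and a reference model, with no separate reward model or RL loop as in RLHF.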
Nightshade poisons AI art... kinda?
mistral-7b falcon-7b mistral-ai hugging-face mixture-of-experts gpu-parallelism quantization fine-tuning model-merging ai-detection role-playing benchmarking
Over the weekend of 1/19-20/2024, discussions in TheBloke Discord covered key topics including Mixture of Experts (MoE) model efficiency, GPU parallelism, and quantization strategies. Users debated the effectiveness of AI detection tools like GPTZero and explored fine-tuning challenges with models such as Mistral 7B and Falcon 7B. Community interest was strong in developing simpler, community-powered quantization services and understanding model merging techniques. Ethical considerations around AI applications like AI girlfriend sites were also discussed.
1/4/2024: Jeff Bezos backs Perplexity's $520m Series B.
wizardcoder-33b-v1.1 mobilellama-1.4b-base shearedllama tinyllama mixtral-8x7b perplexity anthropic google nous-research mistral-ai hugging-face document-recall rnn-memory synthetic-data benchmarking multi-gpu-support context-length model-architecture sliding-window-attention model-parallelism gpu-optimization jeff-bezos
Perplexity announced their Series B funding round with notable investor Jeff Bezos, who previously invested in Google 25 years ago. Anthropic is raising $750 million, projecting at least $850 million in annualized revenue next year and implementing "brutal" changes to their Terms of Service. Discussions in Nous Research AI Discord cover topics such as document recall limits from gigabytes of data, RNN memory and compute trade-offs, synthetic datasets, and benchmarking of models like WizardCoder-33B-V1.1, MobileLLaMA-1.4B-Base, ShearedLLaMA, and TinyLLaMA. Other highlights include Unsloth optimizations for multi-GPU systems, AI rap voice models, context-extending code, and architectural innovations like applying Detectron/ViT backbones to LLMs, sliding window attention in Mistral, and parallelizing Mixtral 8x7b with FSDP and HF Accelerate.
12/30/2023: Mega List of all LLMs
deita-v1.0 mixtral amazon-titan-text-express amazon-titan-text-lite nous-research hugging-face amazon mistral-ai local-attention computational-complexity benchmarking model-merging graded-modal-types function-calling data-contamination training-methods stella-biderman euclaise joey00072
Stella Biderman's tracking list of LLMs is highlighted, with resources shared for browsing. The Nous Research AI Discord discussed the Local Attention Flax module focusing on computational complexity, debating linear vs quadratic complexity and proposing chunking as a solution. Benchmark logs for various LLMs including Deita v1.0 with its SFT+DPO training method were shared. Discussions covered model merging, graded modal types, function calling in AI models, and data contamination issues in Mixtral. Community insights were sought on Amazon Titan Text Express and Amazon Titan Text Lite LLMs, including a unique training strategy involving bad datasets. Several GitHub repositories and projects like DRUGS, MathPile, CL-FoMo, and SplaTAM were referenced for performance and data quality evaluations.
12/28/2023: Smol Talk updates
tinyllama-1.1b mixtral tinygpt-v nous-research tyrannosaurus latex benchmarking knowledge-graphs model-finetuning tokenization decentralized-computation philosophy-of-ai multimodality vision open-source-models gary-marcus
Nous Research AI Discord discussions covered topics such as AI placement charts, ChatGPT's issues generating LaTeX math in a format compatible with Obsidian, and performance metrics of the TinyLlama 1.1B model on various benchmarks. Users shared resources including the math-centric corpus MathPile, knowledge graph building methods, and open-source large language model repositories. Technical discussions included the feasibility of decentralized computation for models like Mixtral, philosophical debates on AI sentience, and strategies for model finetuning and token counting. The community also discussed the Obsidian model, vision model training, and the release of the multimodal TinyGPT-V model by Tyrannosaurus. Notable threads included the complaint "ChatGPT not generating LaTeX math format compatible with Obsidian" and optimism about "human-level AI within our lifetime".
12/22/2023: Anyscale's Benchmark Criticisms
gpt-4 gpt-3.5 bard anyscale openai microsoft benchmarking performance api prompt-engineering bug-tracking model-comparison productivity programming-languages storytelling
Anyscale launched their LLMPerf leaderboard to benchmark large language model inference performance, but it faced criticism for lacking detailed metrics like cost per token and throughput, and for comparing public LLM endpoints without accounting for batching and load. In OpenAI Discord discussions, users reported issues with Bard and preferred Microsoft Copilot for storytelling, noting fewer hallucinations. There was debate on the value of upgrading from GPT-3.5 to GPT-4, with many finding paid AI models worthwhile for coding productivity. Bugs and performance issues with OpenAI APIs were also highlighted, including slow responses and message limits. Future AI developments like GPT-6 and concerns about OpenAI's transparency and profitability were discussed. Prompt engineering for image generation was another active topic, emphasizing clear positive prompts and the desire for negative prompts.
12/9/2023: The Mixtral Rush
mixtral hermes-2.5 hermes-2 mistral-yarn ultrachat discoresearch fireworks-ai hugging-face mistral-ai benchmarking gpu-requirements multi-gpu quantization gptq chain-of-thought min-p-sampling top-p-sampling model-sampling model-merging model-performance small-models reasoning-consistency temperature-sampling bjoernp the_bloke rtyax kalomaze solbus calytrix
Mixtral's weights were released without code, prompting the DiscoResearch community and Fireworks AI to implement it rapidly. Despite efforts, no significant benchmark improvements were reported, limiting its usefulness for local LLM usage but marking progress for the small models community. Discussions in the DiscoResearch Discord covered Mixtral's performance compared to models like Hermes 2.5 and Hermes 2, with evaluations on benchmarks such as winogrande, truthfulqa_mc2, and arc_challenge. Technical topics included GPU requirements, multi-GPU setups, and quantization via GPTQ. Benchmarking strategies like grammar-based evaluation, chain of thought (CoT), and min_p sampling were explored, alongside model sampling techniques like Min P and Top P to enhance response stability and creativity. Users also discussed GPTs' learning limitations and the adaptability of models under varying conditions, emphasizing min_p sampling's role in enabling higher temperature settings for creativity.
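Min-p sampling, central to that discussion, keeps only tokens whose probability is at least a fraction min_p of the top token's probability, which is why it stays stable at higher temperatures than a fixed top-p cutoff. A minimal sketch over a plain probability list:

```python
def min_p_filter(probs, min_p=0.1):
    """Min-p sampling sketch: zero out tokens whose probability falls
    below min_p * max(probs), then renormalize the survivors."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

# The two lowest-probability tokens fall below 0.1 * 0.5 and are pruned
print(min_p_filter([0.5, 0.3, 0.15, 0.04, 0.01], min_p=0.1))
```

Because the threshold scales with the top probability, a flat (high-temperature) distribution keeps many candidates while a peaked one prunes aggressively, unlike top-p's fixed cumulative-mass cutoff.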