Topic: "model-performance"

Google I/O 2026: Gemini 3.5 Flash, Omni, and Google’s Agent Stack

not much happened today

not much happened today

not much happened today

not much happened today

not much happened today

not much happened today

not much happened today

not much happened today

not much happened today

not much happened today

not much happened today

not much happened today

Gemini 3.0 Flash Preview: 1/4 cost of Pro, but ~as smart, retakes Pareto Frontier

MCP -> Agentic AI Foundation, Mistral Devstral 2

DeepSeek V3.2 & 3.2-Speciale: GPT5-High Open Weights, Context Management, Plans for Compute Scaling

Mistral 3: Mistral Large 3 + Ministral 3B/8B/14B open weights models

not much happened today

Black Forest Labs FLUX.2 [pro|flex|dev|klein]: near-Nano Banana quality but Open Weights

Claude Opus 4.5: 3rd new SOTA coding model in past week, 1/3 the price of Opus

AI Engineer Code Summit

OpenAI fires back: GPT-5.1-Codex-Max (API) and GPT 5.1 Pro (ChatGPT)

Gemini 3 Pro — new GDM frontier model 6, Gemini 3 Deep Think, and Antigravity IDE

xAI Grok 4.1: #1 in Text Arena, #1 in EQ-bench, and better Creative Writing

Claude Haiku 4.5

Sora 2: new video+audio model and OpenAI's first Social Network

not much happened today

GDPVal finding: Claude Opus 4.1 within 95% of AGI (human experts in top 44 white collar jobs)

Cognition's $10b Series C; Smol AI updates

not much happened today

OpenAI's IMO Gold model also wins IOI Gold

not much happened today

OpenAI's gpt-oss 20B and 120B, Claude Opus 4.1, DeepMind Genie 3

Gemini 2.5 Deep Think finally ships

not much happened today

not much happened today

not much happened today

ChatGPT Agent: new o* model + unified Deep Research browser + Operator computer use + Code Interpreter terminal

not much happened today

not much happened today

Kimi K2 - SOTA Open MoE proves that Muon can scale to 15T tokens/1T params

SmolLM3: the SOTA 3B reasoning open source LLM

not much happened today

not much happened today

Chinese Models Launch - MiniMax-M1, Hailuo 2 "Kangaroo", Moonshot Kimi-Dev-72B

Execuhires Round 2: Scale-Meta, Lamini-AMD, and Instacart-OpenAI

not much happened today

Gemini 2.5 Pro (06-05) launched at AI Engineer World's Fair

DeepSeek-R1-0528 - Gemini 2.5 Pro-level model, SOTA Open Weights release

not much happened today

Mistral's Agents API and the 2025 LLM OS

Google I/O: new Gemini native voice, Flash, DeepThink, AI Mode (DeepSearch+Mariner+Astra)

not much happened today

ChatGPT Codex, OpenAI's first cloud SWE agent

codex-1 openai-o3 codex-mini gemma-3 blip3-o qwen-2.5 marigold-iid deepseek-v3 lightlab gemini-2.0 lumina-next openai runway salesforce qwen deepseek google google-deepmind j1 software-engineering parallel-processing multimodality diffusion-models depth-estimation scaling-laws reinforcement-learning fine-tuning model-performance multi-turn-conversation reasoning audio-processing sama kevinweil omarsar0 iscienceluvr akhaliq osanseviero c_valenzuelab mervenoyann arankomatsuzaki jasonwei demishassabis philschmid swyx teortaxestex jaseweston

OpenAI launched Codex, a cloud-based software engineering agent powered by codex-1 (an optimized version of OpenAI o3). It is available in research preview for Pro, Enterprise, and Team ChatGPT users and can run tasks such as refactoring and bug fixing in parallel. The Codex CLI was enhanced with quick sign-in and a new low-latency model, codex-mini. Gemma 3 is highlighted as the best open model runnable on a single GPU. Runway released the Gen-4 References API for style transfer in generation. Salesforce introduced BLIP3-o, a unified multimodal model family that uses diffusion transformers for CLIP image features. The Qwen 2.5 models (1.5B and 3B versions) were integrated into the PocketPal app with various chat templates. Marigold IID, a new state-of-the-art open-source depth estimation model, was released. In research, DeepSeek shared insights on scaling and hardware for DeepSeek-V3. Google unveiled LightLab, a diffusion-based method for controlling light sources in images. Google DeepMind's AlphaEvolve uses Gemini 2.0 to discover new math and reduce costs without reinforcement learning. Omni-R1 studied audio's role in fine-tuning audio LLMs. Qwen proposed a parallel scaling law inspired by classifier-free guidance. Salesforce released Lumina-Next on the Qwen base, outperforming Janus-Pro. A study found that LLM performance degrades in multi-turn conversations due to unreliability. J1 incentivizes LLM-as-a-Judge thinking via reinforcement learning. A new Qwen study correlates question and strategy similarity to predict reasoning strategies.

Granola launches team notes, while Notion launches meeting transcription

not much happened today

not much happened today

not much happened today

Cursor @ $9b, OpenAI Buys Windsurf @ $3b

not much happened today

gpt-image-1 - ChatGPT's imagegen model, confusingly NOT 4o, now available in API

not much happened today

DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level

Llama 4's Controversial Weekend Release

not much happened today

>$41B raised today (OpenAI @ 300b, Cursor @ 9.5b, Etched @ 1.5b)

not much happened today

OpenAI adopts MCP

Halfmoon is Reve Image: a new SOTA Image Model from ex-Adobe/Stability trio

Promptable Prosody, SOTA ASR, and Semantic VAD: OpenAI revamps Voice AI

Cohere's Command A claims #3 open model spot (after DeepSeek and Gemma)

not much happened today

Gemma 3 beats DeepSeek V3 in Elo, 2.0 Flash beats GPT4o with Native Image Gen

DeepSeek's Open Source Stack

not much happened today

Anthropic's $61.5B Series E

not much happened today

lots of small launches

not much happened today

AI Engineer Summit Day 1

X.ai Grok 3 and Mira Murati's Thinking Machines

not much happened today

small news items

not much happened today

not much happened today

OpenAI takes on Gemini's Deep Research

o3-mini launches, OpenAI on "wrong side of history"

Mistral Small 3 24B and Tulu 3 405B

TinyZero: Reproduce DeepSeek R1-Zero for $30

Project Stargate: $500b datacenter (1.7% of US GDP) and Gemini 2 Flash Thinking 2

not much happened today

PRIME: Process Reinforcement through Implicit Rewards

o3 solves AIME, GPQA, Codeforces, makes 11 years of progress in ARC-AGI and 25% in FrontierMath

ModernBert: small new Retriever/Classifier workhorse, 8k context, 2T tokens

o1 API, 4o/4o-mini in Realtime API + WebRTC, DPO Finetuning

Meta Llama 3.3: 405B/Nova Pro performance at 70B price

$200 ChatGPT Pro and o1-full/pro, with vision, without API, and mixed reviews

not much happened today

Olympus has dropped (aka, Amazon Nova Micro|Lite|Pro|Premier|Canvas|Reel)

DeepSeek-R1 claims to beat o1-preview AND will be open sourced

Perplexity starts Shopping for you

BitNet was a lie?

not much happened today

Llama 3.2: On-device 1B/3B, and Multimodal 11B/90B (with AI2 Molmo kicker)

ChatGPT Advanced Voice Mode

a calm before the storm

o1 destroys Lmsys Arena, Qwen 2.5, Kyutai Moshi release

Learnings from o1 AMA

o1: OpenAI's new general reasoning models

Pixtral 12B: Mistral beats Llama to Multimodality

not much happened today + AINews Podcast?

not much happened this weekend

super quiet day

Ideogram 2 + Berkeley Function Calling Leaderboard V2

not much happened today

not much happened today

not much happened today

Grok 2! and ChatGPT-4o-latest confuses everybody

a quiet weekend

GPT4o August + 100% Structured Outputs for All (GPT4o August edition)

Llama 3.1 Leaks: big bumps to 8B, minor bumps to 70b, and SOTA OSS 405b model

DataComp-LM: the best open-data 7B model/benchmark/dataset

Mini, Nemo, Turbo, Lite - Smol models go brrr (GPT4o version)

SciCode: HumanEval gets a STEM PhD upgrade

Microsoft AgentInstruct + Orca 3

Problems with MMLU-Pro

RouteLLM: RIP Martian? (Plus: AINews Structured Summaries update)

Claude Crushes Code - 92% HumanEval and Claude.ai Artifacts

Gemini launches context caching... or does it?

Hybrid SSM/Transformers > Pure SSMs/Pure Transformers

Not much happened today

1 TRILLION token context, real time, on device?

Ten Commandments for Deploying Fine-Tuned Models

Chameleon: Meta's (unreleased) GPT4o-like Omnimodal Model

Cursor reaches >1000 tok/s finetuning Llama3-70b for fast file editing

Google I/O in 60 seconds

GPT-4o: the new SOTA-EVERYTHING Frontier model (GPT4O version)

Evals: The Next Generation

Not much happened today

Perplexity, the newest AI unicorn

Llama-3-70b is GPT-4-level Open Model

Mixtral 8x22B Instruct sparks efficiency memes

Lilian Weng on Video Diffusion

Mergestral, Meta MTIAv2, Cohere Rerank 3, Google Infini-Attention

Claude 3 is officially America's Next Top Model

Dia de las Secuelas (StarCoder, The Stack, Dune, SemiAnalysis)

Miqu confirmed to be an early Mistral-medium checkpoint

CodeLlama 70B beats GPT4 on HumanEval

codellama miqu mistral-medium llama-2-70b aphrodite-engine mixtral flatdolphinmaid noromaid rpcal chatml mistral-7b activation-beacon eagle-7b rwkv-v5 openhermes2.5 nous-hermes-2-mixtral-8x7b-dpo imp-v1-3b bakllava moondream qwen-vl meta-ai-fair ollama nous-research mistral-ai hugging-face ai-ethics alignment gpu-optimization direct-prompt-optimization fine-tuning cuda-programming optimizer-technology quantization multimodality context-length dense-retrieval retrieval-augmented-generation multilinguality model-performance open-source code-generation classification vision

Meta AI surprised the community with the release of CodeLlama, an open-source model now available on platforms like Ollama and MLX for local use. The Miqu model sparked debate over its origins, possibly linked to Mistral Medium or a fine-tuned Llama-2-70b, alongside discussions on AI ethics and alignment risks. The Aphrodite engine showed strong performance on A6000 GPUs with specific configurations. Role-playing models such as Mixtral and Flatdolphinmaid struggled with repetitiveness, while Noromaid and Rpcal performed better, with ChatML and DPO recommended for improved responses. Learning resources like fast.ai's course were highlighted for ML/DL beginners, and fine-tuning with optimizers such as paged 8-bit Lion and Adafactor was discussed. At Nous Research AI, the Activation Beacon project introduced a method for unlimited context length in LLMs using "global state" tokens, potentially transforming retrieval-augmented models. The Eagle-7B model, based on RWKV-v5, outperformed Mistral in benchmarks while offering efficiency and multilingual capabilities. OpenHermes2.5 was recommended for consumer hardware due to its quantization methods. Multimodal and domain-specific models like IMP v1-3b, Bakllava, Moondream, and Qwen-vl were explored for classification and vision-language tasks. The community emphasized centralizing AI resources for collaborative research.

RWKV "Eagle" v5: Your move, Mamba

1/17/2024: Help crowdsource function calling datasets

1/11/2024: Mixing Experts vs Merging Models

12/29/2023: TinyLlama on the way

12/27/2023: NYT vs OpenAI

12/26/2023: not much happened today

12/18/2023: Gaslighting Mistral for fun and profit

12/13/2023: SOLAR10.7B upstages Mistral7B?

12/9/2023: The Mixtral Rush

12/7/2023: Anthropic says "skill issue"