Model: "gpt-5"

nomos-1 axiomprover devstral-2-small deepseek-v3.2 claude-code cursor-2.2 claude-opus-4.5 gpt-5 claude-sonnet-4.5 gemini-3-pro llama qwen mistral gemma nousresearch thinkymachines mistral-ai deepseek anthropic cursor microsoft langchain-ai openai gemini intel vllm_project danielhanchen math formal-reasoning agentic-systems asynchronous-execution multi-agent-systems observability benchmarking quantization post-training-quantization training-speedup kernel-optimization inference-efficiency

NousResearch's Nomos 1 is a 30B open math model achieving a top Putnam score with only ~3B active parameters, enabling consumer Mac inference. AxiomProver also posts top Putnam results using ThinkyMachines' RL stack. Mistral's Devstral 2 Small outperforms DeepSeek v3.2 in 71% of preferences with better speed and cost. Anthropic's Claude Code introduces asynchronous agent execution. Cursor 2.2 adds deep agent primitives like Debug and Plan Modes. VS Code launches unified agent chat sessions improving multi-agent workflows. LangChain releases "Polly" for agent observability. The Stirrup harness leads OpenAI GDPval benchmarks with Claude Opus 4.5, GPT-5, and Gemini 3 Pro following. Advances in quantization include vLLM integrating Intel's AutoRound PTQ for efficient serving. Unsloth achieves up to 3× training speedups with new kernels across Llama, Qwen, Mistral, and Gemma models. "Compositional reasoning + specialized post-training under constrained active params can rival frontier closed models on formal math."

Nov 21

AI Engineer Code Summit

gemini-3-pro-image gemini-3 gpt-5 claude-3.7-sonnet google-deepmind togethercompute image-generation fine-tuning benchmarking agentic-ai physics model-performance instruction-following model-comparison time-horizon user-preference demishassabis omarsar0 lintool hrishioa teknium artificialanlys minyangtian1 ofirpress metr_evals scaling01

The recent AIE Code Summit showcased key developments including Google DeepMind's Gemini 3 Pro Image model, Nano Banana Pro, which features enhanced text rendering, 4K visuals, and fine-grained editing capabilities. Community feedback highlights its strong performance in design and visualization tasks, with high user preference scores. Benchmarking updates reveal the new CritPt physics frontier benchmark where Gemini 3 Pro outperforms GPT-5, though AI still lags on complex unseen research problems. Agentic task evaluations show varied time horizons and performance gaps between open-weight and closed frontier models, emphasizing ongoing challenges in AI research and deployment. "Instruction following remains jagged for some users," and model fit varies by use case, with Gemini 3 excelling in UI and code tasks but showing regressions in transcription and writing fidelity.

Nov 20

Nano Banana Pro (Gemini Image Pro) solves text-in-images, infographic generation, 2-4k resolution, and Google Search grounding

gemini-3-pro gpt-5 google openai hugging-face togethercompute lmsys image-generation text-rendering model-provenance scientific-research proof-assistance multimodal-integration api-access fine-tuning jeffdean kevinweil demishassabis

Google launched Gemini 3 Pro Image (Nano Banana Pro), a next-generation AI image generation and editing model with integrated Google Search grounding, multi-image composition, and fine-grained visual controls, offering pricing at $0.134 per 2K image and $0.24 per 4K image. It features improved text rendering with error rates dropping from 56% to 8% compared to its predecessor, and includes SynthID watermark checks for provenance. The model is available via Gemini App, API, LM Arena, Hugging Face Spaces, Together AI, and Flow. Meanwhile, OpenAI shared early experiments with GPT-5 accelerating scientific research, including proofs of previously unsolved problems in math, physics, biology, and materials science. "GPT-5 accelerated research tasks in math/physics/biology/materials; in 4, it helped find proofs of previously unsolved problems."
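As a rough illustration of the per-image prices quoted above, the short Python sketch below estimates the cost of a mixed batch. Only the two per-image prices come from the summary; the function name `batch_cost` and the batch sizes are hypothetical, and real billing may include charges not modeled here.

```python
# Per-image prices quoted in the summary above (USD).
PRICE_PER_IMAGE = {"2K": 0.134, "4K": 0.24}

def batch_cost(counts):
    """Return the total USD cost for a dict such as {"2K": 100, "4K": 10}."""
    return sum(PRICE_PER_IMAGE[res] * n for res, n in counts.items())

# Hypothetical batch: 100 images at 2K and 10 images at 4K.
total = batch_cost({"2K": 100, "4K": 10})
print(f"${total:.2f}")
```

An unknown resolution key raises a `KeyError` rather than silently costing zero, so a typo in the batch description is caught early.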

Nov 17

xAI Grok 4.1: #1 in Text Arena, #1 in EQ-bench, and better Creative Writing

grok-4.1 gpt-5.1 claude-4.1-opus grok-4 gpt-5 grok-4.1-thinking gpt-5-pro claude-4.5-haiku xai openai google-deepmind sakana-ai anthropic microsoft mufg khosla nea lux-capital iqt model-performance creative-writing hallucination evaluation-datasets ensemble-models weather-forecasting funding efficiency anti-hallucination arc-agi model-scaling yanndubs gregkamradt philschmid willccbb

xAI launched Grok 4.1, achieving a #1 rank on the LM Arena Text Leaderboard with an Elo score of 1483, showing improvements in creative writing and anti-hallucination. OpenAI's GPT-5.1 "Thinking" demonstrates efficiency gains with ~60% less "thinking" on easy queries and strong ARC-AGI performance. Google DeepMind released WeatherNext 2, an ensemble generative model that is 8× faster and more accurate for global weather forecasts, integrated into multiple Google products. Sakana AI raised ¥20B ($135M) in Series B funding at a $2.63B valuation to focus on efficient AI for resource-constrained enterprise applications in Japan. New evaluations highlight tradeoffs between hallucination and knowledge accuracy across models, including Claude 4.1 Opus and other Anthropic models.

Nov 11

not much happened today

gpt-5 qwen2.5-7b ernie-4.5-vl-28b-a3b-thinking gemini-2.5-pro llamacloud claude-code openai baidu databricks llamaindex togethercompute sakanaailabs reasoning-benchmarks reinforcement-learning fine-tuning multimodality document-intelligence retrieval-augmented-generation agentic-systems persona-simulation code-agents guardrails sakanaailabs micahgoldblum francoisfleuret matei_zaharia jerryjliu0 omarsar0 togethercompute imjaredz theo

GPT-5 leads Sudoku-Bench, solving 33% of puzzles, but the remaining 67% stay unsolved, highlighting challenges in meta-reasoning and spatial logic. New training methods like GRPO fine-tuning and "Thought Cloning" show limited success. Research on "looped LLMs" suggests pretrained models benefit from repeated computation for better performance. Baidu's ERNIE-4.5-VL-28B-A3B-Thinking offers lightweight multimodal reasoning with Apache 2.0 licensing, outperforming Gemini-2.5-Pro and GPT-5-High on document tasks. The Databricks ai_parse_document preview delivers cost-efficient document intelligence, outperforming GPT-5 and Claude. Pathwork AI uses LlamaCloud for underwriting automation. The Gemini File Search API enables agentic retrieval-augmented generation (RAG) with MCP server integration. Together AI and Collinear launch TraitMix for persona-driven agent simulations integrated with Together Evals. Reports highlight risks in long-running code agents, such as Claude Code reverting changes, emphasizing the need for guardrails. Community consensus favors using multiple code copilots, including Claude Code, Codex, and others.

Oct 17

The Karpathy-Dwarkesh Interview delays AGI timelines

claude-haiku-4.5 gpt-5 arch-router-1.5b anthropic openai huggingface langchain llamaindex google epoch-ai reasoning long-context sampling benchmarking data-quality agent-frameworks modular-workflows ide-extensions model-routing graph-first-agents real-world-grounding karpathy aakaran31 du_yilun giffmana omarsar0 jeremyphoward claude_code mikeyk alexalbert__ clementdelangue jerryjliu0

The recent AI news highlights the Karpathy interview as a major event, alongside significant discussions on reasoning improvements without reinforcement learning, with test-time sampling achieving GRPO-level performance. Critiques of context window marketing reveal effective limits near 64K tokens, with Claude Haiku 4.5 showing competitive reasoning speed. GPT-5 struggles with advanced math benchmarks, and data quality issues termed "Brain Rot" affect model reasoning and safety. In agent frameworks, Anthropic Skills enable modular coding workflows, OpenAI Codex IDE extensions enhance developer productivity, and HuggingChat Omni introduces meta-routing across 100+ open models using Arch-Router-1.5B. LangChain and LlamaIndex advance graph-first agent infrastructure, while Google Gemini integrates with Google Maps for real-world grounding.

Oct 15

Claude Haiku 4.5

claude-3.5-sonnet claude-3-haiku claude-3-haiku-4.5 gpt-5 gpt-4.1 gemma-2.5 gemma o3 anthropic google yale artificial-analysis shanghai-ai-lab model-performance fine-tuning reasoning agent-evaluation memory-optimization model-efficiency open-models cost-efficiency foundation-models agentic-workflows swyx sundarpichai osanseviero clementdelangue deredleritt3r azizishekoofeh vikhyatk mirrokni pdrmnvd akhaliq sayashk gne

Anthropic released Claude Haiku 4.5, a model that is over 2x faster and 3x cheaper than Claude Sonnet 4.5, improving iteration speed and user experience significantly. Pricing comparisons highlight Haiku 4.5's competitive cost against models like GPT-5 and GLM-4.6. Google and Yale introduced the open-weight Cell2Sentence-Scale 27B (Gemma) model, which generated a novel, experimentally validated cancer hypothesis, with open-sourced weights for community use. Early evaluations show GPT-5 and o3 models outperform GPT-4.1 in agentic reasoning tasks, balancing cost and performance. Agent evaluation challenges and memory-based learning advances were also discussed, with contributions from Shanghai AI Lab and others. "Haiku 4.5 materially improves iteration speed and UX," and "Cell2Sentence-Scale yielded validated cancer hypothesis" were key highlights.

Sep 17

not much happened today

gpt-5 gemini-2.5-deep-think anthropic openai google-deepmind apollo-evaluations github hugging-face weaviate reasoning reinforcement-learning alignment chain-of-thought model-evaluation agent-frameworks ide-integration natural-language-to-sql real-time-voice sama merettm woj_zaremba markchen90 esyudkowsky

Anthropic published an in-depth postmortem on their August-September reliability issues. OpenAI's GPTeam achieved a perfect 12/12 score at the ICPC 2025 World Finals, showcasing rapid progress in general-purpose reasoning and introducing controllable "thinking time" tiers for gpt-5 in ChatGPT. Google DeepMind's gemini-2.5-deep-think achieved gold-medal level at ICPC, solving 10/12 problems with advances in parallel thoughts, multi-step reasoning, and novel reinforcement learning techniques. OpenAI and Apollo Evaluations detected "scheming" behaviors in frontier models, emphasizing the need for chain-of-thought transparency and launching a $500K Kaggle challenge. GitHub launched an MCP server registry integrated with VS Code Insiders, with additional support from JetBrains and Hugging Face for open LLMs in Copilot Chat. Weaviate released a native Query Agent translating natural language to database operations with citations.

Sep 13

not much happened today

mobilellm-r1 qwen3-next-80b-a3b gpt-5 meta-ai-fair huggingface alibaba openai reasoning model-efficiency hybrid-attention long-context benchmarking agent-evaluation hallucination-detection model-calibration inference-complexity model-pricing _akhaliq tacocohen pkirgis sayashk

Meta released MobileLLM-R1, a sub-1B parameter reasoning model family on Hugging Face with strong small-model math accuracy, trained on 4.2T tokens. Alibaba introduced Qwen3-Next-80B-A3B with hybrid attention, 256k context window, and improved long-horizon memory, priced competitively on Alibaba Cloud. Meta AI FAIR fixed a benchmark bug in SWE-Bench affecting agent evaluation. LiveMCP-101 benchmark shows frontier models like GPT-5 underperform on complex tasks with common failure modes cataloged. OpenAI highlights hallucination issues due to benchmark incentives, proposing calibration improvements. Community demos and tooling updates continue to evolve.

Sep 09

not much happened today

gpt-5 kimi-k2-0905 glm-4.5 qwen3-asr opus-4.1 cognition founders-fund lux-capital 8vc neo vercel claude groq alibaba huggingface meta-ai-fair google theturingpost algoperf coding-agents agent-architecture open-source model-evaluation multilingual-models speech-recognition model-optimization kv-cache quantization algorithmic-benchmarking video-generation context-windows swyx tim_dettmers

Cognition raised $400M at a $10.2B valuation to advance AI coding agents, with swyx joining to support the "Decade of Agents" thesis. Vercel launched an OSS "vibe coding platform" using a tuned GPT-5 agent loop. Claude Code emphasizes minimalism in agent loops for reliability. Kimi K2-0905 achieved 94% on coding evals and improved agentic capabilities with doubled context length. Alibaba released Qwen3-ASR, a multilingual transcription model with <8% WER. Meta introduced Set Block Decoding for 3-5× faster decoding without architectural changes. Innovations in KV cache compression and quantization include AutoRound, QuTLASS v0.1.0, and AlgoPerf v0.6. Google's Veo 3 video generation API went GA with significant price cuts and vertical video support.

Sep 08

Cognition's $10b Series C; Smol AI updates

kimi-k2-0905 qwen3-asr gpt-5 cognition vercel meta-ai-fair alibaba groq huggingface coding-agents agent-development open-source model-evaluation multilingual-models inference-optimization kv-cache-compression quantization algorithmic-benchmarking context-length model-performance swyx

Cognition raised $400M at a $10.2B valuation to advance AI coding agents, with swyx joining the company. Vercel launched an OSS coding platform using a tuned GPT-5 agent loop. The Kimi K2-0905 model achieved top coding eval scores and improved agentic capabilities with doubled context length. Alibaba released Qwen3-ASR, a multilingual transcription model with robust noise handling. Meta introduced Set Block Decoding for 3-5× faster decoding without architectural changes. Innovations in KV cache compression and quantization were highlighted, including AutoRound in SGLang and QuTLASS v0.1.0 for Blackwell GPUs. Algorithmic benchmarking tools like AlgoPerf v0.6 were updated for efficiency.

Sep 02

Anthropic raises $13B at $183B Series F

claude-code gpt-5 grok-4 claude sonnet-4 glm-4.5 deepseek-r1 anthropic mistral-ai x-ai salesforce galileo openpipe zhipu thudm enterprise-connectors agent-benchmarking reinforcement-learning inference-optimization memory-optimization cuda multi-token-prediction speculative-decoding tensor-offload performance-optimization real-time-guardrails cost-optimization swyx emilygsands _philschmid _lewtun omarsar0 _avichawla corbtt

Anthropic achieved a $183B post-money valuation in Series F funding by September 2025, growing from about $1B run-rate in January to over $5B run-rate by August 2025. Their Claude Code product saw >10x usage growth in three months and reached $500M run-rate revenue, serving over 300,000 business customers with a nearly 7x increase in large accounts. Mistral AI launched Le Chat with 20+ MCP connectors integrating with major SaaS platforms and persistent memory features. Benchmarking updates highlight GPT-5 leading agent intelligence indices, with strong performances from xAI's Grok and Anthropic's Claude families. Reliability tooling and agent evaluation advances were shared by Galileo, OpenPipe, and others. Zhipu/THUDM open-sourced Slime v0.1.0, enhancing RL infrastructure behind GLM-4.5 with significant decoding speed improvements and advanced tensor offload techniques.

Sep 01

not much happened today

gpt-5 grok-code-fast-1 claude-sonnet glm-4.5 longcat-flash-chat fastvlm mobileclip2 internvl3.5 openai x-ai zhipu-ai meituan apple model-architecture moe adaptive-compute inference-speed model-training cost-efficiency coding developer-tools open-inference on-device-ai vision gdb martin_casado yanndubs elonmusk cline vikhyatk dzhng quixiai tim_dettmers casper_hansen_ reach_vb eliebakouch teortaxestex youjiacheng

OpenAI integrates GPT-5 into Xcode 26 with improved coding latency, though some UX trade-offs are noted. xAI's Grok Code Fast 1 gains momentum, surpassing Claude Sonnet in usage and earning praise for fast debugging. Zhipu's GLM-4.5 offers a cost-effective coding plan with strong performance against Claude Sonnet 4. Meituan releases LongCat-Flash-Chat, a 560B parameter MoE model with adaptive compute and detailed technical insights. Apple debuts on-device vision-language models FastVLM and MobileCLIP2 alongside InternVL3.5.

Aug 29

not much happened today

fastvlm mobileclip2 grok-code-fast-1 gpt-5 qwen-3-coder-30b-a3b apple hugging-face x-ai openai groq run-llama lmstudio vision model-quantization code-generation cli-workflows retrieval-augmentation embedding-models local-ai multimodality reach_vb xenovacom pcuenq awnihannun cline veggie_eric nickbaumann_ gdb benankdev loganmarkewich tom_doerr fastmcp ggerganov orionweller antoine_chaffin

Apple released three real-time vision-language models (FastVLM, MobileCLIP2) on Hugging Face with significant speed and size improvements, supporting WebGPU and Core ML. Their MLX framework now supports MXFP4 format, competing with NVFP4 for FP4 quantization. xAI launched grok-code-fast-1, outperforming Claude for code edits, while OpenAI integrated GPT-5 into Xcode 26 and released a new Responses API on Groq hardware. CLI-first agent workflows advanced with tools like SemTools, MLX local runner for Apple Silicon, and llama.vim recommending Qwen 3 Coder 30B A3B. Retrieval research highlights limitations of single-vector embeddings, promoting ColBERT-style late interaction.

Aug 25

not much happened today

grok-2 grok-2.5 vibevoice-1.5b motif-2.6b gpt-5 qwen-code xai-org microsoft motif-technology alibaba huggingface langchain-ai mixture-of-experts model-scaling model-architecture text-to-speech fine-tuning training-data optimization reinforcement-learning agentic-ai tool-use model-training model-release api software-development model-quantization elonmusk clementdelangue rasbt quanquangu akhaliq eliebakouch gdb ericmitchellai ivanfioravanti deanwball giffmana omarsar0 corbtt

xAI released open weights for Grok-2 and Grok-2.5 with a novel MoE residual architecture and μP scaling, sparking community excitement and licensing concerns. Microsoft open-sourced VibeVoice-1.5B, a multi-speaker long-form TTS model with streaming support and a 7B variant forthcoming. Motif Technology published a detailed report on Motif-2.6B, highlighting Differential Attention, PolyNorm, and extensive finetuning, trained on AMD MI250 GPUs. In coding tools, momentum builds around GPT-5-backed workflows, with developers favoring it over Claude Code. Alibaba released Qwen-Code v0.0.8 with deep VS Code integration and MCP CLI enhancements. The MCP ecosystem advances with LiveMCP-101 stress tests, the universal MCP server "Rube," and LangGraph Platform's rollout of revision queueing and ART integration for RL training of agents.

Aug 20

DeepSeek V3.1: 840B token continued pretrain, beating Claude 4 Sonnet at 11% of its cost

deepseek-v3.1 seed-oss-36b computerrl gemini-2.5-pro gpt-5 claude-code gpt-oss-120b gpt-oss-20b deepseek bytedance zhipu-ai github microsoft anthropic together-ai baseten huggingface token-efficiency coding agentic-benchmarks long-context reinforcement-learning developer-tools fine-tuning multinode-training model-release teortaxestex rasbt lukehoban burkeholland _catwu cline winglian

DeepSeek released DeepSeek V3.1, a quietly rolled-out open model with a 128K context window and improvements in token efficiency, coding, and agentic benchmarks. ByteDance launched the permissive Seed-OSS 36B model on Hugging Face, noted for long-context and reasoning capabilities. Zhipu AI introduced ComputerRL, a reinforcement learning framework for computer-use agents, achieving strong benchmark results. In developer tooling, GitHub Copilot expanded globally, Microsoft VS Code integrated Gemini 2.5 Pro and updated GPT-5 agent prompts, and Anthropic launched Claude Code seats with spend controls. Open-source fine-tuning advances include Together AI adding SFT for gpt-oss-120B/20B and Baseten enabling multinode 120B training with Truss CLI. The community noted mixed performance and ongoing post-training adjustments for DeepSeek V3.1.

Aug 15

not much happened today

gpt-5 gpt-5-high gpt-5-mini-high gpt-5-nano-high imagen-4 gemma-3-270m openai google lmsys model-releases model-performance prompt-engineering developer-tools image-generation model-optimization transformers tokenization model-scaling sama aidan_mclau kevinweil lmarena_ai edwinarbus gdb omarsar0 philschmid m4rkmc

OpenAI rolled out GPT-5 as the default in ChatGPT with new modes and a "warmer" personality, plus expanded message limits for Plus/Team users and Enterprise/Edu access. Performance rankings show gpt-5-high leading, with smaller variants also ranked, though critiques note some underperformance versus Chinese models and sensitivity to sycophancy. OpenAI enhanced developer tools with a "Quick eval" feature, coding tips, and an improved Playground. Google made Imagen 4 generally available with faster generation and higher resolution, and released the ultra-small Gemma 3 270M model with a large vocabulary and ecosystem support. Podcasts featured OpenAI leaders discussing GPT-5 systems, routing, and efficiency.

Aug 14

Western Open Models get Funding: Cohere $500m @ 6.8B, AI2 gets $152m NSF+NVIDIA grants

gpt-5 o3 command-a gemma-3-270m imagen-4 dinov3 openai perplexity-ai ai2 nvidia cohere meta-ai-fair google hugging-face ollama unsloth model-speed funding ai-infrastructure on-device-ai quantization embedding-models image-generation self-supervised-learning vision dense-prediction benchmarking instruction-following model-optimization model-release challenge joelle_pineau fchollet awnihannun _philschmid osanseviero

OpenAI's GPT-5 completed a Pokemon Red speedrun 3x faster than o3. Perplexity raised $200M at a $20B valuation. AI2 secured $75M in NSF grants and $77M from NVIDIA for AI infrastructure projects like Olmo and Molmo. Cohere raised $500M and hired Joelle Pineau from Meta AI FAIR, boosting models like Command A. Google released Gemma 3 270M, a tiny on-device LLM with INT4 QAT checkpoints and large embedding tables, and made Imagen 4 generally available with a fast version at $0.02/image. Meta AI FAIR introduced DINOv3, a family of self-supervised vision foundation models with high-resolution dense features and strong performance on benchmarks like COCO detection and ADE20K segmentation, under a permissive license. A $150,000 MiniMax AI Agent Challenge is ongoing with 200+ prizes, encouraging AI project builds by August 25.

Aug 13

not much happened today

gpt-5 gpt-oss-120b opus-4.1 sonnet-4 openai anthropic minimax context-windows model-routing model-hosting multi-tool-pipelines prompt-caching model-extraction model-pairing cost-efficiency model-optimization sama jeremyphoward jxmnop _catwu

OpenAI continues small updates to GPT-5, introducing "Auto/Fast/Thinking" modes with 196k token context, 3,000 messages/week, and dynamic routing to cheaper models for cost efficiency. The MiniMax AI Agent Challenge offers $150,000 in prizes for AI agent development by August 25. The community discusses GPT-OSS-120B base model extraction, hosting, and tooling improvements, including multi-tool pipelines and flex-attention. Anthropic announces model pairing in Claude Code with Opus 4.1 for planning and Sonnet 4 for execution, expanding context to 1M tokens and introducing prompt caching. Key figures include @sama, @jeremyphoward, @jxmnop, and @_catwu.

Aug 12

not much happened today

gpt-5 gpt-5-mini gpt-5-nano claude-sonnet-4 glm-4.5v genie-3 gemini-app qwen-image-distilled matrix-game-2.0 jan-v1 qwen3-4b-thinking openai anthropic zhipu-ai google-deepmind alibaba skywork jan-ai context-window multimodality reinforcement-learning agentic-tasks video-generation image-generation real-time-systems web-search model-accuracy developer-tools open-source-models long-context model-scaling

OpenAI released the GPT-5 series including GPT-5-mini and GPT-5-nano, with mixed user feedback on performance and API behavior. Anthropic extended the Claude Sonnet 4 context window to 1 million tokens, a 5x increase, enhancing large document processing. Zhipu AI launched the open-source multimodal GLM-4.5V model with improvements in RL scaling and agentic tasks. Google DeepMind showcased the video generation model Genie 3 and updated the Gemini App with new features like Deep Think and Gemini Live. Alibaba Qwen released Qwen-Image distilled, a distilled image model, and enhanced their Deep Research capabilities. Open source models like Skywork's Matrix-Game 2.0 and Jan.ai's Jan-v1 (built on Qwen3-4B-Thinking) were introduced, focusing on real-time world modeling and web search respectively. Developer tools such as Claude Code and Cursor were also highlighted.

Aug 11

OpenAI's IMO Gold model also wins IOI Gold

gpt-5 gpt-5-thinking gpt-5-mini gemini-2.5-pro claude opus-4.1 openai google-deepmind anthropic reinforcement-learning benchmarking model-performance prompt-engineering model-behavior competitive-programming user-experience model-naming model-selection hallucination-detection sama scaling01 yanndubs sherylhsu ahmed_el-kishky jerry_tworek noam_brown alex_wei amandaaskell ericmitchellai jon_durbin gdb jerryjliu0

OpenAI announced placing #6 among human coders at the IOI, reflecting rapid progress in competitive coding AI over the past two years. The GPT-5 launch faced significant user backlash over restrictive usage limits and removal of model selection control, leading to a reversal and increased limits of 3000 requests per week for Plus users. Confusion around GPT-5 naming and benchmarking was highlighted, with critiques of methodological issues in comparisons with models like Claude and Gemini. Performance reviews of GPT-5 are mixed, with claims of near-zero hallucinations by OpenAI staff but user reports of confidently stated hallucinations and steering difficulties. Benchmarks show GPT-5 mini performing well on document understanding, while the full GPT-5 is seen as expensive and middling. On the Chatbot Arena, Gemini 2.5 Pro holds a 67% winrate against GPT-5 Thinking. Prompting and model behavior remain key discussion points.

Aug 08

not much happened today

gpt-5 gpt-4o grok-4 claude-4-sonnet openai microsoft reasoning latency model-routing benchmarking reinforcement-learning hallucination-control creative-writing priority-processing api-traffic model-deprecation user-experience model-selection voice-mode documentation sama nickaturley elaineyale6 scaling01 mustafasuleyman kevinweil omarsar0 jeremyphoward juberti epochairesearch lechmazur gdb

OpenAI launched GPT-5 with a unified user experience removing manual model selection, causing initial routing and access issues for Plus users that are being addressed with fixes including restored model options and increased usage limits. GPT-5 introduces "Priority Processing" for lower latency at higher price tiers, achieving ~750ms median time-to-first-token in some cases. Microsoft reports full Copilot adoption of GPT-5, and API traffic doubled within 24 hours, peaking at 2 billion tokens per minute. Early benchmarks show GPT-5 leading in reasoning tasks like FrontierMath and LiveBench, with improvements in hallucination control and creative writing, though some models like Grok-4 and Claude-4 Sonnet Thinking outperform it in specific RL-heavy reasoning benchmarks. OpenAI also released extensive migration and feature guides but faced some rollout issues including a broken code sample and a problematic Voice Mode launch. "Unified GPT-5" ends model pickers, pushing developers away from manual model selection.

Aug 07

OpenAI rolls out GPT-5 and GPT-5 Thinking to >1B users worldwide; -mini and -nano help claim Pareto Frontier

gpt-5 gpt-5-mini gpt-5-nano claude-4.1-sonnet claude-4.1-opus openai cursor_ai jetbrains microsoft notion perplexity_ai factoryai model-architecture context-windows pricing-models coding long-context prompt-engineering model-benchmarking model-integration tool-use reasoning sama scaling01 jeffintime embirico mustafasuleyman cline lmarena_ai nrehiew_ ofirpress sauers_

OpenAI launched GPT-5, a unified system featuring a fast main model and a deeper thinking model with a real-time router, supporting up to 400K context length and aggressive pricing that reclaims the Pareto Frontier of Intelligence. The rollout includes variants like gpt-5-mini and gpt-5-nano with significant cost reductions, and integrations with products such as ChatGPT, Cursor AI, JetBrains AI Assistant, Microsoft Copilot, Notion AI, and Perplexity AI. Benchmarks show GPT-5 performing strongly in coding and long-context reasoning, roughly matching Claude 4.1 Sonnet/Opus on SWE-bench Verified. The launch was accompanied by a GPT-5 prompting cookbook and notable community discussions on pricing and performance.
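The "fast model plus thinking model behind a router" design described above can be pictured with a toy dispatcher. The Python sketch below is purely illustrative: the keyword hints, the length threshold, and the names `Route`, `HARD_HINTS`, and `route` are all assumptions made for this example, and nothing here reflects how OpenAI's actual router decides.

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str   # "fast" or "thinking" (labels chosen for this sketch)
    reason: str  # human-readable explanation of the decision

# Hypothetical hints that a prompt may need deeper reasoning.
HARD_HINTS = ("prove", "step by step", "debug", "think hard")

def route(prompt: str, long_threshold: int = 2000) -> Route:
    """Pick a model tier with a simple heuristic (assumption, not OpenAI's logic)."""
    text = prompt.lower()
    if any(hint in text for hint in HARD_HINTS):
        return Route("thinking", "prompt contains a hard-task hint")
    if len(prompt) > long_threshold:
        return Route("thinking", "prompt is long")
    return Route("fast", "default")

print(route("What is the capital of France?"))
print(route("Prove that the square root of 2 is irrational."))
```

A production router would presumably rely on learned signals rather than keywords; the point of the sketch is only the shape of the design, in which one entry point returns a tier and a reason that can be logged.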

Aug 01

Gemini 2.5 Deep Think finally ships

gemini-2.5-deep-think gpt-oss gpt-5 kimi-k2-turbo-preview qwen3-coder-flash glm-4.5 step-3 claude openai anthropic google-deepmind kimi-moonshot alibaba ollama zhipu-ai stepfun parallel-thinking model-releases moe attention-mechanisms multimodal-reasoning model-performance context-windows open-source-models model-leaks creative-ai coding reasoning model-optimization demishassabis philschmid scaling01 teortaxestex teknium1 lmarena_ai andrewyng

OpenAI is rumored to soon launch new GPT-OSS and GPT-5 models amid drama with Anthropic revoking access to Claude. Google DeepMind quietly launched Gemini 2.5 Deep Think, a model optimized for parallel thinking that achieved gold-medal level at the IMO and excels in reasoning, coding, and creative tasks. Leaks suggest OpenAI is developing a 120B MoE and a 20B model with advanced attention mechanisms. Chinese AI companies like Kimi Moonshot, Alibaba, and Zhipu AI are releasing faster and more capable open models such as kimi-k2-turbo-preview, Qwen3-Coder-Flash, and GLM-4.5, signaling strong momentum and potential to surpass the U.S. in AI development. "The final checkpoint was selected just 5 hours before the IMO problems were released," highlighting rapid development cycles.

Jul 31

Figma's $50+b IPO

horizon-alpha gpt-5 gemini-2.5-pro qwen3-coder qwen3-coder-flash-30b-a3b command-a-vision gpt-4.1 llama-4-maverick flux-1-krea-dev glm-4.5 voxtral openai openrouter alibaba unslothai cohere huggingface black-forest-labs diffusers ostrisai zhipu-ai together-ai mistral-ai reasoning svg-generation agentic-ai context-windows vision fine-tuning inference-time-training model-generalization open-models technical-reports scaling01 teortaxestex huybery nickfrosst aidangomez reach_vb zai_org corbtt jxmnop teknuim1

OpenAI's stealth model horizon-alpha on OpenRouter sparks speculation as a precursor to GPT-5, showing strong reasoning and SVG generation capabilities comparable to Gemini 2.5 Pro. Alibaba released the Qwen3-Coder family, including a fast Qwen3-Coder-Flash (30B-A3B) variant with agentic features and 1M context length support via UnslothAI. Cohere launched Command A Vision, a 111B parameter open-weights vision-language model outperforming GPT-4.1 and Llama 4 Maverick on enterprise benchmarks. Black Forest Labs introduced FLUX.1 Krea [dev], an open-weights photorealism model compatible with fine-tuning tools like diffusers and ostrisai. Zhipu AI unveiled GLM-4.5, a hybrid reasoning open model with agentic capabilities available on Together AI. Discussions highlight the rising importance of inference-time training and reasoning model generalization. Mistral AI released the technical report for Voxtral, continuing its open science efforts.

Jul 28

GLM-4.5: Deeper, Headier, & better than Kimi/Qwen/DeepSeek (SOTA China LLM?)

glm-4.5-355b-a32b glm-4.5-air-106b-a12b qwen3-coder claude-4-opus grok-4 o3 gpt-4.1 gpt-5 kimi-k2 claude-sonnet-4 z-ai alibaba huggingface openai reinforcement-learning token-efficiency model-optimization open-source-models agentic-ai coding model-training lupantech teortaxestex mervenoyann _lewtun scaling01 cline

Z.ai (Zhipu AI) released the GLM-4.5-355B-A32B and GLM-4.5-Air-106B-A12B open weights models, claiming state-of-the-art performance competitive with Claude 4 Opus, Grok 4, and OpenAI's o3. These models emphasize token efficiency and efficient reinforcement learning training validated by the Muon optimizer. Alibaba Qwen introduced Group Sequence Policy Optimization (GSPO), a new reinforcement learning algorithm powering the Qwen3 model suite, integrated into Hugging Face's TRL library. Speculation surrounds mystery models "summit" and "zenith" as potential GPT-5 variants based on GPT-4.1 architecture. Qwen3-Coder shows strong coding benchmark results, rivaling Claude Sonnet 4 and Kimi K2. The rise of powerful Chinese open-source models like GLM-4.5, Wan-2.2, and Qwen3 Coder contrasts with a slowdown from Western labs such as OpenAI.

Jul 25

not much happened today

gpt-5 gpt4-0314 qwen3-235b-thinking runway-aleph imagen-4-ultra smollm3 grok-4 openai alibaba runway hugging-face google anthropic pytorch lmarena reinforcement-learning reasoning video-generation image-generation model-optimization open-source model-performance inference-speed integration stability sama clementdelangue xikun_zhang_ teknnium1 chujiezheng

OpenAI has fully rolled out its ChatGPT agent to all Plus, Pro, and Team users and is building hype for the upcoming GPT-5, which reportedly outperforms Grok-4 and can build a cookie clicker game in two minutes. Alibaba's Qwen team released the open-source reasoning model Qwen3-235B-Thinking, achieving an 89% win rate over gpt4-0314 using a new RL algorithm called Group Sequence Policy Optimization (GSPO). Runway introduced Runway Aleph, a state-of-the-art in-context video model for editing and generating video content. Hugging Face highlights the growing momentum of open-source AI, especially from Chinese teams. Other updates include Kling's upgrades for image-to-video generation and Google's Imagen 4 Ultra being recognized as a top text-to-image model. Anthropic integrated Claude with Canva for branded visual designs but faces stability issues. The PyTorch team released optimized checkpoints for SmolLM3 to speed up inference.
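
The digest names GSPO but does not spell out its objective. The following is a minimal sketch, assuming the core idea is a sequence-level, length-normalized importance ratio combined with group-normalized rewards and PPO-style clipping. The function names, the use of the population standard deviation, and the `clip_eps` default are illustrative assumptions, not values taken from the Qwen team or from TRL.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses (assumed scheme)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_ratio(new_logps, old_logps):
    """Length-normalized sequence-level ratio: exp(mean(new - old) over tokens)."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


def gspo_objective(new_logps, old_logps, rewards, clip_eps=0.2):
    """Clipped surrogate averaged over the group.

    new_logps / old_logps: one list of per-token log-probs per response.
    clip_eps is a hypothetical value chosen only for illustration.
    """
    advantages = group_advantages(rewards)
    terms = []
    for new, old, adv in zip(new_logps, old_logps, advantages):
        s = sequence_ratio(new, old)
        clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (min) choice between the raw and clipped surrogate.
        terms.append(min(s * adv, clipped * adv))
    return sum(terms) / len(terms)
```

The point of the sketch is that clipping acts on one ratio per response rather than one ratio per token, so a whole sequence is either kept or clipped as a unit.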

Jul 14

not much happened today

kimi-k2 grok-4 gpt-5 gemini-2.5 gemini-embedding cognition windsurf moonshot-ai x-ai openai google stanfordnlp huggingface mixture-of-experts model-training model-performance fine-tuning benchmarking agentic-ai model-bugs embedding-models sama hardmaru jeremyphoward akhaliq teortaxestex yuchenj_uw demishassabis

Cognition is acquiring the remaining assets of Windsurf after a significant weekend deal. Moonshot AI released Kimi K2, an open-source, MIT-licensed agentic model with 1 trillion total / 32B active parameters using a Mixture-of-Experts architecture, trained on 15.5 trillion tokens with the MuonClip optimizer, showing top performance on benchmarks like EQ-Bench and Creative Writing. xAI launched Grok-4, ranking 5th on IQ Bench but with notable quirks, including a bug causing it to respond only with "Heavy" and a high frequency of Elon Musk mentions. Rumors about OpenAI delaying an open-source model release surfaced, with speculation about CEO Sam Altman's PR strategy and a possible GPT-5 launch in September. The Gemini 2.5 paper was released with 3,295 authors, and Google introduced its Gemini Embedding model, topping the MTEB leaderboard.
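
To make the "total / active" parameter split concrete, here is a small arithmetic sketch using only the Kimi K2 figures quoted above. The helper name `active_fraction` is hypothetical, and the calculation ignores details such as shared layers or embeddings that a real accounting would need.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of a Mixture-of-Experts model's parameters used per token."""
    if total_params <= 0 or not 0 < active_params <= total_params:
        raise ValueError("expected 0 < active_params <= total_params")
    return active_params / total_params


# Figures as reported in the summary above.
KIMI_K2_TOTAL = 1e12    # 1 trillion total parameters
KIMI_K2_ACTIVE = 32e9   # 32B active parameters per token

fraction = active_fraction(KIMI_K2_TOTAL, KIMI_K2_ACTIVE)
print(f"Kimi K2 activates about {fraction:.1%} of its parameters per token")
```

Per-token compute scales roughly with the active count, while memory footprint scales with the total, which is why both numbers are usually quoted together.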

Apr 05

not much happened today

o3 o4-mini gpt-5 sonnet-3.7 gemma-3 qwen-2.5-vl gemini-2.5-pro gemma-7b llama-3-1-405b openai deepseek anthropic google meta-ai-fair inference-scaling reward-modeling coding-models ocr model-preview rate-limiting model-pricing architectural-advantage benchmarking long-form-reasoning attention-mechanisms mixture-of-experts gpu-throughput sama akhaliq nearcyan fchollet reach_vb philschmid teortaxestex epochairesearch omarsar0

OpenAI announced that o3 and o4-mini models will be released soon, with GPT-5 expected in a few months, delayed for quality improvements and capacity planning. DeepSeek introduced Self-Principled Critique Tuning (SPCT) to enhance inference-time scalability for generalist reward models. Anthropic's Sonnet 3.7 remains a top coding model. Google's Gemma 3 is available on KerasHub, and Qwen 2.5 VL powers a new Apache 2.0 licensed OCR model. Gemini 2.5 Pro entered public preview with increased rate limits and pricing announced, becoming a preferred model for many tasks except image generation. Meta's architectural advantage and the FrontierMath benchmark challenge AI's long-form reasoning and worldview development. Research reveals LLMs focus attention on the first token as an "attention sink," preserving representation diversity, demonstrated in Gemma 7B and Llama 3.1 models. MegaScale-Infer offers efficient serving of large-scale Mixture-of-Experts models with up to 1.90x higher per-GPU throughput.

Feb 13

small news items

gpt-4.5 gpt-5 deepseek-r1-distilled-qwen-1.5b o1-preview modernbert-0.3b qwen-0.5b o3 openai ollama mistral perplexity cerebras alibaba groq bytedance math benchmarking fine-tuning model-performance reinforcement-learning model-architecture partnerships funding jeremyphoward arankomatsuzaki sama nrehiew_ danhendrycks akhaliq

OpenAI announced plans for GPT-4.5 (Orion) and GPT-5, with GPT-5 integrating the o3 model and offering unlimited chat access in the free tier. DeepSeek R1 Distilled Qwen 1.5B outperforms OpenAI's o1-preview on math benchmarks, while ModernBERT 0.3b surpasses Qwen 0.5b on MMLU without fine-tuning. Mistral and Perplexity adopt Cerebras hardware for 10x performance gains. OpenAI's o3 model won a gold medal at the 2024 International Olympiad in Informatics. Partnerships include Qwen with Groq. Significant RLHF activity is noted in Nigeria and the Global South, and ByteDance is expected to rise in AI prominence soon. "GPT5 is all you need."

Jan 18

not much happened today

deepseek-v3 llama-3-1-405b gpt-4o gpt-5 minimax-01 claude-3-haiku cosmos-nemotron-34b openai deep-learning-ai meta-ai-fair google-deepmind saama langchain nvidia mixture-of-experts coding math scaling visual-tokenizers diffusion-models inference-time-scaling retrieval-augmented-generation ai-export-restrictions security-vulnerabilities prompt-injection gpu-optimization fine-tuning personalized-medicine clinical-trials ai-agents persistent-memory akhaliq

DeepSeek-V3, a 671 billion parameter mixture-of-experts model, surpasses Llama 3.1 405B and GPT-4o in coding and math benchmarks. OpenAI announced the upcoming release of GPT-5 on April 27, 2023. MiniMax-01 Coder mode in ai-gradio enables building a chess game in one shot. Meta research highlights trade-offs in scaling visual tokenizers. Google DeepMind improves diffusion model quality via inference-time scaling. The RA-DIT method fine-tunes LLMs and retrievers for better RAG responses. The U.S. proposes a three-tier export restriction system on AI chips and models, excluding countries like China and Russia. Security vulnerabilities in AI chatbots involving CSRF and prompt injection were revealed. Concerns about superintelligence and weapons-grade AI models were expressed. ai-gradio updates include NVIDIA NIM compatibility and new models like cosmos-nemotron-34b. LangChain integrates with Claude-3-haiku for AI agents with persistent memory. Triton warp specialization optimizes GPU usage for matrix multiplication. Saama's fine-tuned Llama models, OpenBioLLM-8B and OpenBioLLM-70B, target personalized medicine and clinical trials.

May 28, 2024

Life after DPO (RewardBench)

gpt-3 gpt-4 gpt-5 gpt-6 llama-3-8b llama-3 claude-3 gemini x-ai openai mistral-ai anthropic cohere meta-ai-fair hugging-face nvidia reinforcement-learning-from-human-feedback direct-preference-optimization reward-models rewardbench language-model-history model-evaluation alignment-research preference-datasets personalization transformer-architecture nathan-lambert chris-manning elon-musk bindureddy rohanpaul_ai nearcyan

xAI raised $6 billion at a $24 billion valuation, positioning it among the most highly valued AI startups, with expectations to fund GPT-5 and GPT-6 class models. The RewardBench tool, developed by Nathan Lambert, evaluates reward models (RMs) for language models, showing Cohere's RMs outperforming open-source alternatives. The discussion highlights the evolution of language models from Claude Shannon's 1948 model to GPT-3 and beyond, emphasizing the role of RLHF (Reinforcement Learning from Human Feedback) and the newer DPO (Direct Preference Optimization) method. Notably, some reward models built on Llama 3 8B are currently outperforming GPT-4, Cohere, Gemini, and Claude on the RewardBench leaderboard, raising questions about reward hacking. Future alignment research directions include improving preference datasets, DPO techniques, and personalization in language models. The report also compares xAI's valuation with OpenAI, Mistral AI, and Anthropic, noting speculation about xAI's spending on Nvidia hardware.

May 07, 2024

Kolmogorov-Arnold Networks: MLP killers or just spicy MLPs?

gpt-5 gpt-4 dall-e-3 openai microsoft learnable-activations mlp function-approximation interpretability inductive-bias-injection b-splines model-rearrangement parameter-efficiency ai-generated-image-detection metadata-standards large-model-training max-tegmark ziming-liu bindureddy nptacek zacharynado rohanpaul_ai svpino

Ziming Liu, a grad student of Max Tegmark, published a paper on Kolmogorov-Arnold Networks (KANs), claiming they outperform MLPs in interpretability, inductive bias injection, function approximation accuracy, and scaling, and that they are 100x more parameter efficient, though 10x slower to train. KANs use learnable activation functions modeled by B-splines on edges rather than fixed activations on nodes. However, it was later shown that KANs can be mathematically rearranged back into MLPs with similar parameter counts, sparking debate on their interpretability and novelty. Meanwhile, on AI Twitter, there is speculation about a potential GPT-5 release with mixed impressions, OpenAI's adoption of the C2PA metadata standard for detecting AI-generated images with high accuracy for DALL-E 3, and Microsoft training a large 500B parameter model called MAI-1, potentially previewed at the Build conference, signaling increased competition with OpenAI. "OpenAI's safety testing for GPT-4.5 couldn't finish in time for Google I/O launch" was also noted.
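
The phrase "learnable activation functions on edges" can be illustrated with a toy layer. This is a simplified sketch, not the paper's implementation: it assumes degree-1 (piecewise-linear) B-splines on a uniform grid for brevity, and it omits any other terms the original architecture may include. The class and method names are hypothetical.

```python
import random


def hat_basis(x, grid):
    """Degree-1 B-spline (hat function) values of scalar x at each knot."""
    h = grid[1] - grid[0]  # uniform knot spacing
    return [max(0.0, 1.0 - abs(x - g) / h) for g in grid]


class KANLayer:
    """Toy KAN-style layer: every input->output edge has its own 1-D function."""

    def __init__(self, in_dim, out_dim, grid_size=8, lo=-1.0, hi=1.0, seed=0):
        rng = random.Random(seed)
        step = (hi - lo) / (grid_size - 1)
        self.grid = [lo + k * step for k in range(grid_size)]
        # coef[i][j][k]: k-th spline coefficient on the edge input i -> output j.
        self.coef = [[[rng.gauss(0.0, 0.1) for _ in range(grid_size)]
                      for _ in range(out_dim)] for _ in range(in_dim)]

    def edge(self, i, j, x):
        """phi_ij(x) = sum_k coef[i][j][k] * B_k(x): the learnable activation."""
        basis = hat_basis(x, self.grid)
        return sum(c * b for c, b in zip(self.coef[i][j], basis))

    def __call__(self, xs):
        """Nodes only sum incoming edge outputs; no fixed nonlinearity is applied."""
        out_dim = len(self.coef[0])
        return [sum(self.edge(i, j, x) for i, x in enumerate(xs))
                for j in range(out_dim)]


layer = KANLayer(in_dim=2, out_dim=3)
print(layer([0.3, -0.5]))  # three outputs, each a sum of two edge functions
```

In an MLP the trainable parameters are the scalar edge weights and the nonlinearity is fixed at the node; here the trainable parameters are the spline coefficients that define each edge's shape, and the node is a plain sum.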

May 02, 2024

Evals: The Next Generation

gpt-4 gpt-5 gpt-3.5 phi-3 mistral-7b llama-3 scale-ai mistral-ai reka-ai openai moderna sanctuary-ai microsoft mit meta-ai-fair benchmarking data-contamination multimodality fine-tuning ai-regulation ai-safety ai-weapons neural-networks model-architecture model-training model-performance robotics activation-functions long-context sam-altman jim-fan

Scale AI highlighted issues with data contamination in benchmarks like MMLU and GSM8K, proposing a new benchmark where Mistral overfits and Phi-3 performs well. Reka released the VibeEval benchmark for multimodal models addressing multiple choice benchmark limitations. Sam Altman of OpenAI discussed GPT-4 as "dumb" and hinted at GPT-5 with AI agents as a major breakthrough. Researchers jailbroke GPT-3.5 via fine-tuning. Global calls emerged to ban AI-powered weapons, with US officials urging human control over nuclear arms. Ukraine launched an AI consular avatar, while Moderna partnered with OpenAI for medical AI advancements. Sanctuary AI and Microsoft collaborate on AI for general-purpose robots. MIT introduced Kolmogorov-Arnold networks with improved neural network efficiency. Meta AI is training Llama 3 models with over 400 billion parameters, featuring multimodality and longer context.

Apr 15, 2024

Multi-modal, Multi-Aspect, Multi-Form-Factor AI

gpt-4 idefics-2-8b mistral-instruct apple-mlx gpt-5 reka-ai cohere google rewind apple mistral-ai microsoft paypal multimodality foundation-models embedding-models gpu-performance model-comparison enterprise-data open-source performance-optimization job-impact agi-criticism technical-report arthur-mensch dan-schulman chris-bishop

From April 12 to 15, Reka launched Reka Core, a new GPT-4-class multimodal foundation model, with a detailed technical report described as "full Shazeer." Cohere introduced Compass, a foundation embedding model for indexing and searching multi-aspect enterprise data like emails and invoices. The open-source IDEFICS 2-8B model continues the open reproduction of Google's Flamingo multimodal model. Rewind pivoted to a multi-platform app called Limitless, moving away from spyware. Reddit discussions highlighted Apple MLX outperforming Ollama and Mistral Instruct on M2 Ultra GPUs, GPU choices for LLMs and Stable Diffusion, and AI-human comparisons by Microsoft Research's Chris Bishop. Former PayPal CEO Dan Schulman predicted GPT-5 will drastically reduce job scopes by 80%. Mistral CEO Arthur Mensch criticized the obsession with AGI as "creating God."

Mar 20, 2024

World_sim.exe

gpt-4 gpt-4o grok-1 llama-cpp claude-3-opus claude-3 gpt-5 nvidia nous-research stability-ai hugging-face langchain anthropic openai multimodality foundation-models hardware-optimization model-quantization float4 float6 retrieval-augmented-generation text-to-video prompt-engineering long-form-rag gpu-optimization philosophy-of-ai agi-predictions jensen-huang yann-lecun sam-altman

NVIDIA announced Project GR00T, a foundation model for humanoid robot learning using multimodal instructions, built on their tech stack including Isaac Lab, OSMO, and Jetson Thor. They revealed the DGX Grace-Blackwell GB200 with over 1 exaflop compute, capable of training a 1.8T-parameter GPT-4 in 90 days on 2,000 Blackwell GPUs. Jensen Huang confirmed GPT-4 has 1.8 trillion parameters. The new GB200 GPU supports float4/6 precision with ~3 bits per parameter and achieves 40,000 TFLOPs on fp4 with 2x sparsity. Open source highlights include the release of Grok-1, a 314B parameter model, and Stability AI's SV3D, an open-source text-to-video generation solution. Nous Research collaborated on implementing Steering Vectors in llama.cpp. In Retrieval Augmented Generation (RAG), a new 5.5-hour tutorial builds a pipeline using open-source HF models, and LangChain released a video on query routing and announced integration with NVIDIA NIM for GPU-optimized LLM inference. Prominent opinions include Yann LeCun distinguishing language from other cognitive abilities, Sam Altman predicting AGI arrival in 6 years with a leap from GPT-4 to GPT-5 comparable to GPT-3 to GPT-4, and discussions on the philosophical status of LLMs like Claude. There is also advice against training models from scratch for most companies.

Jan 22, 2024

Sama says: GPT-5 soon

gpt-5 mixtral-7b gpt-3.5 gemini-pro gpt-4 llama-cpp openai codium thebloke amd hugging-face mixture-of-experts fine-tuning model-merging 8-bit-optimization gpu-acceleration performance-comparison command-line-ai vector-stores embeddings coding-capabilities sam-altman ilya-sutskever itamar andrej-karpathy

Sam Altman at Davos highlighted that his top priority is launching the new model, likely called GPT-5, while expressing uncertainty about Ilya Sutskever's employment status. Itamar from Codium introduced the concept of Flow Engineering with AlphaCodium, gaining attention from Andrej Karpathy. On the TheBloke Discord, engineers discussed a multi-specialty mixture-of-experts (MoE) model combining seven distinct 7 billion parameter models specialized in law, finance, and medicine. Debates on 8-bit fine-tuning and the use of bitsandbytes with GPU support were prominent. Discussions also covered model merging using tools like Mergekit and compatibility with Alpaca format. Interest in optimizing AI models on AMD hardware using the AOCL BLAS and LAPACK libraries with llama.cpp was noted. Users experimented with AI for command line tasks, and the Mixtral MoE model was refined to surpass larger models in coding ability. Comparisons among LLMs such as GPT-3.5, Mixtral, Gemini Pro, and GPT-4 focused on knowledge depth, problem-solving, and speed, especially for coding tasks.

Dec 16, 2023

12/16/2023: ByteDance suspended by OpenAI

claude-2.1 gpt-4-turbo gemini-1.5-pro gpt-5 gpt-4.5 gpt-4 openai google-deepmind anthropic hardware gpu api-costs coding model-comparison subscription-issues payment-processing feature-confidentiality ai-art-generation organizational-productivity model-speculation

The OpenAI Discord community discussed hardware options like Mac racks and the A6000 GPU, highlighting their value for AI workloads. They compared Claude 2.1 and GPT-4 Turbo on coding tasks, with GPT-4 Turbo outperforming Claude 2.1. The benefits of the Bard API for Gemini Pro were noted, including a free quota of 60 queries per minute. Users shared experiences with ChatGPT Plus membership issues and payment problems, and speculated about the upcoming GPT-5 and the rumored GPT-4.5. Discussions also covered the confidentiality of the Alpha feature, AI art generation policies, and improvements in organizational work features. The community expressed mixed feelings about GPT-4's performance and awaited future model updates.

Dec 07, 2023

12/7/2023: Anthropic says "skill issue"

claude-2.1 gpt-4 gpt-3.5 gemini-pro gemini-ultra gpt-4.5 chatgpt bingchat dall-e gpt-5 anthropic openai google prompt-engineering model-performance regulation language-model-performance image-generation audio-processing midi-sequence-analysis subscription-issues network-errors

Anthropic fixed a glitch in their Claude 2.1 model's needle-in-a-haystack test by adding a prompt. Discussions on OpenAI's Discord compared Google's Gemini Pro and Gemini Ultra models with OpenAI's GPT-4 and GPT-3.5, with some users finding GPT-4 superior in benchmarks. Rumors about a GPT-4.5 release circulated without official confirmation. Concerns were raised about "selective censorship" affecting language model performance. The EU's potential regulation of AI, including ChatGPT, was highlighted. Users reported issues with ChatGPT Plus message limits and subscription upgrades, and shared experiences with BingChat and DALL-E. The community discussed prompt engineering techniques and future applications like image generation and MIDI sequence analysis, expressing hopes for GPT-5.