Person: "yanndubs"

Dec 11, 2025

GPT-5.2 (Instant/Thinking/Pro): 74% on GDPVal, 1.4x cost of GPT-5.1, on 10 Year OpenAI Anniversary

gpt-5.2 openai scientific-reasoning knowledge-work long-context benchmarking performance-optimization pricing software-engineering vision sama yanndubs polynoamial scaling01

OpenAI celebrates its 10-year anniversary with the launch of GPT-5.2, which brings significant across-the-board improvements alongside a rare 40% price increase. GPT-5.2 shows strong gains in scientific reasoning, knowledge work, and economically valuable tasks, achieving 70.9% human expert parity on GDPval tasks and reaching 90.5% on ARC-AGI-1 with a large efficiency gain. Despite mixed results on coding benchmarks and vision capabilities, GPT-5.2 is well received as a major update with extended context and tiered reasoning controls. Pricing is set at $1.75/M input tokens and $14/M output tokens, with a 90% cache discount. The update is live in ChatGPT and the API, marking a significant milestone for OpenAI's LLM development.
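To make the pricing concrete, here is a minimal sketch of a per-request cost estimate using the figures quoted above ($1.75/M input, $14/M output, 90% cache discount). The function name `estimate_cost` and its parameters are hypothetical, and the sketch assumes the cache discount applies only to cached input tokens, which the summary does not spell out.

```python
# Prices quoted in the summary above, in USD per million tokens.
INPUT_PRICE_PER_M = 1.75
OUTPUT_PRICE_PER_M = 14.00
CACHE_DISCOUNT = 0.90  # assumption: applies to cached input tokens only


def estimate_cost(input_tokens: int, output_tokens: int,
                  cached_input_tokens: int = 0) -> float:
    """Estimate the USD cost of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")

    uncached_tokens = input_tokens - cached_input_tokens
    cached_price_per_m = INPUT_PRICE_PER_M * (1 - CACHE_DISCOUNT)

    total = (
        uncached_tokens * INPUT_PRICE_PER_M
        + cached_input_tokens * cached_price_per_m
        + output_tokens * OUTPUT_PRICE_PER_M
    )
    return total / 1_000_000


if __name__ == "__main__":
    # One million input and one million output tokens, nothing cached.
    print(f"${estimate_cost(1_000_000, 1_000_000):.3f}")
    # Same request with the entire input served from cache.
    print(f"${estimate_cost(1_000_000, 1_000_000, cached_input_tokens=1_000_000):.3f}")
```

Under these assumptions, output tokens dominate the bill, so caching the input only trims a small share of the cost for generation-heavy requests.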

Nov 17, 2025

xAI Grok 4.1: #1 in Text Arena, #1 in EQ-bench, and better Creative Writing

grok-4.1 gpt-5.1 claude-4.1-opus grok-4 gpt-5 grok-4.1-thinking gpt-5-pro claude-4.5-haiku xai openai google-deepmind sakana-ai anthropic microsoft mufg khosla nea lux-capital iqt model-performance creative-writing hallucination evaluation-datasets ensemble-models weather-forecasting funding efficiency anti-hallucination arc-agi model-scaling yanndubs gregkamradt philschmid willccbb

xAI launched Grok 4.1, achieving a #1 rank on the LM Arena Text Leaderboard with an Elo score of 1483 and showing improvements in creative writing and hallucination reduction. OpenAI's GPT-5.1 "Thinking" demonstrates efficiency gains, with ~60% less "thinking" on easy queries and strong ARC-AGI performance. Google DeepMind released WeatherNext 2, an ensemble generative model that is 8× faster and more accurate for global weather forecasts and is integrated into multiple Google products. Sakana AI raised ¥20B ($135M) in Series B funding at a $2.63B valuation to focus on efficient AI for resource-constrained enterprise applications in Japan. New evaluations highlight tradeoffs between hallucination and knowledge accuracy across models, including Claude 4.1 Opus and other Anthropic models.

Sep 01, 2025

not much happened today

gpt-5 grok-code-fast-1 claude-sonnet glm-4.5 longcat-flash-chat fastvlm mobileclip2 internvl3.5 openai x-ai zhipu-ai meituan apple model-architecture moe adaptive-compute inference-speed model-training cost-efficiency coding developer-tools open-inference on-device-ai vision gdb martin_casado yanndubs elonmusk cline vikhyatk dzhng quixiai tim_dettmers casper_hansen_ reach_vb eliebakouch teortaxestex youjiacheng

OpenAI integrates GPT-5 into Xcode 26 with improved coding latency, though some UX trade-offs are noted. xAI's Grok Code Fast 1 gains momentum, surpassing Claude Sonnet in usage and earning praise for fast debugging. Zhipu's GLM-4.5 offers a cost-effective coding plan with strong performance against Claude Sonnet 4. Meituan releases LongCat-Flash-Chat, a 560B-parameter MoE model with adaptive compute, along with detailed technical insights. Apple debuts the on-device vision-language models FastVLM and MobileCLIP2, and InternVL3.5 is also released.

Aug 11, 2025

OpenAI's IMO Gold model also wins IOI Gold

gpt-5 gpt-5-thinking gpt-5-mini gemini-2.5-pro claude opus-4.1 openai google-deepmind anthropic reinforcement-learning benchmarking model-performance prompt-engineering model-behavior competitive-programming user-experience model-naming model-selection hallucination-detection sama scaling01 yanndubs sherylhsu ahmed_el-kishky jerry_tworek noam_brown alex_wei amandaaskell ericmitchellai jon_durbin gdb jerryjliu0

OpenAI announced that its model placed #6 among human coders at the IOI, reflecting rapid progress in competitive coding AI over the past two years. The GPT-5 launch faced significant user backlash over restrictive usage limits and the removal of model selection control, leading to a reversal and an increase in limits to 3000 requests per week for Plus users. Confusion around GPT-5 naming and benchmarking was highlighted, with critiques of methodological issues in comparisons against models like Claude and Gemini. Performance reviews of GPT-5 are mixed: OpenAI staff claim near-zero hallucinations, but users report confidently stated hallucinations and difficulty steering the model. Benchmarks show GPT-5 mini performing well on document understanding, while the full GPT-5 is seen as expensive and middling. On the Chatbot Arena, Gemini 2.5 Pro holds a 67% winrate against GPT-5 Thinking. Prompting and model behavior remain key discussion points.