All tags
Person: "polynoamial"
Execuhires Round 2: Scale-Meta, Lamini-AMD, and Instacart-OpenAI
o3-pro o3 o1-pro gpt-4o gpt-4.1 gpt-4.1-mini gpt-4.1-nano meta-ai-fair scale-ai lamini amd openai gemini google anthropic model-release benchmarking reasoning fine-tuning pricing model-performance direct-preference-optimization complex-problem-solving alexandr_wang sharon_zhou fidji_simo sama jack_rae markchen90 kevinweil gdb gregkamradt lechmazur wesrothmoney paul_cal imjaredz cto_junior johnowhitaker polynoamial scaling01
Meta hires Scale AI's Alexandr Wang to lead its new "Superintelligence" division following a $15 billion investment for a 49% stake in Scale. Lamini's Sharon Zhou joins AMD as VP of AI under Lisa Su, while Instacart's Fidji Simo becomes CEO of Apps at OpenAI under Sama. Meta offers over $10 million/year compensation packages to top researchers, successfully recruiting Jack Rae from Gemini. OpenAI releases o3-pro model to ChatGPT Pro users and API, outperforming o3 and setting new benchmarks like Extended NYT Connections and SnakeBench. Despite being slower than o1-pro, o3-pro excels in reasoning and complex problem-solving. OpenAI cuts o3 pricing by 80%, making it cheaper than GPT-4o and pressuring competitors like Google and Anthropic to lower prices. Users can now fine-tune the GPT-4.1 family using direct preference optimization (DPO) for subjective tasks.
Reasoning Price War 2: Mistral Magistral + o3's 80% price cut + o3-pro
o3 o3-pro gpt-4.1 claude-4-sonnet gemini-2.5-pro magistral-small magistral-medium mistral-small-3.1 openai anthropic google-deepmind mistral-ai perplexity-ai reasoning token-efficiency price-cut benchmarking open-source model-releases context-windows gpu-optimization swyx sama scaling01 polynoamial nrehiew_ kevinweil gdb flavioad stevenheidel aravsrinivas
OpenAI announced an 80% price cut for its o3 model, making it competitively priced with GPT-4.1 and rivaling Anthropic's Claude 4 Sonnet and Google's Gemini 2.5 Pro. Alongside, o3-pro was released as a more powerful and reliable variant, though early benchmarks showed mixed performance relative to cost. Mistral AI launched its Magistral reasoning models, including an open-source 24B parameter version optimized for efficient deployment on consumer GPUs. The price reduction and new model releases signal intensified competition in reasoning-focused large language models, with notable improvements in token efficiency and cost-effectiveness.
Gemini 2.5 Flash completes the total domination of the Pareto Frontier
gemini-2.5-flash o3 o4-mini google openai anthropic tool-use multimodality benchmarking reasoning reinforcement-learning open-source model-releases chain-of-thought coding-agent sama kevinweil markchen90 alexandr_wang polynoamial scaling01 aidan_mclau cwolferesearch
Gemini 2.5 Flash is introduced with a new "thinking budget" feature offering more control compared to Anthropic and OpenAI models, marking a significant update in the Gemini series. OpenAI launched o3 and o4-mini models, emphasizing advanced tool use capabilities and multimodal understanding, with o3 dominating several leaderboards but receiving mixed benchmark reviews. The importance of tool use in AI research and development is highlighted, with OpenAI Codex CLI announced as a lightweight open-source coding agent. The news reflects ongoing trends in AI model releases, benchmarking, and tool integration.
OpenAI o3, o4-mini, and Codex CLI
o3 o4-mini gemini-2.5-pro claude-3-sonnet chatgpt openai reinforcement-learning performance vision tool-use open-source coding-agents model-benchmarking multimodality scaling inference sama aidan_mclau markchen90 gdb aidan_clark_ kevinweil swyx polynoamial scaling01
OpenAI launched the o3 and o4-mini models, emphasizing improvements in reinforcement-learning scaling and overall efficiency, making o4-mini cheaper and better across prioritized metrics. These models showcase enhanced vision and tool use capabilities, though API access for these features is pending. The release includes Codex CLI, an open-source coding agent that integrates with these models to convert natural language into working code. Accessibility extends to ChatGPT Plus, Pro, and Team users, with o3 being notably more expensive than Gemini 2.5 Pro. Performance benchmarks highlight the intelligence gains from scaling inference, with comparisons against models like Sonnet and Gemini. The launch has been well received despite some less favorable evaluation results.
QwQ-32B claims to match DeepSeek R1-671B
qwen-2.5-plus qwq-32b deepseek-r1 gpt-4.5 gpt-3 davinci alibaba openai deepseek-ai reinforcement-learning math code-execution instruction-following alignment reasoning model-release model-benchmarking scaling performance inference-costs aidan_mclau sama scaling01 juberti polynoamial reach_vb
Alibaba Qwen released their QwQ-32B model, a 32 billion parameter reasoning model using a novel two-stage reinforcement learning approach: first scaling RL for math and coding tasks with accuracy verifiers and code execution servers, then applying RL for general capabilities like instruction following and alignment. Meanwhile, OpenAI rolled out GPT-4.5 to Plus users, with mixed feedback on coding performance and noted inference cost improvements. The QwQ model aims to compete with larger MoE models like DeepSeek-R1. "GPT-4.5 is unusable for coding" was a notable user critique, while others praised its reasoning improvements due to scaling pretraining.
GPT 4.1: The New OpenAI Workhorse
gpt-4.1 gpt-4.1-mini gpt-4.1-nano gpt-4o gemini-2.5-pro openai llama-index perplexity-ai google-deepmind coding instruction-following long-context benchmarks model-pricing model-integration model-deprecation sama kevinweil omarsar0 aidan_mclau danhendrycks polynoamial scaling01 aravsrinivas lmarena_ai
OpenAI released GPT-4.1, including GPT-4.1 mini and GPT-4.1 nano, highlighting improvements in coding, instruction following, and handling long contexts up to 1 million tokens. The model achieves a 54 score on SWE-bench verified and shows a 60% improvement over GPT-4o on internal benchmarks. Pricing for GPT-4.1 nano is notably low at $0.10/1M input and $0.40/1M output. GPT-4.5 Preview is being deprecated in favor of GPT-4.1. Integration support includes Llama Index with day 0 support. Some negative feedback was noted for GPT-4.1 nano. Additionally, Perplexity's Sonar API ties with Gemini-2.5 Pro for the top spot in the LM Search Arena leaderboard. New benchmarks like MRCR and GraphWalks were introduced alongside updated prompting guides and cookbooks.