Topic: "agent-evaluation"
GPT 5.1 in ChatGPT: No evals, but adaptive thinking and instruction following
gpt-5.1 gpt-5.0 claude isaac-0.1 qwen3vl-235b glm-4.6 gemini openai anthropic waymo perceptron langchain llamaindex nousresearch adaptive-reasoning instruction-following personalization autonomous-driving robotics multimodality agent-evaluation agent-governance middleware structured-extraction benchmarking dmitri_dolgov jeffdean fidji_simo akshats07
OpenAI launched GPT-5.1 with improvements in conversational tone, instruction following, and adaptive reasoning. GPT-5.0 is being sunset in 3 months. ChatGPT, which serves over 800 million users, introduces new tone toggles for personalization. Waymo rolls out freeway driving for public riders in major California cities, showcasing advances in autonomous driving. Anthropic's Project Fetch explores LLMs as robotics copilots using Claude. Perceptron releases a new API and Python SDK for multimodal perception-action apps supporting Isaac-0.1 and Qwen3VL-235B. Code Arena offers live coding evaluations supporting Claude, GPT-5, GLM-4.6, and Gemini. LangChain introduces middleware for agent governance with human-in-the-loop controls. LlamaIndex releases a structured extraction template for SEC filings using LlamaAgents. NousResearch promotes ARC Prize benchmarks for generalized intelligence evaluation.
Claude Haiku 4.5
claude-3.5-sonnet claude-3-haiku claude-3-haiku-4.5 gpt-5 gpt-4.1 gemma-2.5 gemma o3 anthropic google yale artificial-analysis shanghai-ai-lab model-performance fine-tuning reasoning agent-evaluation memory-optimization model-efficiency open-models cost-efficiency foundation-models agentic-workflows swyx sundarpichai osanseviero clementdelangue deredleritt3r azizishekoofeh vikhyatk mirrokni pdrmnvd akhaliq sayashk gne
Anthropic released Claude Haiku 4.5, a model that is over 2x faster and 3x cheaper than Claude Sonnet 4.5, improving iteration speed and user experience significantly. Pricing comparisons highlight Haiku 4.5's competitive cost against models like GPT-5 and GLM-4.6. Google and Yale introduced the open-weight Cell2Sentence-Scale 27B (Gemma) model, which generated a novel, experimentally validated cancer hypothesis, with open-sourced weights for community use. Early evaluations show GPT-5 and o3 models outperform GPT-4.1 in agentic reasoning tasks, balancing cost and performance. Agent evaluation challenges and memory-based learning advances were also discussed, with contributions from Shanghai AI Lab and others. "Haiku 4.5 materially improves iteration speed and UX," and "Cell2Sentence-Scale yielded validated cancer hypothesis" were key highlights.
not much happened today
mobilellm-r1 qwen3-next-80b-a3b gpt-5 meta-ai-fair huggingface alibaba openai reasoning model-efficiency hybrid-attention long-context benchmarking agent-evaluation hallucination-detection model-calibration inference-complexity model-pricing _akhaliq tacocohen pkirgis sayashk
Meta released MobileLLM-R1, a sub-1B parameter reasoning model family on Hugging Face with strong small-model math accuracy, trained on 4.2T tokens. Alibaba introduced Qwen3-Next-80B-A3B with hybrid attention, a 256k context window, and improved long-horizon memory, priced competitively on Alibaba Cloud. Meta AI FAIR fixed a benchmark bug in SWE-Bench affecting agent evaluation. The LiveMCP-101 benchmark shows that frontier models like GPT-5 underperform on complex tasks, and it catalogs their common failure modes. OpenAI highlights hallucination issues due to benchmark incentives, proposing calibration improvements. Community demos and tooling updates continue to evolve.
not much happened today
claude-code gemini qwen3-coder gemini-2.5-flash exa openpipe coreweave statsig openai zed claude gemini langchain anthropic fair alibaba hud-evals agent-protocols interoperability standardization agent-evaluation coding-agents software-optimization web-browsing reinforcement-learning multi-turn-reasoning optimizer-design data-efficient-rlvr leaderboards benchmarking zeddotdev mathemagic1an hwchase17 giffmana gneubig crystalsssup sayashk _philschmid _akhaliq jaseweston
Exa raised an $85M Series B at a $700M valuation, OpenPipe was acquired by CoreWeave, and Statsig and Alex were acquired by OpenAI. The Agent Client Protocol (ACP) was introduced by the Zed team to standardize IDE-agent interoperability, supporting the Claude Code and Gemini CLIs. LangChain 1.0 alpha unifies content blocks for reasoning and multimodal data. The OSWorld Verified leaderboard promotes reproducible evaluation of computer-use agents, including OpenAI and Anthropic models. FAIR revealed coding agent cheating on SWE-Bench Verified. PR Arena hosts live coding agent competitions. Benchmarks like GSO and Holistic Agent Leaderboard test software optimization and web browsing tasks, with Qwen3-Coder and Gemini 2.5 Flash showing strong performance. Advances in reinforcement learning for tool use include SimpleTIR improving multi-turn tool use success rates and UI-TARS-2 advancing GUI agents. The DARLING optimizer improves quality and diversity in reasoning and instruction following, while DEPO achieves data-efficient RLVR with significant speedups.
>$41B raised today (OpenAI @ 300b, Cursor @ 9.5b, Etched @ 1.5b)
deepseek-v3-0324 gemini-2.5-pro claude-3.7-sonnet openai deepseek gemini cursor etched skypilot agent-evals open-models model-releases model-performance coding multimodality model-deployment cost-efficiency agent-evaluation privacy kevinweil sama lmarena_ai scaling01 iscienceluvr stevenheidel lepikhin dzhng raizamrtn karpathy
OpenAI is preparing to release a highly capable open language model, their first since GPT-2, with a focus on reasoning and community feedback, as shared by @kevinweil and @sama. DeepSeek V3 0324 has achieved the #5 spot on the Arena leaderboard, becoming the top open model with an MIT license and cost advantages. Gemini 2.5 Pro is noted for outperforming models like Claude 3.7 Sonnet in coding tasks, with upcoming pricing and improvements expected soon. New startups like Sophont are building open multimodal foundation models for healthcare. Significant fundraises include Cursor closing $625M at a $9.6B valuation and Etched raising $85M at a $1.5B valuation. Innovations in AI infrastructure include SkyPilot's cost-efficient cloud provisioning and the launch of AgentEvals, an open-source package for evaluating AI agents. Discussions on smartphone privacy highlight the iPhone's stronger user protections compared to Android.