  Topic: "agent-evaluation"
 Claude Haiku 4.5 
   claude-3.5-sonnet  claude-3-haiku  claude-3-haiku-4.5  gpt-5  gpt-4.1  gemma-2.5  gemma  o3   anthropic  google  yale  artificial-analysis  shanghai-ai-lab   model-performance  fine-tuning  reasoning  agent-evaluation  memory-optimization  model-efficiency  open-models  cost-efficiency  foundation-models  agentic-workflows   swyx  sundarpichai  osanseviero  clementdelangue  deredleritt3r  azizishekoofeh  vikhyatk  mirrokni  pdrmnvd  akhaliq  sayashk  gne  
 Anthropic released Claude Haiku 4.5, a model that is over 2x faster and 3x cheaper than Claude Sonnet 4.5, which significantly improves iteration speed and user experience. Pricing comparisons show Haiku 4.5 is cost-competitive with models like GPT-5 and GLM-4.6. Google and Yale introduced Cell2Sentence-Scale 27B, an open-weight model built on Gemma that generated a novel, experimentally validated cancer hypothesis; the weights are available for community use. Early evaluations show GPT-5 and o3 outperforming GPT-4.1 on agentic reasoning tasks while balancing cost and performance. Agent evaluation challenges and advances in memory-based learning were also discussed, with contributions from Shanghai AI Lab and others. Key highlights: "Haiku 4.5 materially improves iteration speed and UX" and "Cell2Sentence-Scale yielded validated cancer hypothesis."
  not much happened today 
   mobilellm-r1  qwen3-next-80b-a3b  gpt-5   meta-ai-fair  huggingface  alibaba  openai   reasoning  model-efficiency  hybrid-attention  long-context  benchmarking  agent-evaluation  hallucination-detection  model-calibration  inference-complexity  model-pricing   _akhaliq  tacocohen  pkirgis  sayashk  
 Meta released MobileLLM-R1, a family of sub-1B parameter reasoning models on Hugging Face, trained on 4.2T tokens and showing strong math accuracy for its size. Alibaba introduced Qwen3-Next-80B-A3B with hybrid attention, a 256k context window, and improved long-horizon memory, priced competitively on Alibaba Cloud. Meta AI FAIR fixed a benchmark bug in SWE-Bench that affected agent evaluation. The LiveMCP-101 benchmark shows frontier models like GPT-5 underperforming on complex tasks and catalogs their common failure modes. OpenAI attributes hallucinations in part to benchmark incentives and proposes calibration improvements. Community demos and tooling updates also continued.
  not much happened today 
   claude-code  gemini  qwen3-coder  gemini-2.5-flash   exa  openpipe  coreweave  statsig  openai  zed  claude  gemini  langchain  anthropic  fair  alibaba  hud-evals   agent-protocols  interoperability  standardization  agent-evaluation  coding-agents  software-optimization  web-browsing  reinforcement-learning  multi-turn-reasoning  optimizer-design  data-efficient-rlvr  leaderboards  benchmarking   zeddotdev  mathemagic1an  hwchase17  giffmana  gneubig  crystalsssup  sayashk  _philschmid  _akhaliq  jaseweston  
 Exa raised a Series B at a $700M valuation, OpenPipe was acquired by CoreWeave, and Statsig and Alex were acquired by OpenAI. The Zed team introduced the Agent Client Protocol (ACP) to standardize IDE-agent interoperability, with support for the Claude Code and Gemini CLIs. LangChain 1.0 alpha unifies content blocks for reasoning and multimodal data. The OSWorld Verified leaderboard promotes reproducible evaluation of computer-use agents, including OpenAI and Anthropic models. FAIR revealed that coding agents were cheating on SWE-Bench Verified. PR Arena hosts live coding agent competitions. Benchmarks like GSO and the Holistic Agent Leaderboard test software optimization and web browsing tasks, with Qwen3-Coder and Gemini 2.5 Flash showing strong performance. Advances in reinforcement learning for tool use include SimpleTIR, which improves multi-turn tool-use success rates, and UI-TARS-2, which advances GUI agents. The DARLING optimizer improves quality and diversity in reasoning and instruction following, while DEPO achieves data-efficient RLVR with significant speedups.
  >$41B raised today (OpenAI @ 300b, Cursor @ 9.5b, Etched @ 1.5b) 
   deepseek-v3-0324  gemini-2.5-pro  claude-3.7-sonnet   openai  deepseek  gemini  cursor  etched  skypilot  agent-evals   open-models  model-releases  model-performance  coding  multimodality  model-deployment  cost-efficiency  agent-evaluation  privacy   kevinweil  sama  lmarena_ai  scaling01  iscienceluvr  stevenheidel  lepikhin  dzhng  raizamrtn  karpathy  
 OpenAI is preparing to release a highly capable open language model, its first since GPT-2, with a focus on reasoning and community feedback, as shared by @kevinweil and @sama. DeepSeek V3 0324 reached the #5 spot on the Arena leaderboard, making it the top open model there, with an MIT license and cost advantages. Gemini 2.5 Pro is noted for outperforming models like Claude 3.7 Sonnet on coding tasks, with pricing and further improvements expected soon. New startups like Sophont are building open multimodal foundation models for healthcare. Significant fundraises include Cursor closing $625M at a $9.6B valuation and Etched raising $85M at a $1.5B valuation. AI infrastructure news includes SkyPilot's cost-efficient cloud provisioning and the launch of AgentEvals, an open-source package for evaluating AI agents. Discussions of smartphone privacy highlight the iPhone's stronger user protections compared to Android.