Topic: "performance-evaluation"
OpenEvidence, the "ChatGPT for doctors," raises $250M at a $12B valuation, 12x from $1B last Feb
claude claude-3 claude-opus gpt-5.2 gemini-3-flash-high openevidence anthropic podium openai google gemini agentic-ai model-alignment performance-evaluation memory-optimization long-context benchmarking multi-agent-systems reinforcement-learning daniel_nadler amanda_askell eric_rea tom_loverro garry_tan omarsar0 brendanfoody deredleritt3r
OpenEvidence raised $250 million at a $12 billion valuation, a 12x increase from last year's $1 billion, with usage by 40% of U.S. physicians and over $100 million in annual revenue. Anthropic released a new Claude model constitution under CC0 1.0, framing it as a living document for alignment and training. Podium reported over $100 million ARR from 10,000+ AI agents, shifting from selling software to operating AI agents. Innovations in agent memory and reliability include the Agent Cognitive Compressor (ACC) and multi-agent scientific workflows via MCP-SIM. Agentic benchmarking continues to highlight long-horizon weaknesses, with Gemini 3 Flash High, GPT-5.2 High, and Claude Opus 4.5 High scoring only modestly on professional-services and legal-research benchmarks.
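As a rough illustration only (the ACC paper's actual method isn't described here), agent memory compression generally means folding older turns into a compact summary so long-horizon context stays bounded. The Python sketch below is generic: the `summarize` callback and `keep_recent` parameter are placeholders, not ACC's design.

```python
# Generic illustration of agent memory compression (not the actual ACC algorithm):
# keep the most recent turns verbatim and fold everything older into one summary
# entry so total context length stays roughly bounded over a long-horizon task.
from typing import Callable, List

def compress_memory(turns: List[str],
                    summarize: Callable[[List[str]], str],
                    keep_recent: int = 8) -> List[str]:
    """Fold all but the last `keep_recent` turns into a single summary entry."""
    if len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [f"[summary of {len(old)} earlier turns] {summarize(old)}"] + recent

# Placeholder summarizer; in practice this would be an LLM call.
memory = compress_memory([f"turn {i}" for i in range(20)],
                         summarize=lambda ts: " / ".join(t[:20] for t in ts))
print(len(memory))  # keep_recent + 1 entries
```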
MiniMax M2 (230B-A10B): 8% of Claude Sonnet's price, ~2x faster, new SOTA open model
minimax-m2 hailuo-ai huggingface baseten vllm modelscope openrouter cline sparse-moe model-benchmarking model-architecture instruction-following tool-use api-pricing model-deployment performance-evaluation full-attention qk-norm gqa rope reach_vb artificialanlys akhaliq eliebakouch grad62304977 yifan_zhang_ zpysky1125
MiniMax M2, an open-weight sparse MoE model by Hailuo AI, launches with roughly 200-230B total parameters and 10B active parameters, offering strong performance near frontier closed models and ranking #5 overall on the Artificial Analysis Intelligence Index v3.0. It supports coding and agent tasks, is licensed under MIT, and is available via API at competitive pricing. The architecture uses full attention, QK-Norm, GQA, partial RoPE, and sigmoid routing, with day-0 support in vLLM and deployment on platforms like Hugging Face and Baseten. Despite verbose outputs and the lack of a tech report, it marks a significant win for open models.
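To make the "sparse MoE with sigmoid routing" claim concrete, here is a toy PyTorch sketch of a sigmoid-routed expert layer in which only a top-k subset of experts runs per token; every dimension, expert count, and top-k value is a placeholder rather than MiniMax's actual configuration, and the full-attention / QK-Norm / GQA / partial-RoPE attention stack is omitted.

```python
import torch
import torch.nn as nn

class SigmoidRoutedMoE(nn.Module):
    """Toy sparse-MoE block with sigmoid routing (placeholder sizes, not M2's)."""

    def __init__(self, d_model=1024, d_ff=2816, num_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        gates = torch.sigmoid(self.router(x))          # per-expert scores in (0, 1)
        weights, idx = gates.topk(self.top_k, dim=-1)  # only top_k experts fire per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e               # tokens routed to expert e in this slot
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out  # per-token compute scales with top_k / num_experts of the expert stack

moe = SigmoidRoutedMoE()
print(moe(torch.randn(4, 1024)).shape)  # torch.Size([4, 1024])
```

The point of the sketch is only the parameter accounting: the router sees every expert, but each token multiplies through just top_k of them, which is how a ~230B-parameter model can run with only ~10B parameters active per token.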
ChatGPT responds to GlazeGate + LMArena responds to Cohere
qwen3-235b-a22b qwen3 qwen3-moe llama-4 openai cohere lm-arena deepmind x-ai meta-ai-fair alibaba vllm llamaindex model-releases model-benchmarking performance-evaluation open-source multilinguality model-integration fine-tuning model-optimization joannejang arankomatsuzaki karpathy sarahookr reach_vb
OpenAI faced backlash after a controversial ChatGPT update, leading to an official retraction admitting they "focused too much on short-term feedback." Researchers from Cohere published a paper criticizing LMArena for unfair practices favoring incumbents such as OpenAI, DeepMind, xAI, and Meta AI (FAIR). Alibaba released the Qwen3 family, featuring models up to a 235B-parameter MoE, supporting 119 languages and trained on 36 trillion tokens, with integration into vLLM and support in tools like llama.cpp. Meta announced the second round of Llama Impact Grants to promote open-source AI innovation. Discussions on AI Twitter highlighted concerns about leaderboard overfitting and fairness in model benchmarking, with notable commentary from karpathy and others.
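For readers who want to try the Qwen3 MoE through the vLLM integration mentioned above, a minimal offline-inference sketch follows. The Hugging Face model ID and tensor_parallel_size are assumptions (the 235B MoE needs a multi-GPU node); smaller Qwen3 checkpoints exercise the same code path.

```python
# Minimal sketch of running a Qwen3 MoE checkpoint with vLLM's offline API.
# "Qwen/Qwen3-235B-A22B" and tensor_parallel_size=8 are illustrative choices,
# not a recommended deployment configuration.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-235B-A22B", tensor_parallel_size=8)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain grouped-query attention in two sentences."], params)
print(outputs[0].outputs[0].text)
```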