Company: "metr_evals"
not much happened today
gemini-3.1-pro gpt-5.2 opus-4.6 sonnet-4.6 claude-opus-4.6 google-deepmind anthropic context-arena artificial-analysis epoch-ai scaling01 retrieval benchmarking evaluation-methodology token-limits cost-efficiency instruction-following software-reasoning model-reliability dillonuzar artificialanlys yuchenj_uw theo minimax_ai epochairesearch paul_cal metr_evals idavidrein xlr8harder htihle arena
Gemini 3.1 Pro demonstrates strong retrieval capabilities and cost efficiency compared to GPT-5.2 and Opus 4.6, though users report tooling and UI issues. The SWE-bench Verified evaluation methodology is under scrutiny for consistency, with updates bringing results closer to developer claims. Benchmarking debates continue over what frontier-model benchmarks actually measure, especially ARC-AGI-style puzzles. Claude Opus 4.6 shows a noisy but notable 14.5-hour time horizon on software tasks, with token limits causing practical failures. Sonnet 4.6 improves significantly on code and instruction-following benchmarks, but user backlash is growing over product regressions.
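For context on the time-horizon figure above: METR's 50% time horizon is, roughly, the human task length at which a model's success rate falls to 50%, typically estimated with a logistic fit on log task length. The sketch below illustrates that estimate on made-up run data; the numbers, the scikit-learn setup, and the resulting figure are illustrative assumptions, not METR's actual pipeline.

```python
# Illustrative sketch of a METR-style "50% time horizon" estimate:
# fit success vs. log2(task length) and find where p(success) = 0.5.
# All run data below is fabricated for the example.
import numpy as np
from sklearn.linear_model import LogisticRegression

# (human task length in minutes, did the model succeed?)
runs = [(2, 1), (8, 1), (30, 1), (120, 1),
        (240, 0), (480, 1), (960, 0), (1920, 0)]
X = np.log2([[minutes] for minutes, _ in runs])
y = np.array([success for _, success in runs])

clf = LogisticRegression().fit(X, y)
w, b = clf.coef_[0][0], clf.intercept_[0]

# p(success) = 0.5 where w * log2(minutes) + b = 0
horizon_minutes = 2 ** (-b / w)
print(f"Estimated 50% time horizon: {horizon_minutes / 60:.1f} hours")
```

With only a handful of runs per task length, fits like this move around a lot, which is why headline numbers such as the 14.5-hour horizon are best read as noisy estimates.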
AI Engineer Code Summit
gemini-3-pro-image gemini-3 gpt-5 claude-3.7-sonnet google-deepmind togethercompute image-generation fine-tuning benchmarking agentic-ai physics model-performance instruction-following model-comparison time-horizon user-preference demishassabis omarsar0 lintool hrishioa teknium artificialanlys minyangtian1 ofirpress metr_evals scaling01
The recent AIE Code Summit showcased key developments, including Google DeepMind's Gemini 3 Pro Image model (Nano Banana Pro), which features enhanced text rendering, 4K visuals, and fine-grained editing. Community feedback highlights strong performance on design and visualization tasks, with high user-preference scores. Benchmarking updates include the new CritPt physics frontier benchmark, where Gemini 3 Pro outperforms GPT-5, though AI still lags on complex, unseen research problems. Agentic task evaluations show varied time horizons and performance gaps between open-weight and closed frontier models, underscoring ongoing challenges in AI research and deployment. Instruction following remains jagged for some users, and model fit varies by use case, with Gemini 3 excelling at UI and code tasks but showing regressions in transcription and writing fidelity.
not much happened today
chatgpt-4o deepseek-r1 o3 o3-mini gemini-2-flash qwen-2.5 qwen-0.5b hugging-face openai perplexity-ai deepseek-ai gemini qwen metr_evals reasoning benchmarking model-performance prompt-engineering model-optimization model-deployment small-language-models mobile-ai ai-agents speed-optimization _akhaliq aravsrinivas lmarena_ai omarsar0 risingsayak
The smolagents library from Hugging Face continues trending. The latest ChatGPT-4o version, chatgpt-4o-latest-20250129, was released. DeepSeek R1 671B sets a speed record at 198 tokens/s, making it the fastest reasoning model, and is recommended to be run with specific prompt settings. Perplexity Deep Research outperforms models like Gemini Thinking, o3-mini, and DeepSeek-R1 on the Humanity's Last Exam benchmark with a 21.1% score, and reaches 93.9% accuracy on SimpleQA. ChatGPT-4o ranks #1 on the Arena leaderboard across multiple categories, though not math. OpenAI's o3 model powers the Deep Research tool for ChatGPT Pro users. Gemini 2.0 Flash and Qwen 2.5 models support the LLMGrading verifier. Qwen 2.5 models have been added to the PocketPal app. MLX shows that small LLMs like Qwen 0.5B generate tokens at high speed on the M4 Max and iPhone 16 Pro. Gemini 2.0 Flash leads a new AI agent leaderboard. DeepSeek R1 is the most-liked model on Hugging Face, with over 10 million downloads.
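On the MLX item above, here is a minimal sketch of running a small model locally with the mlx-lm package; the model repo id and generation settings are assumptions for illustration, not part of the original reports.

```python
# Minimal sketch: generating text with a small LLM on Apple silicon via mlx-lm.
# The repo id below is an assumption; any MLX-format model should work.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-0.5B-Instruct-4bit")

prompt = "Summarize what a reasoning model is in one sentence."
# verbose=True prints generation speed (tokens/s) alongside the output
text = generate(model, tokenizer, prompt=prompt, max_tokens=128, verbose=True)
print(text)
```

A 4-bit quantized repo is assumed here because on-device speed demos typically use quantized weights; swap in any other MLX-format model id as needed.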