All tags
Company: "metr_evals"
AI Engineer Code Summit
gemini-3-pro-image gemini-3 gpt-5 claude-3.7-sonnet google-deepmind togethercompute image-generation fine-tuning benchmarking agentic-ai physics model-performance instruction-following model-comparison time-horizon user-preference demishassabis omarsar0 lintool hrishioa teknium artificialanlys minyangtian1 ofirpress metr_evals scaling01
The recent AIE Code Summit showcased key developments including Google DeepMind's Gemini 3 Pro Image model, Nano Banana Pro, which features enhanced text rendering, 4K visuals, and fine-grained editing capabilities. Community feedback highlights its strong performance in design and visualization tasks, with high user preference scores. Benchmarking updates reveal the new CritPt physics frontier benchmark where Gemini 3 Pro outperforms GPT-5, though AI still lags on complex unseen research problems. Agentic task evaluations show varied time horizons and performance gaps between open-weight and closed frontier models, emphasizing ongoing challenges in AI research and deployment. "Instruction following remains jagged for some users," and model fit varies by use case, with Gemini 3 excelling in UI and code tasks but showing regressions in transcription and writing fidelity.
not much happened today
chatgpt-4o deepseek-r1 o3 o3-mini gemini-2-flash qwen-2.5 qwen-0.5b hugging-face openai perplexity-ai deepseek-ai gemini qwen metr_evals reasoning benchmarking model-performance prompt-engineering model-optimization model-deployment small-language-models mobile-ai ai-agents speed-optimization _akhaliq aravsrinivas lmarena_ai omarsar0 risingsayak
Smolagents library by Huggingface continues trending. ChatGPT-4o latest version
chatgpt-40-latest-20250129 released. DeepSeek R1 671B sets speed record at 198 t/s, fastest reasoning model, recommended with specific prompt settings. Perplexity Deep Research outperforms models like Gemini Thinking, o3-mini, and DeepSeek-R1 on Humanity's Last Exam benchmark with 21.1% score and 93.9% accuracy on SimpleQA. ChatGPT-4o ranks #1 on Arena leaderboard in multiple categories except math. OpenAI's o3 model powers Deep Research tool for ChatGPT Pro users. Gemini 2 Flash and Qwen 2.5 models support LLMGrading verifier. Qwen 2.5 models added to PocketPal app. MLX shows small LLMs like Qwen 0.5B generate tokens at high speed on M4 Max and iPhone 16 Pro. Gemini Flash 2.0 leads new AI agent leaderboard. DeepSeek R1 is most liked on Hugging Face with over 10 million downloads.