Topic: "evaluation-metrics"
Not much happened today
mistral-small-3.2 magenta-realtime afm-4.5b llama-3 openthinker3-7b deepseek-r1-distill-qwen-7b storm qwen2-vl gpt-4o dino-v2 sakana-ai mistral-ai google arcee-ai deepseek-ai openai amazon gdm reinforcement-learning chain-of-thought fine-tuning function-calling quantization music-generation foundation-models reasoning text-video model-compression image-classification evaluation-metrics sama
Sakana AI released Reinforcement-Learned Teachers (RLTs), a technique in which smaller 7B-parameter models are trained via reinforcement learning to teach reasoning through step-by-step explanations, accelerating Chain-of-Thought learning. Mistral AI released Mistral Small 3.2, an update that improves instruction following and function calling and ships with experimental FP8 quantization. Google released Magenta RealTime, an 800M-parameter open-weights model for real-time music generation. Arcee AI launched AFM-4.5B, a sub-10B-parameter foundation model extended from Llama 3. OpenThinker3-7B was introduced as a new state-of-the-art 7B reasoning model, a 33% improvement over DeepSeek-R1-Distill-Qwen-7B. The STORM text-video model compresses video input 8x using Mamba layers and outperforms GPT-4o on MVBench with a score of 70.6%. Discussions of the PPO vs. GRPO reinforcement learning algorithms and of DINOv2's performance on ImageNet-1k were also highlighted. Overall, "a very quiet day" in AI news, with valuable workshops from OpenAI, Amazon, and GDM.
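For context on the PPO vs. GRPO discussion, here is a minimal sketch (my own illustration, not code from any of the linked threads; the function names are hypothetical) of the core difference between the two advantage estimates: PPO subtracts a learned value baseline, while GRPO normalizes each reward against the other completions sampled for the same prompt and so needs no critic network.

```python
# Sketch only: contrasts a PPO-style advantage (needs a value estimate from a
# critic) with a GRPO-style advantage (group-relative, critic-free).
from statistics import mean, stdev


def ppo_advantage(reward: float, value_estimate: float) -> float:
    """One-step PPO-style advantage: reward minus the critic's value estimate."""
    return reward - value_estimate


def grpo_advantages(group_rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: z-score each reward within its sampled group."""
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]


if __name__ == "__main__":
    # Four completions sampled for the same prompt, scored by a reward model.
    rewards = [0.2, 0.9, 0.4, 0.7]
    print(ppo_advantage(0.9, value_estimate=0.55))  # requires a trained critic
    print(grpo_advantages(rewards))                 # baseline comes from the group itself
```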
Clémentine Fourrier on LLM evals
claude-3-opus huggingface meta-ai-fair llm-evaluation automated-benchmarking human-evaluation model-bias data-contamination elo-ranking systematic-annotations preference-learning evaluation-metrics prompt-sensitivity clem_fourrier
Clémentine Fourrier from Hugging Face presented GAIA, a collaboration with Meta, at ICLR and shared insights on LLM evaluation methods. The blog outlines three main evaluation approaches: automated benchmarking using sample inputs/outputs and metrics; human judges, ranging from vibe-checks and arena-style rankings to systematic annotations; and models as judges, using generalist or specialist grader models with noted biases. Challenges include data contamination, subjectivity, and bias in scoring. These evaluations help prevent regressions, rank models, and track progress in the field.
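As a rough illustration of the three approaches (a toy sketch of my own, not code from the blog; all function names are hypothetical), the snippet below shows an exact-match metric for automated benchmarking, a standard Elo update for arena-style preference rankings, and a stub for a model-as-judge grader.

```python
# Sketch only: toy versions of the three LLM-evaluation styles described above.
from typing import Callable


# 1) Automated benchmarking: compare model outputs to references with a metric.
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match the reference exactly (a simple metric)."""
    assert len(predictions) == len(references)
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)


# 2) Human judges / arenas: aggregate pairwise preferences into an Elo-style ranking.
def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update after one pairwise comparison between models A and B."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# 3) Model as judge: ask a grader model to score an answer (judge_fn is a stand-in
#    for any text-completion call; it is not a real library API).
def model_as_judge(question: str, answer: str,
                   judge_fn: Callable[[str], str]) -> str:
    prompt = (f"Rate the following answer to the question on a 1-5 scale.\n"
              f"Q: {question}\nA: {answer}\nScore:")
    return judge_fn(prompt)


if __name__ == "__main__":
    print(exact_match_accuracy(["Paris", "4"], ["paris", "5"]))  # 0.5
    print(elo_update(1000.0, 1000.0, a_won=True))                # (1016.0, 984.0)
```

The biases the post flags (e.g. judge models favoring certain styles, annotator subjectivity, contaminated benchmark data) affect each of these layers differently, which is why the three approaches are typically used together rather than interchangeably.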