Topic: "evaluation-metrics"
Clémentine Fourrier on LLM evals
claude-3-opus huggingface meta-ai-fair llm-evaluation automated-benchmarking human-evaluation model-bias data-contamination elo-ranking systematic-annotations preference-learning evaluation-metrics prompt-sensitivity clem_fourrier
Clémentine Fourrier from Hugging Face presented GAIA (a benchmark built with Meta) at ICLR and shared insights on LLM evaluation methods. The blog outlines three main evaluation approaches: automated benchmarking, which scores models against sample inputs/outputs using metrics; human judges, who grade or rank outputs through vibe-checks, arena-style comparisons (Elo ranking), or systematic annotations; and models as judges, using generalist or specialist models, each with noted biases. Challenges include data contamination, subjectivity, and bias in scoring. These evaluations help prevent regressions, rank models, and track progress in the field.
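As a concrete illustration of the automated-benchmarking approach, here is a minimal sketch (not from the blog) that scores a model against sample input/output pairs with an exact-match metric; the `exact_match_accuracy` function, the toy benchmark, and `dummy_model` are hypothetical placeholders.

```python
def normalize(text: str) -> str:
    """Lowercase and strip whitespace so trivially different answers still match."""
    return text.strip().lower()


def exact_match_accuracy(samples: list[dict], model_answer) -> float:
    """Fraction of samples where the model's answer equals the reference after normalization."""
    correct = 0
    for sample in samples:
        prediction = model_answer(sample["input"])
        if normalize(prediction) == normalize(sample["reference"]):
            correct += 1
    return correct / len(samples) if samples else 0.0


if __name__ == "__main__":
    # Tiny toy benchmark (hypothetical examples, for illustration only).
    benchmark = [
        {"input": "What is 2 + 2?", "reference": "4"},
        {"input": "What is the capital of France?", "reference": "Paris"},
    ]
    # Stand-in for a real model call.
    dummy_model = lambda prompt: "4" if "2 + 2" in prompt else "Paris"
    print(f"exact-match accuracy: {exact_match_accuracy(benchmark, dummy_model):.2f}")
```

In practice, harnesses swap in richer metrics (F1, BLEU, pass@k) and vary prompt formats, which is where the prompt-sensitivity and data-contamination issues noted above come into play.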