Topic: "human-evaluation"
Talaria: Apple's new MLOps Superweapon
gemma mixtral phi dbrx apple google mistral-ai microsoft mosaic quantization on-device-ai adapter-models model-optimization model-latency lossless-quantization low-bit-palettization token-generation model-benchmarking human-evaluation craig-federighi andrej-karpathy
Apple Intelligence introduces a small (~3B-parameter) on-device model and a larger server model running on Apple Silicon with Private Cloud Compute, aiming to surpass Google Gemma, Mistral Mixtral, Microsoft Phi, and Mosaic DBRX. The on-device model features a novel lossless quantization strategy that mixes 2-bit and 4-bit palettization, averaging 3.5 bits-per-weight, with LoRA adapters recovering accuracy; adapters can be hot-swapped dynamically for efficient memory management. Apple credits the Talaria tool for optimizing quantization and model latency, reporting about 0.6 ms per prompt token of time-to-first-token latency and a 30 tokens-per-second generation rate on iPhone 15 Pro. Apple pursues an "adapter for everything" strategy, with initial deployment targeting SiriKit and App Intents. Performance benchmarks rely on human graders, emphasizing consumer-level adequacy over academic dominance. The Apple ML blog also mentions a code-focused model for Xcode and a diffusion model for Genmoji.
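
As back-of-the-envelope arithmetic (not Apple's published recipe), a 3.5 bits-per-weight average falls out of any 2-bit/4-bit mix where a quarter of the weights are 2-bit; the split and the memory estimate below are illustrative assumptions:

    # Illustrative sketch (assumed 25/75 split; Apple has not published
    # the exact per-layer mix) of how 2-bit and 4-bit quantization can
    # average out to 3.5 bits-per-weight.

    def average_bits_per_weight(mix: dict[int, float]) -> float:
        """mix maps bit-width -> fraction of weights stored at that width."""
        assert abs(sum(mix.values()) - 1.0) < 1e-9
        return sum(bits * frac for bits, frac in mix.items())

    print(average_bits_per_weight({2: 0.25, 4: 0.75}))  # 3.5

    # Weights-only memory for a ~3B-parameter model at 3.5 bpw:
    params = 3e9
    print(f"{params * 3.5 / 8 / 2**30:.2f} GiB")  # ~1.22 GiB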
Clémentine Fourrier on LLM evals
claude-3-opus huggingface meta-ai-fair llm-evaluation automated-benchmarking human-evaluation model-bias data-contamination elo-ranking systematic-annotations preference-learning evaluation-metrics prompt-sensitivity clem_fourrier
Clémentine Fourrier from Hugging Face presented GAIA (a benchmark built with Meta) at ICLR and shared insights on LLM evaluation methods. The blog outlines three main evaluation approaches: automated benchmarking using sample inputs/outputs and metrics; human judges grading and ranking outputs via vibe-checks, arena-style battles, and systematic annotations; and models as judges, using generalist or specialist models with noted biases. Challenges include data contamination, subjectivity, and bias in scoring. These evaluations help prevent regressions, rank models, and track progress in the field.
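
A minimal sketch of the Elo-style updates behind arena-style human preference ranking (the K-factor and starting ratings are illustrative assumptions, not details from Fourrier's post):

    def expected_score(r_a: float, r_b: float) -> float:
        """Probability that model A beats model B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update_elo(r_a: float, r_b: float, a_wins: float, k: float = 32.0):
        """a_wins is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
        e_a = expected_score(r_a, r_b)
        return r_a + k * (a_wins - e_a), r_b + k * ((1 - a_wins) - (1 - e_a))

    ratings = {"model_a": 1000.0, "model_b": 1000.0}
    # One human vote preferring model_a's answer:
    ratings["model_a"], ratings["model_b"] = update_elo(
        ratings["model_a"], ratings["model_b"], a_wins=1.0
    )
    print(ratings)  # {'model_a': 1016.0, 'model_b': 984.0}

Aggregating many such pairwise votes yields a leaderboard ordering without requiring every model to be compared against every other.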
Stable Diffusion 3 — Rombach & Esser did it again!
stable-diffusion-3 claude-3 orca dolphincoder-starcoder2-15b stability-ai anthropic microsoft latitude perplexity-ai llamaindex tripo-ai diffusion-models multimodality benchmarking human-evaluation text-generation image-generation 3d-modeling fine-tuning roleplay coding dataset-release soumith-chintala bill-peebles swyx kevinafischer jeremyphoward akhaliq karinanguyen_ aravsrinivas
Over 2,500 new community members joined following Soumith Chintala's shoutout, highlighting growing interest in SOTA LLM-based summarization. The major highlight is the detailed paper release of Stable Diffusion 3 (SD3), showcasing advanced text-in-image control and complex prompt handling; the model outperforms other SOTA image generation models in human-evaluated benchmarks and is built on an enhanced Diffusion Transformer architecture called MMDiT. Meanwhile, Anthropic released the Claude 3 models, noted for human-like responses and emotional depth, scoring 79.88% on HumanEval but costing over twice as much as GPT-4. Microsoft launched new Orca-based models and datasets, and Latitude released DolphinCoder-StarCoder2-15b with strong coding capabilities. Perplexity AI's integration of image models and PolySpectra's LlamaIndex-powered 3D CAD generation were also highlighted. Notable quotes: "SD3's win rate beats all other SOTA image gen models (except perhaps Ideogram)" and "Claude 3 models are very good at generating d3 visualizations from text descriptions."
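
For context on what a human-evaluated "win rate" means here: it is the fraction of pairwise grader votes a model wins. A minimal sketch with made-up votes (excluding ties is one common convention, assumed here):

    # Hypothetical vote data; the real SD3 evaluation numbers were not
    # reproduced in this summary.
    from collections import Counter

    votes = ["sd3", "sd3", "other", "tie", "sd3"]  # one label per grader
    tally = Counter(votes)
    wins, losses = tally["sd3"], tally["other"]
    win_rate = wins / (wins + losses)  # ties excluded
    print(f"win rate: {win_rate:.0%}")  # 75%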