Topic: "model-latency"

Jun 11, 2024

gemma mixtral phi dbrx apple google mistral-ai microsoft mosaic quantization on-device-ai adapter-models model-optimization model-latency lossless-quantization low-bit-palletization token-generation model-benchmarking human-evaluation craig-federighi andrej-karpathy

Apple Intelligence introduces a small (~3B parameters) on-device model and a larger server model running on Apple Silicon with Private Cloud Compute, aiming to surpass Google Gemma, Mistral Mixtral, Microsoft Phi, and Databricks (Mosaic) DBRX. The on-device model is compressed with low-bit palletization in a mixed 2-bit and 4-bit configuration averaging 3.5 bits per weight, with LoRA adapters used to recover accuracy so that the result is described as lossless. Task-specific adapters can be hot-swapped dynamically, which keeps memory use manageable. Apple credits the Talaria tool for optimizing quantization and model latency, reporting a time-to-first-token latency of about 0.6 ms per prompt token and a generation rate of 30 tokens per second on iPhone 15 Pro. Apple focuses on an "adapter for everything" strategy with initial deployment on SiriKit and App Intents. Performance benchmarks rely on human graders, emphasizing consumer-level adequacy over academic dominance. The Apple ML blog also mentions a code-focused model for Xcode and a diffusion model for Genmoji.
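To make the figures above concrete, here is a rough back-of-the-envelope sketch in Python. It assumes exactly 3 billion weights, that every weight is stored at either 2 or 4 bits, and that the 0.6 ms figure scales linearly with prompt length; the function names are illustrative and do not come from Apple. It ignores embeddings, adapters, lookup tables, and runtime overhead, so treat the results as order-of-magnitude estimates only.

```python
def four_bit_fraction(avg_bits: float, low: int = 2, high: int = 4) -> float:
    """Share of weights stored at `high` bits so that the mix averages `avg_bits`."""
    return (avg_bits - low) / (high - low)


def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def response_time_s(prompt_tokens: int, output_tokens: int,
                    ms_per_prompt_token: float = 0.6,
                    tokens_per_s: float = 30.0) -> float:
    """Estimated wall-clock time: prompt processing plus token generation."""
    prefill = prompt_tokens * ms_per_prompt_token / 1000
    decode = output_tokens / tokens_per_s
    return prefill + decode


if __name__ == "__main__":
    # Assumed inputs: 3e9 weights, a 1000-token prompt, a 150-token reply.
    print(f"4-bit share of weights: {four_bit_fraction(3.5):.0%}")
    print(f"Weight memory: {weight_memory_gb(3e9, 3.5):.2f} GB")
    print(f"Estimated response time: {response_time_s(1000, 150):.1f} s")
```

Under these assumptions, a 3.5-bit average implies that roughly three quarters of the weights sit at 4 bits, and generation time rather than prompt processing dominates the latency of a typical reply.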

