Person: "kevin-a-fischer"
Shazeer et al (2024): you are overpaying for inference >13x
claude-3.5-sonnet claude-3-opus character.ai anthropic memory-efficiency kv-cache attention-mechanisms stateful-caching int8-precision transformer-architecture scaling overfitting architecture noam-shazeer kevin-a-fischer sebastien-bubeck _aidan_clark_ andrej-karpathy
Noam Shazeer explains how Character.ai serves LLM inference at a volume equal to about 20% of Google Search traffic while reducing serving costs by a factor of 33 compared to late 2022; serving the same traffic through leading commercial APIs would cost at least 13.5X more. Key memory-efficiency techniques include multi-query attention (MQA) in place of grouped-query attention (GQA), which reduces KV cache size by 8X, along with hybrid attention horizons, cross-layer KV-sharing, stateful caching with a 95% cache hit rate, and native int8 precision with custom kernels. Anthropic released Claude 3.5 Sonnet, which outperforms Claude 3 Opus at twice the speed and one-fifth the cost, passes 64% of internal pull request tests, and introduces new features such as Artifacts for real-time doc and code generation. Discussions on LLM architecture highlight the dominance of transformers, the challenges of scaling and overfitting, and the importance of architecture work for progress.
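The 8X figure for MQA over GQA follows from simple cache arithmetic, which the sketch below tries to make concrete. The model shape (32 layers, head dimension 128, 4096 cached tokens) and the GQA baseline of 8 KV heads are illustrative assumptions, not numbers from the summary, and `kv_cache_bytes` is a hypothetical helper rather than anything from Character.ai's stack.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor of 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, chosen only to make the ratio visible.
LAYERS, HEAD_DIM, SEQ_LEN = 32, 128, 4096

# Assumed GQA baseline: 8 KV heads shared across the query heads, 16-bit values.
gqa = kv_cache_bytes(LAYERS, num_kv_heads=8, head_dim=HEAD_DIM, seq_len=SEQ_LEN)

# MQA: a single KV head shared by all query heads.
mqa = kv_cache_bytes(LAYERS, num_kv_heads=1, head_dim=HEAD_DIM, seq_len=SEQ_LEN)

# Assumption: int8 storage is also applied to the cache (1 byte per value).
mqa_int8 = kv_cache_bytes(LAYERS, num_kv_heads=1, head_dim=HEAD_DIM,
                          seq_len=SEQ_LEN, bytes_per_value=1)

print(f"GQA, 8 KV heads, 16-bit: {gqa / 2**20:.0f} MiB per sequence")
print(f"MQA, 1 KV head, 16-bit:  {mqa / 2**20:.0f} MiB per sequence")
print(f"MQA, 1 KV head, int8:    {mqa_int8 / 2**20:.0f} MiB per sequence")
print(f"GQA to MQA reduction:    {gqa // mqa}X")
```

Because cache size scales linearly with the number of KV heads, the reduction equals the baseline's KV head count; a GQA model with a different number of KV heads would see a different ratio.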