Topic: "kv-cache"
Pixtral Large (124B) beats Llama 3.2 90B with updated Mistral Large 24.11
pixtral-large mistral-large-24.11 llama-3-2 qwen2.5-7b-instruct-abliterated-v2-gguf qwen2.5-32b-q3_k_m vllm llama-cpp exllamav2 tabbyapi mistral-ai sambanova nvidia multimodality vision model-updates chatbots inference gpu-optimization quantization performance concurrency kv-cache arthur-mensch
Mistral has released Pixtral Large, which pairs a 1B-parameter vision encoder with the 123B-parameter Mistral Large decoder, alongside the updated Mistral Large 24.11, though the update lacks major new features. Pixtral Large outperforms Llama 3.2 90B on multimodal benchmarks despite using a much smaller vision adapter. Mistral's Le Chat chatbot received a comprehensive feature update, which Arthur Mensch framed as the company balancing product work with research. SambaNova sponsors inference, pitching its RDUs as faster than GPUs for AI model processing. On Reddit, users report strong concurrency from vLLM on an RTX 3090, though FP8 KV-cache quantization there hurts quality, while llama.cpp with a Q8 KV cache fares better; the threads weigh the trade-offs between vLLM, exllamav2, and TabbyAPI across model sizes and batching strategies.
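A minimal sketch of the two setups the Reddit discussion compares (not from the issue itself; the model id and settings below are placeholders): vLLM with an FP8 KV cache for high-concurrency serving on a 24GB RTX 3090, versus llama.cpp with the KV cache quantized to Q8.

```python
# Hypothetical vLLM setup: an 8-bit KV cache roughly halves cache memory vs fp16,
# which is what lets more sequences run concurrently on a 24GB card, at the
# quality cost noted in the thread.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct",  # placeholder; the thread used quantized Qwen2.5 variants
    kv_cache_dtype="fp8",               # store K/V in 8-bit float instead of fp16
    gpu_memory_utilization=0.90,
    max_model_len=8192,
)

prompts = [f"Summarize document {i}." for i in range(32)]  # batch to exercise concurrency
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=256))

# The llama.cpp alternative with a Q8 KV cache is set via CLI flags, e.g.
#   llama-server -m model.gguf --cache-type-k q8_0 --cache-type-v q8_0
# (flag names per recent llama.cpp builds; check your version).
```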
Shazeer et al. (2024): you are overpaying for inference >13x
claude-3.5-sonnet claude-3-opus character.ai anthropic memory-efficiency kv-cache attention-mechanisms stateful-caching int8-precision transformer-architecture scaling overfitting architecture noam-shazeer kevin-a-fischer sebastien-bubeck _aidan_clark_ andrej-karpathy
Noam Shazeer explains how Character.ai serves LLM inference at roughly 20% of Google Search's query volume while reducing serving costs by a factor of 33 compared to late 2022, with leading commercial APIs costing at least 13.5x more. Key memory-efficiency techniques include MQA instead of GQA (an 8x smaller KV cache), hybrid attention horizons, cross-layer KV-sharing, stateful caching with a 95% cache hit rate, and native int8 precision with custom kernels. Anthropic released Claude 3.5 Sonnet, which outperforms Claude 3 Opus at twice the speed and one-fifth the cost, passes 64% of internal pull-request tests, and introduces new features like Artifacts for real-time doc and code generation. Discussions on LLM architecture highlight the dominance of transformers, challenges in scaling and overfitting, and the importance of architecture work for progress.
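As a back-of-the-envelope illustration of why the MQA choice matters, here is a rough KV-cache sizing sketch with made-up transformer shapes (not Character.ai's actual config): going from 8 KV heads (GQA) to one (MQA) shrinks the cache 8x, and int8 storage halves it again.

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_element.
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int, bytes_per_elem: int) -> int:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

cfg = dict(layers=48, head_dim=128, seq_len=4096)  # illustrative shapes only

gqa_fp16 = kv_cache_bytes(kv_heads=8, bytes_per_elem=2, **cfg)  # GQA, fp16 cache
mqa_fp16 = kv_cache_bytes(kv_heads=1, bytes_per_elem=2, **cfg)  # MQA, fp16 cache
mqa_int8 = kv_cache_bytes(kv_heads=1, bytes_per_elem=1, **cfg)  # MQA, int8 cache

print(f"GQA fp16: {gqa_fp16 / 2**20:.0f} MiB per sequence")
print(f"MQA fp16: {mqa_fp16 / 2**20:.0f} MiB ({gqa_fp16 / mqa_fp16:.0f}x smaller)")
print(f"MQA int8: {mqa_int8 / 2**20:.0f} MiB ({gqa_fp16 / mqa_int8:.0f}x smaller)")
```

Cross-layer KV-sharing and local attention horizons cut the same formula further by shrinking the effective layer count and sequence length terms.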