
Topic: "on-device-models"

Mixture of Depths: Dynamically allocating compute in transformer-based language models

octopus-v2 deepmind transformer-efficiency dynamic-compute-allocation mixture-of-experts mixture-of-depths top-k-routing algorithmic-reasoning visual-autoregressive-modeling on-device-models function-calling scaling-laws piotrpadlewski

DeepMind introduces the Mixture-of-Depths (MoD) technique, which dynamically allocates FLOPs across transformer layers to optimize compute usage, achieving forward passes that are over 50% faster without hurting training performance. MoD uses top-k routing to select which tokens each layer processes, improving efficiency and potentially enabling faster handling of ultra-long contexts. The method can be combined with Mixture-of-Experts (MoE) for decoupled routing of queries, keys, and values. Reddit discussions highlight concerns about LLM hype overshadowing other AI technologies, improvements in transformer efficiency, a new Think-and-Execute framework that boosts algorithmic reasoning by 10-20%, and Visual Autoregressive modeling (VAR) surpassing diffusion models in image quality and speed. The on-device model Octopus v2 outperforms GPT-4 in function-calling accuracy and latency.
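To make the top-k routing idea concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not DeepMind's implementation: the names `route_top_k` and `mod_block`, the per-token `block` function, and the toy router scores are all hypothetical, and a real layer would run attention and an MLP over the selected tokens together rather than one token at a time. The sketch shows only the core mechanism: each layer has a fixed token capacity, the highest-scoring tokens go through the block, and the rest skip it via the residual path.

```python
from typing import Callable, List

Vector = List[float]


def route_top_k(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens, in sequence order."""
    k = max(0, min(k, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def mod_block(
    tokens: List[Vector],
    scores: List[float],
    capacity: int,
    block: Callable[[Vector], Vector],
) -> List[Vector]:
    """Apply `block` only to the top-`capacity` tokens (hypothetical helper).

    Selected tokens get x + score * block(x); all other tokens pass through
    unchanged, which is where the compute saving comes from.
    """
    chosen = set(route_top_k(scores, capacity))
    out: List[Vector] = []
    for i, x in enumerate(tokens):
        if i in chosen:
            update = block(x)
            # Scale the update by the router score so the router sits on the
            # computation path (assumed weighting for this sketch).
            out.append([xi + scores[i] * ui for xi, ui in zip(x, update)])
        else:
            out.append(list(x))  # residual path only: no block compute
    return out


# Toy example: 4 tokens, capacity of 2, and a stand-in "block" that doubles
# each feature. The scores are made up; a real router would compute them.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
scores = [0.25, 1.0, 0.5, 0.75]
result = mod_block(tokens, scores, capacity=2, block=lambda x: [2.0 * v for v in x])
print(result)  # [[1.0, 2.0], [9.0, 12.0], [5.0, 6.0], [17.5, 20.0]]
```

Because the capacity is fixed ahead of time, the amount of compute per layer is known in advance even though the identity of the processed tokens changes with the input. Lowering `capacity` trades block compute for more tokens taking the skip path.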

© 2026 • AINews

