Topic: "multi-token-prediction"
DeepSeek v3: 671B fine-grained MoE trained for $5.5M USD of compute on 15T tokens
deepseek-v3 gpt-4o claude-3.5-sonnet llama-3 deepseek-ai hugging-face openai anthropic mixture-of-experts model-training model-optimization reinforcement-learning chain-of-thought multi-token-prediction synthetic-data model-distillation fine-tuning attention-mechanisms gpu-optimization nrehiew_ denny_zhou
DeepSeek-V3 has launched as a 671B-parameter MoE model trained on 14.8T tokens, outperforming GPT-4o and Claude-3.5-sonnet on benchmarks. Training took only 2.788M H800 GPU-hours, roughly an eleventh of Llama-3's 30.8M GPU-hours, a major gain in compute efficiency and cost. The model is open-source and available via Hugging Face, with API support. Innovations include native FP8 mixed-precision training, Multi-Head Latent Attention scaling, distillation from synthetic reasoning data, pruning and healing for MoEs with up to 256 experts, and a new multi-token prediction objective that enables lookahead token planning. Research highlights also cover the OREO method and Natural Language Reinforcement Learning (NLRL) for multi-step reasoning and agent control.
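As a rough sanity check on the headline numbers, the arithmetic below relates the quoted GPU-hours to the roughly $5.5M cost in the title. The GPU-hour figures come from the summary above. The $2 per H800 GPU-hour rental price is an assumption that the summary does not state, and the constant and function names are illustrative only.

```python
# Figures quoted in the summary above.
DEEPSEEK_V3_GPU_HOURS = 2.788e6  # H800 GPU-hours
LLAMA_3_GPU_HOURS = 30.8e6       # GPU-hours

# Assumption (not stated in the summary): rental price per H800 GPU-hour in USD.
ASSUMED_USD_PER_GPU_HOUR = 2.0


def training_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Estimate compute cost as GPU-hours times an hourly rental price."""
    return gpu_hours * usd_per_gpu_hour


cost = training_cost_usd(DEEPSEEK_V3_GPU_HOURS, ASSUMED_USD_PER_GPU_HOUR)
ratio = LLAMA_3_GPU_HOURS / DEEPSEEK_V3_GPU_HOURS

print(f"Estimated DeepSeek-V3 compute cost: ${cost / 1e6:.2f}M")
print(f"Llama-3 used about {ratio:.1f}x as many GPU-hours")
```

Under that assumed price, the estimate lands close to the title's figure. A different hourly rate would scale the cost linearly, while the GPU-hour ratio does not depend on price at all.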