deepseek-r1 deepseek-v3 qwen-2.5 llama-3.1 llama-3.3-70b deepseek ollama qwen llama reinforcement-learning fine-tuning model-distillation model-optimization reasoning reward-models multi-response-sampling model-training
DeepSeek released DeepSeek R1, a significant upgrade over DeepSeek V3 from just three weeks prior. The release comprises 8 models: two full-size 671B MoE models (R1 and R1-Zero) and six smaller models distilled from R1 onto Qwen 2.5 and Llama 3.1/3.3 bases. The models are MIT licensed, which permits finetuning and distillation. Pricing is 27x to 50x cheaper than o1. Training used GRPO (Group Relative Policy Optimization), the RL framework introduced in DeepSeekMath, with rewards for correctness and style outcomes; it did not rely on process reward models (PRMs), Monte Carlo tree search (MCTS), or reward models, and instead improved reasoning through reinforcement learning alone. The distilled models can run on Ollama and show strong capabilities such as writing Manim code. Overall, the release highlights advances in reinforcement learning, fine-tuning, and model distillation.
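To make the GRPO idea concrete, here is a minimal sketch of its group-relative step: sample several responses to one prompt, score each with a simple rule, and normalize each score against the group's mean and standard deviation. The reward values (1.0 for a correct answer, 0.5 for reasoning wrapped in `<think>` tags), the function names, and the use of population standard deviation are illustrative assumptions, not DeepSeek's actual implementation. The clipped policy-ratio objective and KL penalty of full GRPO are omitted.

```python
from statistics import mean, pstdev


def toy_reward(response: str, reference: str) -> float:
    """Hypothetical rule-based reward: correctness plus a format bonus."""
    # Format bonus (assumed value): reasoning wrapped in <think>...</think>.
    has_format = response.startswith("<think>") and "</think>" in response
    # Treat whatever follows the last </think> tag as the final answer.
    final_answer = response.split("</think>")[-1].strip()
    correct = final_answer == reference
    return (1.0 if correct else 0.0) + (0.5 if has_format else 0.0)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own sampling group.

    No value network or learned reward model is involved: the group mean
    serves as the baseline. eps avoids division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std dev; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt ("What is 2 + 2?"), four sampled responses.
responses = [
    "<think>2 + 2 = 4</think>4",  # correct, well formatted
    "<think>2 + 2 = 5</think>5",  # wrong, well formatted
    "4",                          # correct, no reasoning tags
    "no idea",                    # wrong, no reasoning tags
]
rewards = [toy_reward(r, "4") for r in responses]
advantages = group_relative_advantages(rewards)

for response, reward, adv in zip(responses, rewards, advantages):
    print(f"reward={reward:.1f} advantage={adv:+.2f} {response!r}")
```

Responses that score above the group average receive a positive advantage and are reinforced, while those below it are penalized. This is why the approach needs multiple samples per prompt but no separate critic.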