DeepSeek R1 Paper Explained: What is it and How does it work?

πŸ‹ DeepSeek is the raising star in the field of AI research and development, making waves with their groundbreaking R1 model. In this blog post, I'll break down their recently published paper that details the architecture, training methodology, and capabilities of the R1 model.

The paper introduces DeepSeek R1, a large language model whose reasoning abilities are developed primarily through large-scale reinforcement learning on top of the DeepSeek-V3 base model. What makes it particularly interesting is its innovative training approach and impressive performance across a wide range of benchmarks.

Let's dive deep into the key aspects of the paper, exploring how DeepSeek R1 was developed, what makes it unique, and why it matters for the future of AI development.

How can you read lengthy, complex technical papers faster and in more depth? This blog uses rflow.ai to help with the analysis. The ResearchFlow digested version of the DeepSeek-R1 paper is here. The original paper's link is here.


DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Abstract

  • Introduces DeepSeek-R1-Zero and DeepSeek-R1
  • DeepSeek-R1-Zero: trained via large-scale RL without supervised fine-tuning (SFT)
  • DeepSeek-R1: incorporates multi-stage training and cold-start data before RL
  • DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks
  • Open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1

[Figure: Benchmark performance of DeepSeek-R1 compared with OpenAI-o1 and other baselines]


1. Introduction

What is DeepSeek R1?

DeepSeek R1 represents a significant advancement in AI development, utilizing reinforcement learning (RL) to enhance language models' reasoning capabilities. The project consists of two main variants:

  • DeepSeek-R1-Zero: A pioneering model trained purely through reinforcement learning, without any supervised fine-tuning
  • DeepSeek-R1: An enhanced version that combines cold-start data with multi-stage training

The introduction also highlights several broader points:

  • Post-training has become an important component of the full training pipeline
  • OpenAI's o1 series introduced inference-time scaling by lengthening the chain-of-thought reasoning process
  • Effective test-time scaling remains an open research question
  • Previous approaches include process-based reward models, reinforcement learning, and search algorithms
  • Goal: explore the potential of LLMs to develop reasoning capabilities without any supervised data

1.1 Contributions

Post-Training: Large-Scale Reinforcement Learning on the Base Model

  • Applied RL directly to base model without SFT
  • DeepSeek-R1-Zero demonstrates capabilities like self-verification, reflection, and generating long CoTs
  • First open research to validate that the reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT
  • Introduced pipeline to develop DeepSeek-R1 with two RL stages and two SFT stages

Distillation: Smaller Models Can Be Powerful Too

  • Demonstrated reasoning patterns of larger models can be distilled into smaller models
  • Open-sourced DeepSeek-R1 and its API for future research
  • Fine-tuned several dense models using reasoning data generated by DeepSeek-R1
  • Open-sourced distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series

1.2 Summary of Evaluation Results

Reasoning tasks

  • DeepSeek-R1 achieves 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217
  • 97.3% on MATH-500, on par with OpenAI-o1-1217
  • 2,029 Elo rating on Codeforces, outperforming 96.3% of human participants

Knowledge

  • Outstanding results on MMLU, MMLU-Pro, and GPQA Diamond
  • 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond
  • Outperforms DeepSeek-V3 on SimpleQA

Others

  • Excels in creative writing, general question answering, editing, summarization
  • 87.6% win-rate on AlpacaEval 2.0 and 92.3% on ArenaHard
  • Outstanding performance on long-context understanding tasks

2. Approach

The Innovation: Pure Reinforcement Learning

What makes DeepSeek R1 truly special is its novel approach to training. Unlike traditional language models that rely heavily on supervised learning, DeepSeek R1 demonstrates that complex reasoning capabilities can be developed primarily through reinforcement learning.

Key Technical Components

  • Group Relative Policy Optimization (GRPO) for efficient learning
  • Rule-based reward system with specific focus on reasoning accuracy
  • Innovative format rewards using tags to encourage explicit reasoning

2.1 Overview

  • Demonstrated reasoning capabilities can be improved through large-scale RL without SFT
  • Performance further enhanced with small amount of cold-start data
  • Three main components: DeepSeek-R1-Zero, DeepSeek-R1, and distillation to small dense models

2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model

2.2.1 Reinforcement Learning Algorithm

  • Group Relative Policy Optimization (GRPO)
  • Foregoes critic model, estimates baseline from group scores
  • Objective function:

[Equation: the GRPO objective, a clipped policy-ratio surrogate with group-normalized advantages and a KL penalty toward the reference policy]
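
To make the objective concrete, here is a minimal sketch of a GRPO update step, assuming sequence-level log-probabilities and rule-based scalar rewards. The function name, tensor shapes, and the values of `clip_eps` and `kl_beta` are illustrative assumptions, not the paper's implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_beta=0.04):
    """Sketch of a GRPO update for one prompt with a group of G sampled outputs.

    logp_new, logp_old, logp_ref: (G,) log-probabilities of each sampled output under
    the current, old (sampling-time), and reference policies. rewards: (G,) scalars.
    """
    # Group-relative advantage: normalize rewards within the group.
    # This replaces the learned critic/baseline used by standard PPO.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped importance-ratio surrogate, as in PPO.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # KL penalty toward the reference policy.
    ref_ratio = torch.exp(logp_ref - logp_new)
    kl = ref_ratio - torch.log(ref_ratio) - 1

    # The objective is maximized, so we return its negative as a loss.
    return -(surrogate - kl_beta * kl).mean()
```

Because the baseline comes from the group's own reward statistics, no separate value network has to be trained, which is the main efficiency argument for GRPO.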


2.2.2 Reward Modeling

  • Rule-based reward system
  • Accuracy rewards: evaluates correctness of response
  • Format rewards: enforces that the thinking process is enclosed between '<think>' and '</think>' tags
  • No outcome or process neural reward model used
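
A purely rule-based reward of this kind can be sketched in a few lines. The regular expressions, the exact-match check on a boxed answer, and the equal weighting of the two terms below are illustrative assumptions rather than the paper's actual rules.

```python
import re

def format_reward(response: str) -> float:
    """Reward the required structure: reasoning in <think>...</think>, then the answer."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, response, flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Check the final answer, e.g. a math result required to appear inside \\boxed{}."""
    match = re.search(r"\\boxed\{(.+?)\}", response)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # Rule-based only: no neural outcome or process reward model is involved.
    return accuracy_reward(response, ground_truth) + format_reward(response)
```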

2.2.3 Training Template

  • Simple template guiding base model to adhere to instructions
  • Requires reasoning process followed by final answer
  • Limited constraints to structural format, avoiding content-specific biases
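
For reference, a template of the kind described might look like the following. The wording is a paraphrase, not a verbatim copy of the paper's prompt.

```python
# Illustrative paraphrase of an R1-Zero-style training template; the exact wording
# in the paper may differ.
TEMPLATE = (
    "A conversation between User and Assistant. The User asks a question, and the "
    "Assistant solves it. The Assistant first thinks about the reasoning process and "
    "then provides the answer. The reasoning process and answer are enclosed within "
    "<think> </think> and <answer> </answer> tags, respectively.\n"
    "User: {question}\n"
    "Assistant:"
)

prompt = TEMPLATE.format(question="What is 17 * 24?")
```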

2.2.4 Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero

Performance of DeepSeek-R1-Zero

  • Steady improvement throughout RL training
  • AIME 2024 pass@1 score increased from 15.6% to 71.0%
  • Comparable to OpenAI-o1-0912
  • Majority voting further improves performance to 86.7% on AIME

Self-evolution Process of DeepSeek-R1-Zero

  • Thinking time consistently improved during training
  • Naturally acquired ability to solve complex reasoning tasks
  • Emergence of sophisticated behaviors like reflection and alternative approach exploration

Aha Moment of DeepSeek-R1-Zero

  • Intermediate version learned to allocate more thinking time
  • Reevaluates initial approach
  • Demonstrates power of reinforcement learning in developing advanced problem-solving strategies

Drawback of DeepSeek-R1-Zero

  • Poor readability
  • Language mixing issues

2.3 DeepSeek-R1: Reinforcement Learning with Cold Start

2.3.1 Cold Start

  • Constructed and collected small amount of long CoT data
  • Approaches: few-shot prompting, direct prompting, gathering DeepSeek-R1-Zero outputs, human annotation
  • Thousands of cold-start samples used to fine-tune DeepSeek-V3-Base as the starting point for RL
  • Advantages: improved readability, better performance potential

2.3.2 Reasoning-oriented Reinforcement Learning

  • Applied large-scale RL training after fine-tuning
  • Focus on reasoning-intensive tasks: coding, mathematics, science, logic reasoning
  • Introduced language consistency reward to mitigate language mixing
  • Final reward: sum of reasoning-task accuracy and the language consistency reward
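
The language consistency reward is described as the proportion of target-language words in the CoT, summed with the task accuracy. A minimal sketch, where the crude ASCII-based language check is purely an assumption for illustration, could look like this:

```python
import re

def language_consistency_reward(cot: str) -> float:
    """Fraction of chain-of-thought words that look like the target language (English here).

    The ASCII-letters check is a crude stand-in for a real language-identification step.
    """
    words = cot.split()
    if not words:
        return 0.0
    target = [w for w in words if re.fullmatch(r"[A-Za-z0-9.,:;!?'\"()\-]+", w)]
    return len(target) / len(words)

def reasoning_rl_reward(task_accuracy: float, cot: str) -> float:
    # Task accuracy and language consistency are combined by summation.
    return task_accuracy + language_consistency_reward(cot)
```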

2.3.3 Rejection Sampling and Supervised Fine-Tuning

  • Used converged RL checkpoint to collect SFT data
  • Incorporated data from other domains (writing, role-playing, general-purpose tasks)
  • Reasoning data: curated prompts, rejection sampling, generative reward model
  • Non-reasoning data: adopted DeepSeek-V3 pipeline, reused portions of DeepSeek-V3 SFT dataset
  • Fine-tuned DeepSeek-V3-Base for two epochs using ~800k samples
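
Rejection sampling for the reasoning portion of the SFT set can be sketched as follows; `generate` and `is_acceptable` are hypothetical helpers standing in for the converged RL checkpoint's sampler and for the rule-based and generative-reward-model judging described in the paper.

```python
def collect_reasoning_sft_data(prompts, generate, is_acceptable, samples_per_prompt=16):
    """Rejection sampling: keep only responses that pass the correctness/readability filters."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(samples_per_prompt)]
        kept = [c for c in candidates if is_acceptable(prompt, c)]
        # Mixed-language, overly long, or unreadable chains of thought would be filtered here too.
        dataset.extend({"prompt": prompt, "response": c} for c in kept)
    return dataset
```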

2.3.4 Reinforcement Learning for all Scenarios

  • Secondary RL stage to improve helpfulness and harmlessness
  • Combined reward signals and diverse prompt distributions
  • Reasoning data: rule-based rewards
  • General data: reward models to capture human preferences
  • Helpfulness: focused on final summary
  • Harmlessness: evaluated entire response

2.4 Distillation: Empower Small Models with Reasoning Capability

  • Fine-tuned open-source models (Qwen, Llama) using 800k samples from DeepSeek-R1
  • Base models: Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, Llama-3.3-70B-Instruct
  • Applied only SFT, no RL stage for distilled models
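
Because distillation here is plain SFT on R1-generated samples, the core update is ordinary next-token cross-entropy on the teacher's responses. The sketch below uses Hugging Face Transformers; the checkpoint name, learning rate, and the unmasked prompt tokens are simplifying assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # illustrative; the paper distills into Qwen2.5/Llama-3 checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_step(prompt: str, teacher_response: str) -> float:
    """One distillation step: supervised fine-tuning on a DeepSeek-R1-generated sample."""
    text = prompt + teacher_response + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Standard causal-LM objective; a fuller setup would mask the prompt tokens in the labels.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```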

3. Experiment

Performance and Capabilities

The results are impressive. DeepSeek R1 has achieved:

  • 79.8% accuracy on AIME 2024 mathematical problems
  • 97.3% success rate on MATH-500 challenges
  • 90.8% accuracy on MMLU, demonstrating broad knowledge capabilities
  • Outstanding performance in coding tasks, surpassing many existing models

Benchmarks

  • MMLU, MMLU-Redux, MMLU-Pro, C-Eval, CMMLU, IFEval, FRAMES, GPQA Diamond, SimpleQA, C-SimpleQA, SWE-Bench Verified, Aider, LiveCodeBench, Codeforces, CNMO 2024, AIME 2024
  • Open-ended generation tasks: AlpacaEval 2.0, Arena-Hard

Evaluation Prompts

  • Standard benchmarks: simple-evals framework
  • MMLU-Redux: Zero-Eval prompt format
  • MMLU-Pro, C-Eval, CLUE-WSC: modified to zero-shot setting
  • Code and math benchmarks: specific evaluation protocols

Baselines

  • DeepSeek-V3, Claude-Sonnet-3.5-1022, GPT-4o-0513, OpenAI-o1-mini, OpenAI-o1-1217
  • For distilled models: QwQ-32B-Preview

Evaluation Setup

  • Maximum generation length: 32,768 tokens
  • Used pass@k evaluation with non-zero temperature
  • Sampling temperature: 0.6, top-p value: 0.95
  • Reported pass@1 and consensus (majority vote) results
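
The pass@1 and consensus metrics described above reduce to simple averaging and majority voting over the k sampled responses; `extract_answer` and `is_correct` below are hypothetical helpers.

```python
from collections import Counter

def pass_at_1(samples, extract_answer, is_correct):
    """pass@1 as used here: average correctness over k non-greedy samples."""
    return sum(is_correct(extract_answer(s)) for s in samples) / len(samples)

def consensus(samples, extract_answer, is_correct):
    """cons@k: majority-vote the extracted answers, then score the winning answer."""
    votes = Counter(extract_answer(s) for s in samples)
    majority_answer, _ = votes.most_common(1)[0]
    return 1.0 if is_correct(majority_answer) else 0.0
```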

3.1 DeepSeek-R1 Evaluation

Knowledge Benchmarks

  • Superior performance on MMLU, MMLU-Pro, GPQA Diamond compared to DeepSeek-V3
  • Significant gains in STEM-related questions
  • Excels on FRAMES, demonstrating strong document analysis capabilities
  • Outperforms DeepSeek-V3 on SimpleQA

Reasoning and Problem-Solving

  • Impressive results on IF-Eval, AlpacaEval2.0, and ArenaHard
  • Strong performance in writing tasks and open-domain question answering
  • Concise summary lengths: average 689 tokens on ArenaHard, 2,218 characters on AlpacaEval 2.0

Math and Coding Tasks

  • Performance on par with OpenAI-o1-1217 on math tasks
  • Dominates LiveCodeBench and Codeforces benchmarks
  • Comparable performance to OpenAI-o1-1217 on SWE-Bench Verified

3.2 Distilled Model Evaluation

  • DeepSeek-R1-Distill-Qwen-7B outperforms GPT-4o-0513 and Claude-3.5-Sonnet across the board
  • DeepSeek-R1-Distill-Qwen-14B surpasses QwQ-32B-Preview on all evaluation metrics
  • DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Llama-70B significantly exceed o1-mini on most benchmarks

4. Discussion

Distillation: Making Advanced AI More Accessible

Perhaps one of the most exciting aspects of DeepSeek R1 is its ability to transfer its reasoning capabilities to smaller models. The team has successfully created more compact versions ranging from 1.5B to 70B parameters, making advanced AI reasoning more accessible for practical applications.

Challenges and Future Directions

While DeepSeek R1 represents a significant breakthrough, there are still areas for improvement:

  • Enhancing general capabilities like function calling and multi-turn interactions
  • Improving handling of multiple languages
  • Optimizing zero-shot performance
  • Refining software engineering capabilities

4.1 Distillation vs. Reinforcement Learning

  • Comparison of distilled models and RL-trained models
  • DeepSeek-R1-Distill-Qwen-32B outperforms DeepSeek-R1-Zero-Qwen-32B across all benchmarks
  • Conclusions:
    1. Distilling more powerful models into smaller ones yields excellent results
    2. Advancing beyond current boundaries may require more powerful base models and larger-scale RL

4.2 Unsuccessful Attempts

Process Reward Model (PRM)

  • Limitations: Difficulty in defining fine-grain steps, challenges in determining correctness of intermediate steps, potential for reward hacking

Monte Carlo Tree Search (MCTS)

  • Challenges: Large search space in token generation, difficulty in training fine-grained value model, complexity in iteratively improving model performance

5. Conclusion, Limitations, and Future Work

DeepSeek R1 represents a paradigm shift in AI development, demonstrating that sophisticated reasoning capabilities can be achieved through reinforcement learning alone. This breakthrough opens new possibilities for creating more capable and efficient AI systems, while making advanced AI reasoning more accessible through successful distillation to smaller models.

Conclusion

  • DeepSeek-R1-Zero: Pure RL approach without cold-start data
  • DeepSeek-R1: Leverages cold-start data and iterative RL fine-tuning
  • Successful distillation of reasoning capability to small dense models

Limitations and Future Work

  • General Capability: Improving function calling, multi-turn, complex role-playing, and JSON output
  • Language Mixing: Addressing issues with handling queries in languages other than English or Chinese
  • Prompting Engineering: Optimizing for zero-shot settings
  • Software Engineering Tasks: Implementing rejection sampling or asynchronous evaluations during RL process