DeepSeek R1 Paper Explained: What is it and How does it work?

πŸ‹ DeepSeek is the raising star in the field of AI research and development, making waves with their groundbreaking R1 model. In this blog post, I'll break down their recently published paper that details the architecture, training methodology, and capabilities of the R1 model.

The paper introduces DeepSeek R1, a large language model whose reasoning abilities are developed primarily through large-scale reinforcement learning on top of the DeepSeek-V3 base model. What makes it particularly interesting is its innovative training approach and impressive performance across a wide range of benchmarks.

Let's dive deep into the key aspects of the paper, exploring how DeepSeek R1 was developed, what makes it unique, and why it matters for the future of AI development.

How can you read lengthy, complex technical papers faster and in more depth? This blog uses rflow.ai to help with the analysis. The ResearchFlow digested version of the DeepSeek-R1 paper is here. The original paper's link is here.


DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Abstract

  • Introduces DeepSeek-R1-Zero and DeepSeek-R1
  • DeepSeek-R1-Zero: trained via large-scale RL without supervised fine-tuning (SFT)
  • DeepSeek-R1: incorporates multi-stage training and cold-start data before RL
  • DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks
  • Open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1

[Figure: Benchmark performance of DeepSeek-R1 compared with OpenAI-o1 and other baselines]


1. Introduction

What is DeepSeek R1?

DeepSeek R1 represents a significant advancement in AI development, utilizing reinforcement learning (RL) to enhance language models' reasoning capabilities. The project consists of two main variants:

  • DeepSeek-R1-Zero: A pioneering model trained purely through reinforcement learning, without any supervised fine-tuning
  • DeepSeek-R1: An enhanced version that combines cold-start data with multi-stage training

The introduction also highlights several broader points:

  • Post-training has become an important component of the full training pipeline
  • OpenAI's o1 series introduced inference-time scaling by lengthening the chain-of-thought reasoning process
  • Effective test-time scaling remains an open research question
  • Previous approaches include process-based reward models, reinforcement learning, and search algorithms
  • Goal: explore the potential of LLMs to develop reasoning capabilities without any supervised data

1.1 Contributions

Post-Training: Large-Scale Reinforcement Learning on the Base Model

  • Applied RL directly to base model without SFT
  • DeepSeek-R1-Zero demonstrates capabilities like self-verification, reflection, and generating long CoTs
  • First open research to validate that the reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT
  • Introduced pipeline to develop DeepSeek-R1 with two RL stages and two SFT stages

Distillation: Smaller Models Can Be Powerful Too

  • Demonstrated reasoning patterns of larger models can be distilled into smaller models
  • Open-sourced DeepSeek-R1 and its API for future research
  • Fine-tuned several dense models using reasoning data generated by DeepSeek-R1
  • Open-sourced distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series

1.2 Summary of Evaluation Results

Reasoning tasks

  • DeepSeek-R1 achieves 79.8% Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217
  • 97.3% on MATH-500, on par with OpenAI-o1-1217
  • 2,029 Elo rating on Codeforces, outperforming 96.3% of human participants

Knowledge

  • Outstanding results on MMLU, MMLU-Pro, and GPQA Diamond
  • 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond
  • Outperforms DeepSeek-V3 on SimpleQA

Others

  • Excels in creative writing, general question answering, editing, summarization
  • 87.6% win-rate on AlpacaEval 2.0 and 92.3% on ArenaHard
  • Outstanding performance on long-context understanding tasks

2. Approach

The Innovation: Pure Reinforcement Learning

What makes DeepSeek R1 truly special is its novel approach to training. Unlike traditional language models that rely heavily on supervised learning, DeepSeek R1 demonstrates that complex reasoning capabilities can be developed primarily through reinforcement learning.

Key Technical Components

  • Group Relative Policy Optimization (GRPO) for efficient learning
  • Rule-based reward system with specific focus on reasoning accuracy
  • Innovative format rewards using tags to encourage explicit reasoning

2.1 Overview

  • Demonstrated reasoning capabilities can be improved through large-scale RL without SFT
  • Performance further enhanced with small amount of cold-start data
  • Three main components: DeepSeek-R1-Zero, DeepSeek-R1, and distillation to small dense models

2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model

2.2.1 Reinforcement Learning Algorithm

  • Group Relative Policy Optimization (GRPO)
  • Foregoes critic model, estimates baseline from group scores
  • Objective function:

[Equation: the GRPO objective, a clipped policy-ratio surrogate with group-normalized advantages and a KL penalty toward the reference policy]
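
To make the objective concrete, here is a minimal sketch of a GRPO update step, assuming sequence-level log-probabilities and rule-based scalar rewards. The function name, tensor shapes, and the values of `clip_eps` and `kl_beta` are illustrative assumptions, not the paper's implementation.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_beta=0.04):
    """Sketch of a GRPO update for one prompt with a group of G sampled outputs.

    logp_new, logp_old, logp_ref: (G,) log-probabilities of each sampled output under
    the current, old (sampling-time), and reference policies. rewards: (G,) scalars.
    """
    # Group-relative advantage: normalize rewards within the group.
    # This replaces the learned critic/baseline used by standard PPO.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped importance-ratio surrogate, as in PPO.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # KL penalty toward the reference policy.
    ref_ratio = torch.exp(logp_ref - logp_new)
    kl = ref_ratio - torch.log(ref_ratio) - 1

    # The objective is maximized, so we return its negative as a loss.
    return -(surrogate - kl_beta * kl).mean()
```

Because the baseline comes from the group's own reward statistics, no separate value network has to be trained, which is the main efficiency argument for GRPO.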


2.2.2 Reward Modeling

  • Rule-based reward system
  • Accuracy rewards: evaluates correctness of response
  • Format rewards: enforces that the thinking process is enclosed between '<think>' and '</think>' tags
  • No outcome or process neural reward model used
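
A purely rule-based reward of this kind can be sketched in a few lines. The regular expressions, the exact-match check on a boxed answer, and the equal weighting of the two terms below are illustrative assumptions rather than the paper's actual rules.

```python
import re

def format_reward(response: str) -> float:
    """Reward the required structure: reasoning in <think>...</think>, then the answer."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, response, flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Check the final answer, e.g. a math result required to appear inside \\boxed{}."""
    match = re.search(r"\\boxed\{(.+?)\}", response)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # Rule-based only: no neural outcome or process reward model is involved.
    return accuracy_reward(response, ground_truth) + format_reward(response)
```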

2.2.3 Training Template

  • Simple template guiding base model to adhere to instructions
  • Requires reasoning process followed by final answer
  • Limited constraints to structural format, avoiding content-specific biases
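
For reference, a template of the kind described might look like the following. The wording is a paraphrase, not a verbatim copy of the paper's prompt.

```python
# Illustrative paraphrase of an R1-Zero-style training template; the exact wording
# in the paper may differ.
TEMPLATE = (
    "A conversation between User and Assistant. The User asks a question, and the "
    "Assistant solves it. The Assistant first thinks about the reasoning process and "
    "then provides the answer. The reasoning process and answer are enclosed within "
    "<think> </think> and <answer> </answer> tags, respectively.\n"
    "User: {question}\n"
    "Assistant:"
)

prompt = TEMPLATE.format(question="What is 17 * 24?")
```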

2.2.4 Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero

Performance of DeepSeek-R1-Zero

  • Steady improvement throughout RL training
  • AIME 2024 pass@1 score increased from 15.6% to 71.0%
  • Comparable to OpenAI-o1-0912
  • Majority voting further improves performance to 86.7% on AIME

Self-evolution Process of DeepSeek-R1-Zero

  • Thinking time consistently improved during training
  • Naturally acquired ability to solve complex reasoning tasks
  • Emergence of sophisticated behaviors like reflection and alternative approach exploration

Aha Moment of DeepSeek-R1-Zero

  • Intermediate version learned to allocate more thinking time
  • Reevaluates initial approach
  • Demonstrates power of reinforcement learning in developing advanced problem-solving strategies

Drawback of DeepSeek-R1-Zero

  • Poor readability
  • Language mixing issues

2.3 DeepSeek-R1: Reinforcement Learning with Cold Start

2.3.1 Cold Start

  • Constructed and collected small amount of long CoT data
  • Approaches: few-shot prompting, direct prompting, gathering DeepSeek-R1-Zero outputs, human annotation
  • Thousands of cold-start samples used to fine-tune DeepSeek-V3-Base as the starting point for RL
  • Advantages: improved readability, better performance potential

2.3.2 Reasoning-oriented Reinforcement Learning

  • Applied large-scale RL training after fine-tuning
  • Focus on reasoning-intensive tasks: coding, mathematics, science, logic reasoning
  • Introduced language consistency reward to mitigate language mixing
  • Final reward: sum of reasoning-task accuracy and the language consistency reward
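
The language consistency reward is described as the proportion of target-language words in the CoT, summed with the task accuracy. A minimal sketch, where the crude ASCII-based language check is purely an assumption for illustration, could look like this:

```python
import re

def language_consistency_reward(cot: str) -> float:
    """Fraction of chain-of-thought words that look like the target language (English here).

    The ASCII-letters check is a crude stand-in for a real language-identification step.
    """
    words = cot.split()
    if not words:
        return 0.0
    target = [w for w in words if re.fullmatch(r"[A-Za-z0-9.,:;!?'\"()\-]+", w)]
    return len(target) / len(words)

def reasoning_rl_reward(task_accuracy: float, cot: str) -> float:
    # Task accuracy and language consistency are combined by summation.
    return task_accuracy + language_consistency_reward(cot)
```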

2.3.3 Rejection Sampling and Supervised Fine-Tuning

  • Used converged RL checkpoint to collect SFT data
  • Incorporated data from other domains (writing, role-playing, general-purpose tasks)
  • Reasoning data: curated prompts, rejection sampling, generative reward model
  • Non-reasoning data: adopted DeepSeek-V3 pipeline, reused portions of DeepSeek-V3 SFT dataset
  • Fine-tuned DeepSeek-V3-Base for two epochs using ~800k samples
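
Rejection sampling for the reasoning portion of the SFT set can be sketched as follows; `generate` and `is_acceptable` are hypothetical helpers standing in for the converged RL checkpoint's sampler and for the rule-based and generative-reward-model judging described in the paper.

```python
def collect_reasoning_sft_data(prompts, generate, is_acceptable, samples_per_prompt=16):
    """Rejection sampling: keep only responses that pass the correctness/readability filters."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(samples_per_prompt)]
        kept = [c for c in candidates if is_acceptable(prompt, c)]
        # Mixed-language, overly long, or unreadable chains of thought would be filtered here too.
        dataset.extend({"prompt": prompt, "response": c} for c in kept)
    return dataset
```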

2.3.4 Reinforcement Learning for all Scenarios

  • Secondary RL stage to improve helpfulness and harmlessness
  • Combined reward signals and diverse prompt distributions
  • Reasoning data: rule-based rewards
  • General data: reward models to capture human preferences
  • Helpfulness: focused on final summary
  • Harmlessness: evaluated entire response

2.4 Distillation: Empower Small Models with Reasoning Capability

  • Fine-tuned open-source models (Qwen, Llama) using 800k samples from DeepSeek-R1
  • Base models: Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, Llama-3.3-70B-Instruct
  • Applied only SFT, no RL stage for distilled models
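
Because distillation here is plain SFT on R1-generated samples, the core update is ordinary next-token cross-entropy on the teacher's responses. The sketch below uses Hugging Face Transformers; the checkpoint name, learning rate, and the unmasked prompt tokens are simplifying assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # illustrative; the paper distills into Qwen2.5/Llama-3 checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_step(prompt: str, teacher_response: str) -> float:
    """One distillation step: supervised fine-tuning on a DeepSeek-R1-generated sample."""
    text = prompt + teacher_response + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Standard causal-LM objective; a fuller setup would mask the prompt tokens in the labels.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```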

3. Experiment

Performance and Capabilities

The results are impressive. DeepSeek R1 has achieved:

  • 79.8% accuracy on AIME 2024 mathematical problems
  • 97.3% success rate on MATH-500 challenges
  • 90.8% accuracy on MMLU, demonstrating broad knowledge capabilities
  • Outstanding performance in coding tasks, surpassing many existing models

Benchmarks

  • MMLU, MMLU-Redux, MMLU-Pro, C-Eval, CMMLU, IFEval, FRAMES, GPQA Diamond, SimpleQA, C-SimpleQA, SWE-Bench Verified, Aider, LiveCodeBench, Codeforces, CNMO 2024, AIME 2024
  • Open-ended generation tasks: AlpacaEval 2.0, Arena-Hard

Evaluation Prompts

  • Standard benchmarks: simple-evals framework
  • MMLU-Redux: Zero-Eval prompt format
  • MMLU-Pro, C-Eval, CLUE-WSC: modified to zero-shot setting
  • Code and math benchmarks: specific evaluation protocols

Baselines

  • DeepSeek-V3, Claude-Sonnet-3.5-1022, GPT-4o-0513, OpenAI-o1-mini, OpenAI-o1-1217
  • For distilled models: QwQ-32B-Preview

Evaluation Setup

  • Maximum generation length: 32,768 tokens
  • Used pass@k evaluation with non-zero temperature
  • Sampling temperature: 0.6, top-p value: 0.95
  • Reported pass@1 and consensus (majority vote) results
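
The pass@1 and consensus metrics described above reduce to simple averaging and majority voting over the k sampled responses; `extract_answer` and `is_correct` below are hypothetical helpers.

```python
from collections import Counter

def pass_at_1(samples, extract_answer, is_correct):
    """pass@1 as used here: average correctness over k non-greedy samples."""
    return sum(is_correct(extract_answer(s)) for s in samples) / len(samples)

def consensus(samples, extract_answer, is_correct):
    """cons@k: majority-vote the extracted answers, then score the winning answer."""
    votes = Counter(extract_answer(s) for s in samples)
    majority_answer, _ = votes.most_common(1)[0]
    return 1.0 if is_correct(majority_answer) else 0.0
```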

3.1 DeepSeek-R1 Evaluation

Knowledge Benchmarks

  • Superior performance on MMLU, MMLU-Pro, GPQA Diamond compared to DeepSeek-V3
  • Significant gains in STEM-related questions
  • Excels on FRAMES, demonstrating strong document analysis capabilities
  • Outperforms DeepSeek-V3 on SimpleQA

Reasoning and Problem-Solving

  • Impressive results on IF-Eval, AlpacaEval2.0, and ArenaHard
  • Strong performance in writing tasks and open-domain question answering
  • Concise summary lengths: average 689 tokens on ArenaHard, 2,218 characters on AlpacaEval 2.0

Math and Coding Tasks

  • Performance on par with OpenAI-o1-1217 on math tasks
  • Dominates LiveCodeBench and Codeforces benchmarks
  • Comparable performance to OpenAI-o1-1217 on SWE-Bench Verified

3.2 Distilled Model Evaluation

  • DeepSeek-R1-Distill-Qwen-7B outperforms GPT-4o-0513 and Claude-3.5-Sonnet across the board
  • DeepSeek-R1-Distill-Qwen-14B surpasses QwQ-32B-Preview on all evaluation metrics
  • DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Llama-70B significantly exceed o1-mini on most benchmarks

4. Discussion

Distillation: Making Advanced AI More Accessible

Perhaps one of the most exciting aspects of DeepSeek R1 is its ability to transfer its reasoning capabilities to smaller models. The team has successfully created more compact versions ranging from 1.5B to 70B parameters, making advanced AI reasoning more accessible for practical applications.

Challenges and Future Directions

While DeepSeek R1 represents a significant breakthrough, there are still areas for improvement:

  • Enhancing general capabilities like function calling and multi-turn interactions
  • Improving handling of multiple languages
  • Optimizing zero-shot performance
  • Refining software engineering capabilities

4.1 Distillation vs. Reinforcement Learning

  • Comparison of distilled models and RL-trained models
  • DeepSeek-R1-Distill-Qwen-32B outperforms DeepSeek-R1-Zero-Qwen-32B across all benchmarks
  • Conclusions:
    1. Distilling more powerful models into smaller ones yields excellent results
    2. Advancing beyond current boundaries may require more powerful base models and larger-scale RL

4.2 Unsuccessful Attempts

Process Reward Model (PRM)

  • Limitations: Difficulty in defining fine-grain steps, challenges in determining correctness of intermediate steps, potential for reward hacking

Monte Carlo Tree Search (MCTS)

  • Challenges: Large search space in token generation, difficulty in training fine-grained value model, complexity in iteratively improving model performance

5. Conclusion, Limitations, and Future Work

DeepSeek R1 represents a paradigm shift in AI development, demonstrating that sophisticated reasoning capabilities can be achieved through reinforcement learning alone. This breakthrough opens new possibilities for creating more capable and efficient AI systems, while making advanced AI reasoning more accessible through successful distillation to smaller models.

Conclusion

  • DeepSeek-R1-Zero: Pure RL approach without cold-start data
  • DeepSeek-R1: Leverages cold-start data and iterative RL fine-tuning
  • Successful distillation of reasoning capability to small dense models

Limitations and Future Work

  • General Capability: Improving function calling, multi-turn, complex role-playing, and JSON output
  • Language Mixing: Addressing issues with handling queries in languages other than English or Chinese
  • Prompting Engineering: Optimizing for zero-shot settings
  • Software Engineering Tasks: Implementing rejection sampling or asynchronous evaluations during RL process