DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Abstract
- Open-source large language models (LLMs) have developed rapidly
- DeepSeek LLM project aims to advance open-source LLMs with a long-term perspective
- Developed scaling laws for hyperparameters and model/data allocation
- Pre-training dataset of 2 trillion tokens
- Conducted supervised fine-tuning (SFT) and direct preference optimization (DPO)
- DeepSeek LLM 67B outperforms LLaMA-2 70B on various benchmarks, particularly code, mathematics, and reasoning
- DeepSeek LLM 67B Chat shows superior performance compared to GPT-3.5 in open-ended evaluations
1. Introduction
- LLMs based on decoder-only Transformers have become a cornerstone in the pursuit of AGI
- Self-supervised pre-training on massive datasets enables various abilities
- Supervised fine-tuning and reward modeling improve models' ability to follow user intentions
- Closed-source products like ChatGPT, Claude, and Bard have raised expectations for open-source LLMs
- LLaMA series became the de facto benchmark for open-source models
- Open-source community has focused on fixed-size models, leaving scaling-laws research underexplored
- DeepSeek investigates scaling behavior and applies findings to 7B and 67B configurations
- Aims to lay groundwork for future scaling of open-source LLMs
Key aspects of DeepSeek LLM
- 2 trillion tokens for pre-training in Chinese and English
- Architecture based on LLaMA with multi-step learning rate scheduler
- 1 million instances for supervised fine-tuning (SFT)
- Direct preference optimization (DPO) to improve conversational performance
- Extensive evaluations of base and chat models
- DeepSeek LLM 67B outperforms LLaMA-2 70B in various benchmarks
- DeepSeek 67B chat model surpasses GPT-3.5 in open-ended evaluations
2. Pre-Training
2.1 Data
- Objective: Enhance richness and diversity of dataset
- Three essential stages: deduplication, filtering, and remixing
- Aggressive deduplication strategy across multiple Common Crawl dumps
- Filtering stage focuses on document quality assessment
- Remixing phase addresses data imbalances
- Tokenizer: Byte-level Byte-Pair Encoding (BBPE) algorithm
- Vocabulary size: 100,015 tokens (100,000 conventional + 15 special)
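A minimal sketch of byte-level BPE tokenizer training with the Hugging Face `tokenizers` library, sized roughly to the vocabulary split above; the corpus files and special-token names are placeholders, and DeepSeek's actual tokenizer recipe is not reproduced here.

```python
# Illustrative BBPE tokenizer training (not DeepSeek's actual recipe).
# Corpus paths and special-token names below are placeholders.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus_en.txt", "corpus_zh.txt"],              # placeholder corpus shards
    vocab_size=100_015,                                     # ~100,000 conventional + 15 special tokens
    special_tokens=[f"<special_{i}>" for i in range(15)],   # placeholder special-token names
)

ids = tokenizer.encode("DeepSeek LLM scales open-source models.").ids
print(len(tokenizer.get_vocab()), ids[:8])
```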
2.2 Architecture
- Based on LLaMA design with some modifications
- Pre-Norm structure with RMSNorm function
- SwiGLU activation function for Feed-Forward Network (FFN)
- Rotary Embedding for positional encoding
- 67B model uses Grouped-Query Attention (GQA)
- DeepSeek LLM 7B: 30 layers
- DeepSeek LLM 67B: 95 layers
- Expanded the 67B model's parameters in network depth rather than FFN width, which gave better performance (a minimal sketch of RMSNorm and SwiGLU follows)
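A minimal PyTorch sketch of two of the LLaMA-style components listed above, RMSNorm pre-normalization and a SwiGLU feed-forward block; the dimensions are illustrative and this is not DeepSeek's implementation (Rotary Embedding and GQA are omitted for brevity).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm used in Pre-Norm residual branches."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU FFN: W_down(SiLU(x W_gate) * x W_up)."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Illustrative usage with made-up dimensions (not the 7B/67B configs).
x = torch.randn(2, 16, 512)
y = SwiGLU(512, 1376)(RMSNorm(512)(x))
```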
2.3 Hyperparameters
- Initialization: standard deviation of 0.006
- Optimizer: AdamW (β₁ = 0.9, β₂ = 0.95, weight_decay = 0.1)
- Learning rate scheduler: multi-step
- Gradient clipping: 1.0
- Batch size and learning rate vary with model size (an illustrative optimizer and scheduler setup follows)
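An illustrative PyTorch setup for the optimizer, multi-step schedule, and gradient clipping listed above; the stand-in model, peak learning rate, milestone positions, and decay factor are placeholders rather than DeepSeek's actual values.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(256, 256)              # stand-in for the transformer

# AdamW with beta1=0.9, beta2=0.95, weight_decay=0.1; the peak LR is a placeholder.
optimizer = AdamW(model.parameters(), lr=4.2e-4, betas=(0.9, 0.95), weight_decay=0.1)

# Multi-step decay; milestone steps and decay factor here are illustrative only.
scheduler = MultiStepLR(optimizer, milestones=[800, 900], gamma=0.316)

for step in range(1_000):
    loss = model(torch.randn(8, 256)).pow(2).mean()                   # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clipping at 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```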
2.4 Infrastructure
- Training framework: HAI-LLM
- Parallelism: data, tensor, sequence, and 1F1B pipeline
- Flash attention technique
- ZeRO-1 for optimizer state partitioning
- Overlapped computation and communication
- Fused layers/operators
- BF16 precision training with FP32 gradient accumulation
- In-place cross-entropy
- Asynchronous model checkpointing
- Evaluation: vLLM for generative tasks, continuous batching for non-generative tasks
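As a rough illustration of the generative-evaluation path, a vLLM call along the lines below could be used; the checkpoint id, prompt, and sampling parameters are placeholders, and the actual evaluation harness is not described in the source.

```python
# Sketch of generation-based evaluation served through vLLM.
# Checkpoint id, prompt, and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/deepseek-llm-7b-base")        # placeholder model id
params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["Question: What is 17 * 24?\nAnswer:"]          # illustrative prompt
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```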
3. Scaling Laws
3.1 Scaling Laws for Hyperparameters
- Studied optimal batch size and learning rate for different compute budgets
- Modeled power law relationship between compute budget and optimal hyperparameters
- Formulae for optimal batch size and learning rate:
- η_opt = 0.3118 · C^(-0.1250)
- B_opt = 0.2920 · C^(0.3271)
- Validated the formulae on models trained with a 1e20 compute budget (evaluated in the sketch below)
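The fitted power laws can be evaluated directly for a given compute budget C (in FLOPs); the helper below is a small sketch, with C = 1e20 used only because it is the validation budget mentioned above.

```python
# Evaluate the fitted hyperparameter scaling laws for a compute budget C (FLOPs).
def optimal_hyperparams(C: float) -> tuple[float, float]:
    eta_opt = 0.3118 * C ** (-0.1250)   # optimal learning rate
    B_opt = 0.2920 * C ** 0.3271        # optimal batch size (units follow the original fit)
    return eta_opt, B_opt

eta, B = optimal_hyperparams(1e20)      # the validation budget mentioned above
print(f"learning rate ~ {eta:.2e}, batch size ~ {B:,.0f}")
```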
3.2 Estimating Optimal Model and Data Scaling
- Used IsoFLOP profile approach to fit scaling curve
- Introduced new model scale representation: non-embedding FLOPs/token (M)
- Compute budget C = M * D, where D is the data scale (number of training tokens)
- Fitted formulae for optimal non-embedding FLOPs/token and optimal tokens:
- M_opt = M_base · C^a, M_base = 0.1715, a = 0.5243
- D_opt = D_base · C^b, D_base = 5.8316, b = 0.4757
- Accurately predicted the performance of DeepSeek LLM 7B and 67B (allocation sketch below)
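Given C = M * D, the two fits above determine how to split a budget between model scale and data scale; the sketch below uses an arbitrary illustrative budget, not one taken from the paper. Note that M_base · D_base ≈ 1 and a + b = 1, so M_opt · D_opt ≈ C as required.

```python
# Split a compute budget C (FLOPs) into model scale M (non-embedding FLOPs/token)
# and data scale D (tokens) using the fitted exponents; M_opt * D_opt ~= C.
M_BASE, A = 0.1715, 0.5243
D_BASE, B = 5.8316, 0.4757

def optimal_allocation(C: float) -> tuple[float, float]:
    M_opt = M_BASE * C ** A     # non-embedding FLOPs per token
    D_opt = D_BASE * C ** B     # number of training tokens
    return M_opt, D_opt

M, D = optimal_allocation(3e22)                  # arbitrary illustrative budget
print(f"M_opt ~ {M:.3e} FLOPs/token, D_opt ~ {D:.3e} tokens, M*D ~ {M * D:.3e}")
```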
3.3 Scaling Laws with Different Data
- Analyzed impact of different datasets on scaling laws
- Higher quality data leads to increased model scaling exponent and decreased data scaling exponent
- Suggests allocating more compute budget to model scaling for high-quality data
4. Alignment
- Collected 1.5 million instruction data instances in English and Chinese
- Supervised Fine-Tuning (SFT):
- 7B model: 4 epochs
- 67B model: 2 epochs (to avoid overfitting)
- Learning rates: 1e-5 (7B) and 5e-6 (67B)
- Monitored benchmark accuracy and repetition ratio
- Direct Preference Optimization (DPO):
- Constructed preference data for helpfulness and harmlessness
- Trained for one epoch with learning rate 5e-6 and batch size 512
- Used learning rate warmup and cosine learning rate scheduler
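A hedged PyTorch sketch of the standard DPO objective (Rafailov et al., 2023) on summed per-response log-probabilities; the β value and the random tensors standing in for model outputs are illustrative, and the source does not state DeepSeek's settings beyond those listed above.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO loss on per-response log-probabilities (beta is illustrative)."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Encourage the policy to widen the chosen-vs-rejected margin
    # relative to the frozen reference model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage: random log-probabilities standing in for model outputs.
batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch), torch.randn(batch))
print(loss.item())
```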
5. Evaluation
5.1 Public Benchmark Evaluation
- Evaluated on various English and Chinese benchmarks
- Used perplexity-based and generation-based evaluation methods
- DeepSeek models showed performance comparable to LLaMA-2 on English benchmarks
- DeepSeek 67B achieved better results on MATH, GSM8K, HumanEval, MBPP, BBH, and the Chinese benchmarks (a likelihood-scoring sketch for the perplexity-based benchmarks follows)
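For the perplexity-based benchmarks, each candidate answer is scored by the model's likelihood of its continuation; the sketch below shows one common way to do this with Hugging Face transformers. The checkpoint id and the example question are placeholders, and length normalization and tokenization edge cases are omitted for brevity.

```python
# Sketch of likelihood-based multiple-choice scoring (checkpoint id and the
# example question are placeholders; length normalization is omitted).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-llm-7b-base"       # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

question = "Q: Which gas do plants absorb for photosynthesis?\nA:"
options = [" Oxygen", " Carbon dioxide", " Nitrogen", " Hydrogen"]

def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    # Token at position i+1 is predicted by the logits at position i.
    return sum(logprobs[i, full_ids[0, i + 1]].item()
               for i in range(prompt_len - 1, full_ids.shape[1] - 1))

best = max(options, key=lambda o: option_logprob(question, o))
print(best)
```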
5.2 Open-Ended Evaluation
- Chinese: Used the AlignBench test set
- DeepSeek 67B Chat outperformed ChatGPT and other baseline models
- DPO model showed improvement across almost all metrics
- English: Used MT-Bench benchmark
- DeepSeek LLM 67B Chat outperformed other open-source models
- Achieved score comparable to GPT-3.5-turbo
- DPO stage further improved average score to 8.76
5.3 Held-Out Evaluation
- LeetCode: Used problems from Weekly Contest (July 2023 to Nov 2023)
- Hungarian National High-School Exam: 33 problems, human-annotated
- Instruction Following Evaluation: Used the instruction-following evaluation dataset released by Google
- Results showed significant performance gap between large and small models on held-out datasets
5.4 Safety Evaluation
- Established 20-person expert team for safety content classification
- Constructed 2400 safety test questions
- Evaluated DeepSeek 67B Chat model
- Model exhibited good safety performance across safety test categories
- Used "Do-Not-Answer" dataset for additional evaluation
- DeepSeek 67B Chat achieved a score of 97.8, higher than both ChatGPT and GPT-4
5.5 Discussion
- Staged Fine-Tuning: Implemented a two-stage fine-tuning process for the 7B model (all data first, then conversational data)
- Multiple-Choice Questions: Tested adding 20 million Chinese multiple-choice questions
- Instruction Data in Pre-Training: Integrated 5 million instruction instances into the final 10% of pre-training
- System Prompt: Observed improved results with system prompt for 67B model
6. Conclusion, Limitation, and Future Work
- Introduced DeepSeek LLMs trained on 2 trillion tokens in English and Chinese
- Calibrated scaling laws and proposed new optimal model/data scaling-up allocation strategy
- Provided comprehensive evaluation of models
- Acknowledged limitations: lack of ongoing knowledge updates, potential for non-factual information, and hallucinations
- Future work:
- Release technical reports on code intelligence and Mixture-of-Experts (MoE)
- Construct larger and improved dataset for next version
- Study ways to deliver helpful, honest, and safe models
- Initial experiments show reinforcement learning could boost complex reasoning capability