DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Abstract

  • Open-source large language models (LLMs) have developed rapidly
  • DeepSeek LLM project aims to advance open-source LLMs with long-term perspective
  • Developed scaling laws for hyperparameters and model/data allocation
  • Pre-training dataset of 2 trillion tokens
  • Conducted supervised fine-tuning (SFT) and direct preference optimization (DPO)
  • DeepSeek LLM 67B outperforms LLaMA-2 70B on various benchmarks
  • DeepSeek LLM 67B Chat shows superior performance compared to GPT-3.5 in open-ended evaluations

1. Introduction

  • LLMs based on decoder-only Transformers have become a cornerstone in the pursuit of AGI
  • Self-supervised pre-training on massive datasets enables various abilities
  • Supervised fine-tuning and reward modeling improve the models' ability to follow user intentions
  • Closed products like ChatGPT, Claude, and Bard raised expectations for open-source LLMs
  • LLaMA series became de facto benchmark for open-source models
  • Open-source community focused on fixed-size models, neglecting scaling laws research
  • DeepSeek investigates scaling behavior and applies findings to 7B and 67B configurations
  • Aims to lay groundwork for future scaling of open-source LLMs

Key aspects of DeepSeek LLM

  • 2 trillion tokens for pre-training in Chinese and English
  • Architecture largely follows LLaMA; a multi-step learning rate scheduler replaces the cosine scheduler
  • 1 million instances for supervised fine-tuning (SFT)
  • Direct preference optimization (DPO) to improve conversational performance
  • Extensive evaluations of base and chat models
  • DeepSeek LLM 67B outperforms LLaMA-2 70B in various benchmarks
  • DeepSeek 67B chat model surpasses GPT-3.5 in open-ended evaluations

2. Pre-Training

2.1 Data

  • Objective: Enhance richness and diversity of dataset
  • Three essential stages: deduplication, filtering, and remixing
  • Aggressive deduplication strategy across multiple Common Crawl dumps
  • Filtering stage focuses on document quality assessment
  • Remixing phase addresses data imbalances
  • Tokenizer: Byte-level Byte-Pair Encoding (BBPE) algorithm (see the training sketch after this list)
  • Vocabulary size: 100,015 tokens (100,000 conventional + 15 special)
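
A minimal sketch of how a byte-level BPE tokenizer of this kind can be trained with the Hugging Face tokenizers library. The corpus paths, special tokens, and output directory are illustrative placeholders, not the project's actual configuration.

```python
# Hedged sketch: byte-level BPE (BBPE) tokenizer training with Hugging Face `tokenizers`.
# Corpus files, special tokens, and output path are placeholders, not DeepSeek's real setup.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus_en.txt", "corpus_zh.txt"],    # hypothetical bilingual corpus shards
    vocab_size=100_000,                          # conventional tokens, as noted above
    special_tokens=["<pad>", "<bos>", "<eos>"],  # illustrative special tokens only
)
tokenizer.save_model("bbpe_tokenizer")           # writes vocab.json and merges.txt
```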

2.2 Architecture

  • Based on LLaMA design with some modifications
  • Pre-Norm structure with RMSNorm function
  • SwiGLU activation function for the Feed-Forward Network (FFN); see the sketch after this list
  • Rotary Embedding for positional encoding
  • 67B model uses Grouped-Query Attention (GQA)
  • DeepSeek LLM 7B: 30 layers
  • DeepSeek LLM 67B: 95 layers
  • Expanded the 67B model's parameters in network depth rather than FFN width for better performance
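
For concreteness, a minimal PyTorch sketch of the SwiGLU feed-forward block used in LLaMA-style architectures; the dimensions are illustrative and do not reproduce DeepSeek LLM's exact FFN width.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward block as used in LLaMA-style models (dimensions illustrative)."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x W_gate) * (x W_up), projected back to the model dimension
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ffn = SwiGLUFFN(dim=4096, hidden_dim=11008)   # example sizes, not the paper's
y = ffn(torch.randn(2, 16, 4096))             # (batch, sequence, dim)
```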

2.3 Hyperparameters

  • Initialization: standard deviation of 0.006
  • Optimizer: AdamW (β₁ = 0.9, β₂ = 0.95, weight_decay = 0.1)
  • Learning rate scheduler: multi-step (see the scheduler sketch after this list)
  • Gradient clipping: 1.0
  • Batch size and learning rate vary with model size
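
A minimal sketch of the optimizer plus a warmup-and-multi-step schedule built with PyTorch's LambdaLR. The peak learning rate, warmup length, milestone fractions, and decay factors are illustrative assumptions, not the paper's exact values.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(1024, 1024)                     # stand-in module for illustration
optimizer = AdamW(model.parameters(), lr=3e-4,          # peak LR is illustrative
                  betas=(0.9, 0.95), weight_decay=0.1)  # values from the list above

total_steps, warmup_steps = 100_000, 2_000              # assumed training horizon

def multi_step(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear warmup to the peak LR
    frac = step / total_steps
    if frac < 0.8:
        return 1.0                          # hold the peak LR
    if frac < 0.9:
        return 0.316                        # first step-down (illustrative factor)
    return 0.1                              # second step-down (illustrative factor)

scheduler = LambdaLR(optimizer, lr_lambda=multi_step)   # call scheduler.step() each step
```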

2.4 Infrastructures

  • Training framework: HAI-LLM
  • Parallelism: data, tensor, sequence, and 1F1B pipeline
  • Flash attention technique
  • ZeRO-1 for optimizer state partitioning
  • Overlapped computation and communication
  • Fused layers/operators
  • BF16 precision training with FP32 gradient accumulation
  • In-place cross-entropy
  • Asynchronous model checkpointing
  • Evaluation: vLLM for generative tasks, continuous batching for non-generative tasks

3. Scaling Laws

3.1 Scaling Laws for Hyperparameters

  • Studied optimal batch size and learning rate for different compute budgets
  • Modeled power law relationship between compute budget and optimal hyperparameters
  • Formulae for optimal batch size and learning rate (evaluated in the small helper after this list):
    • η_opt = 0.3118 · C^(-0.1250)
    • B_opt = 0.2920 · C^(0.3271)
  • Validated formulae on models with 1e20 compute budget
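
A small helper that simply evaluates the fitted power laws above for a given compute budget C (in FLOPs); the example budget is the 1e20 validation point mentioned in the list.

```python
def optimal_hyperparams(compute_budget: float) -> tuple[float, float]:
    """Return (learning rate, batch size) predicted by the fitted power laws above."""
    lr_opt = 0.3118 * compute_budget ** (-0.1250)
    bs_opt = 0.2920 * compute_budget ** 0.3271
    return lr_opt, bs_opt

lr, bs = optimal_hyperparams(1e20)  # the validation budget cited above
print(f"predicted optimal LR ~ {lr:.2e}, batch size ~ {bs:,.0f}")
```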

3.2 Estimating Optimal Model and Data Scaling

  • Used IsoFLOP profile approach to fit scaling curve
  • Introduced new model scale representation: non-embedding FLOPs/token (M)
  • Compute budget C = M · D, where D is the number of training tokens
  • Fitted formulae for optimal non-embedding FLOPs/token and optimal training tokens (evaluated in the helper after this list):
    • M_opt = M_base · C^a, M_base = 0.1715, a = 0.5243
    • D_opt = D_base · C^b, D_base = 5.8316, b = 0.4757
  • Accurately predicted performance of DeepSeek LLM 7B and 67B
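
Likewise, a small helper evaluating the fitted allocation laws: given a compute budget C, it returns the optimal non-embedding FLOPs/token M and training tokens D, with M · D ≈ C. The example budget is illustrative, not taken from the paper.

```python
def optimal_allocation(compute_budget: float) -> tuple[float, float]:
    """Return (non-embedding FLOPs/token M, training tokens D) from the fits above."""
    m_opt = 0.1715 * compute_budget ** 0.5243
    d_opt = 5.8316 * compute_budget ** 0.4757
    return m_opt, d_opt

m, d = optimal_allocation(3.0e22)   # illustrative budget
print(f"M_opt ~ {m:.3e} FLOPs/token, D_opt ~ {d:.3e} tokens, M*D ~ {m * d:.2e}")
```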

3.3 Scaling Laws with Different Data

  • Analyzed impact of different datasets on scaling laws
  • Higher quality data leads to increased model scaling exponent and decreased data scaling exponent
  • Suggests allocating more compute budget to model scaling for high-quality data

4. Alignment

  • Collected 1.5 million instruction data instances in English and Chinese
  • Supervised Fine-Tuning (SFT):
    • 7B model: 4 epochs
    • 67B model: 2 epochs (to avoid overfitting)
    • Learning rates: 1e-5 (7B) and 5e-6 (67B)
    • Monitored benchmark accuracy and repetition ratio
  • Direct Preference Optimization (DPO; a loss sketch follows this list):
    • Constructed preference data for helpfulness and harmlessness
    • Trained for one epoch with learning rate 5e-6 and batch size 512
    • Used learning rate warmup and cosine learning rate scheduler
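
A minimal sketch of the standard DPO objective operating on per-sequence log-probabilities; the beta value is an illustrative assumption, and the snippet is not tied to the paper's exact training setup.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """All inputs are summed sequence log-probs of shape (batch,); beta is illustrative."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to prefer the chosen response over the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```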

5. Evaluation

5.1 Public Benchmark Evaluation

  • Evaluated on various English and Chinese benchmarks
  • Used perplexity-based and generation-based evaluation methods (a perplexity-based scoring sketch follows this list)
  • DeepSeek models showed comparable performance to LLaMA-2 on English benchmarks
  • DeepSeek 67B achieved better performance on MATH, GSM8K, HumanEval, MBPP, BBH, and Chinese benchmarks
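
A minimal sketch of perplexity-based multiple-choice scoring with Hugging Face transformers: each candidate answer is appended to the prompt, the sequence log-likelihood is computed, and the highest-scoring option wins. The checkpoint id is an assumption for illustration, and this is not the paper's exact evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/deepseek-llm-7b-base"   # assumed checkpoint id, for illustration
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

def choose(prompt: str, options: list[str]) -> int:
    """Pick the option whose concatenation with the prompt has the highest log-likelihood."""
    scores = []
    for opt in options:
        ids = tok(prompt + opt, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss            # mean NLL over predicted tokens
        scores.append(-loss.item() * (ids.shape[1] - 1))  # total log-likelihood
    return max(range(len(options)), key=scores.__getitem__)
```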

5.2 Open-Ended Evaluation

  • Chinese: Used AlignBench testset
    • DeepSeek 67B Chat outperformed ChatGPT and other baseline models
    • DPO model showed improvement across almost all metrics
  • English: Used MT-Bench benchmark
    • DeepSeek LLM 67B Chat outperformed other open-source models
    • Achieved score comparable to GPT-3.5-turbo
    • DPO stage further improved average score to 8.76

5.3 Held-Out Evaluation

  • LeetCode: Used problems from Weekly Contest (July 2023 to Nov 2023)
  • Hungarian National High-School Exam: 33 problems, human-annotated
  • Instruction Following Evaluation: Used Google's instruction-following evaluation dataset (IFEval)
  • Results showed significant performance gap between large and small models on held-out datasets

5.4 Safety Evaluation

  • Established 20-person expert team for safety content classification
  • Constructed 2400 safety test questions
  • Evaluated DeepSeek 67B Chat model
  • Model exhibited good safety performance across safety test categories
  • Used "Do-Not-Answer" dataset for additional evaluation
  • DeepSeek 67B Chat achieved a score of 97.8, higher than both ChatGPT and GPT-4

5.5 Discussion

  • Staged Fine-Tuning: Implemented two-stage process for 7B model
  • Multiple-Choice Questions: Tested adding 20 million Chinese multiple-choice questions
  • Instruction Data in Pre-Training: Integrated 5 million instruction data instances in the final 10% of pre-training
  • System Prompt: Observed improved results with system prompt for 67B model

6. Conclusion, Limitation, and Future Work

  • Introduced DeepSeek LLMs trained on 2 trillion tokens in English and Chinese
  • Calibrated scaling laws and proposed new optimal model/data scaling-up allocation strategy
  • Provided comprehensive evaluation of models
  • Acknowledged limitations: lack of ongoing knowledge updates, potential for non-factual information, and hallucinations
  • Future work:
    • Release technical reports on code intelligence and Mixture-of-Experts (MoE)
    • Construct larger and improved dataset for next version
    • Study ways to deliver helpful, honest, and safe models
    • Initial experiments show reinforcement learning could boost complex reasoning capability