DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Abstract
- Open-source large language models (LLMs) have developed rapidly
- DeepSeek LLM project aims to advance open-source LLMs with a long-term perspective
- Developed scaling laws for hyperparameters and model/data allocation
- Pre-training dataset of 2 trillion tokens
- Conducted supervised fine-tuning (SFT) and direct preference optimization (DPO)
- DeepSeek LLM 67B outperforms LLaMA-2 70B on various benchmarks, particularly code, mathematics, and reasoning
- DeepSeek LLM 67B Chat shows superior performance compared to GPT-3.5 in open-ended evaluations
1. Introduction
- LLMs based on decoder-only Transformers have become a cornerstone in the pursuit of AGI
- Self-supervised pre-training on massive datasets enables various abilities
- Supervised fine-tuning and reward modeling improve models' ability to follow user intentions
- Closed-source products like ChatGPT, Claude, and Bard have raised expectations for open-source LLMs
- LLaMA series became the de facto benchmark for open-source models
- Open-source community has focused on fixed-size models, leaving scaling-laws research underexplored
- DeepSeek investigates scaling behavior and applies findings to 7B and 67B configurations
- Aims to lay groundwork for future scaling of open-source LLMs
Key aspects of DeepSeek LLM
- 2 trillion tokens for pre-training in Chinese and English
- Architecture based on LLaMA with multi-step learning rate scheduler
- 1 million instances for supervised fine-tuning (SFT)
- Direct preference optimization (DPO) to improve conversational performance
- Extensive evaluations of base and chat models
- DeepSeek LLM 67B outperforms LLaMA-2 70B in various benchmarks
- DeepSeek 67B chat model surpasses GPT-3.5 in open-ended evaluations
2. Pre-Training
2.1 Data
- Objective: Enhance richness and diversity of dataset
- Three essential stages: deduplication, filtering, and remixing
- Aggressive deduplication strategy across multiple Common Crawl dumps
- Filtering stage focuses on document quality assessment
- Remixing phase addresses data imbalances
- Tokenizer: Byte-level Byte-Pair Encoding (BBPE) algorithm
- Vocabulary size: 100,015 tokens (100,000 conventional + 15 special)
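A minimal sketch of byte-level BPE tokenizer training with the Hugging Face `tokenizers` library, sized roughly to the vocabulary split above; the corpus files and special-token names are placeholders, and DeepSeek's actual tokenizer recipe is not reproduced here.

```python
# Illustrative BBPE tokenizer training (not DeepSeek's actual recipe).
# Corpus paths and special-token names below are placeholders.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus_en.txt", "corpus_zh.txt"],              # placeholder corpus shards
    vocab_size=100_015,                                     # ~100,000 conventional + 15 special tokens
    special_tokens=[f"<special_{i}>" for i in range(15)],   # placeholder special-token names
)

ids = tokenizer.encode("DeepSeek LLM scales open-source models.").ids
print(len(tokenizer.get_vocab()), ids[:8])
```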
2.2 Architecture
- Based on LLaMA design with some modifications
- Pre-Norm structure with RMSNorm function
- SwiGLU activation function for Feed-Forward Network (FFN)
- Rotary Embedding for positional encoding
- 67B model uses Grouped-Query Attention (GQA)
- DeepSeek LLM 7B: 30 layers
- DeepSeek LLM 67B: 95 layers
- Expanded the 67B model's parameters in network depth rather than FFN width, which gave better performance (a minimal sketch of RMSNorm and SwiGLU follows)
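A minimal PyTorch sketch of two of the LLaMA-style components listed above, RMSNorm pre-normalization and a SwiGLU feed-forward block; the dimensions are illustrative and this is not DeepSeek's implementation (Rotary Embedding and GQA are omitted for brevity).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm used in Pre-Norm residual branches."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU FFN: W_down(SiLU(x W_gate) * x W_up)."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Illustrative usage with made-up dimensions (not the 7B/67B configs).
x = torch.randn(2, 16, 512)
y = SwiGLU(512, 1376)(RMSNorm(512)(x))
```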
2.3 Hyperparameters
- Initialization: standard deviation of 0.006
- Optimizer: AdamW (β₁ = 0.9, β₂ = 0.95, weight_decay = 0.1)
- Learning rate scheduler: multi-step
- Gradient clipping: 1.0
- Batch size and learning rate vary with model size (an illustrative optimizer and scheduler setup follows)
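An illustrative PyTorch setup for the optimizer, multi-step schedule, and gradient clipping listed above; the stand-in model, peak learning rate, milestone positions, and decay factor are placeholders rather than DeepSeek's actual values.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(256, 256)              # stand-in for the transformer

# AdamW with beta1=0.9, beta2=0.95, weight_decay=0.1; the peak LR is a placeholder.
optimizer = AdamW(model.parameters(), lr=4.2e-4, betas=(0.9, 0.95), weight_decay=0.1)

# Multi-step decay; milestone steps and decay factor here are illustrative only.
scheduler = MultiStepLR(optimizer, milestones=[800, 900], gamma=0.316)

for step in range(1_000):
    loss = model(torch.randn(8, 256)).pow(2).mean()                   # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clipping at 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```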
2.4 Infrastructure
- Training framework: HAI-LLM
- Parallelism: data, tensor, sequence, and 1F1B pipeline
- Flash attention technique
- ZeRO-1 for optimizer state partitioning
- Overlapped computation and communication
- Fused layers/operators
- BF16 precision training with FP32 gradient accumulation
- In-place cross-entropy
- Asynchronous model checkpointing
- Evaluation: vLLM for generative tasks, continuous batching for non-generative tasks
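As a rough illustration of the generative-evaluation path, a vLLM call along the lines below could be used; the checkpoint id, prompt, and sampling parameters are placeholders, and the actual evaluation harness is not described in the source.

```python
# Sketch of generation-based evaluation served through vLLM.
# Checkpoint id, prompt, and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/deepseek-llm-7b-base")        # placeholder model id
params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["Question: What is 17 * 24?\nAnswer:"]          # illustrative prompt
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```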
3. Scaling Laws
3.1 Scaling Laws for Hyperparameters
- Studied optimal batch size and learning rate for different compute budgets
- Modeled power law relationship between compute budget and optimal hyperparameters
- Formulae for optimal batch size and learning rate:
- η_opt = 0.3118 · C^(-0.1250)
- B_opt = 0.2920 · C^(0.3271)
- Validated the formulae on models trained with a 1e20 compute budget (evaluated in the sketch below)
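The fitted power laws can be evaluated directly for a given compute budget C (in FLOPs); the helper below is a small sketch, with C = 1e20 used only because it is the validation budget mentioned above.

```python
# Evaluate the fitted hyperparameter scaling laws for a compute budget C (FLOPs).
def optimal_hyperparams(C: float) -> tuple[float, float]:
    eta_opt = 0.3118 * C ** (-0.1250)   # optimal learning rate
    B_opt = 0.2920 * C ** 0.3271        # optimal batch size (units follow the original fit)
    return eta_opt, B_opt

eta, B = optimal_hyperparams(1e20)      # the validation budget mentioned above
print(f"learning rate ~ {eta:.2e}, batch size ~ {B:,.0f}")
```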
3.2 Estimating Optimal Model and Data Scaling
- Used IsoFLOP profile approach to fit scaling curve
- Introduced new model scale representation: non-embedding FLOPs/token (M)
- Compute budget C = M * D, where D is the data scale (number of training tokens)
- Fitted formulae for optimal non-embedding FLOPs/token and optimal tokens:
- M_opt = M_base · C^a, M_base = 0.1715, a = 0.5243
- D_opt = D_base · C^b, D_base = 5.8316, b = 0.4757
- Accurately predicted the performance of DeepSeek LLM 7B and 67B (allocation sketch below)
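Given C = M * D, the two fits above determine how to split a budget between model scale and data scale; the sketch below uses an arbitrary illustrative budget, not one taken from the paper. Note that M_base · D_base ≈ 1 and a + b = 1, so M_opt · D_opt ≈ C as required.

```python
# Split a compute budget C (FLOPs) into model scale M (non-embedding FLOPs/token)
# and data scale D (tokens) using the fitted exponents; M_opt * D_opt ~= C.
M_BASE, A = 0.1715, 0.5243
D_BASE, B = 5.8316, 0.4757

def optimal_allocation(C: float) -> tuple[float, float]:
    M_opt = M_BASE * C ** A     # non-embedding FLOPs per token
    D_opt = D_BASE * C ** B     # number of training tokens
    return M_opt, D_opt

M, D = optimal_allocation(3e22)                  # arbitrary illustrative budget
print(f"M_opt ~ {M:.3e} FLOPs/token, D_opt ~ {D:.3e} tokens, M*D ~ {M * D:.3e}")
```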
3.3 Scaling Laws with Different Data
- Analyzed impact of different datasets on scaling laws
- Higher quality data leads to increased model scaling exponent and decreased data scaling exponent
- Suggests allocating more compute budget to model scaling for high-quality data
4. Alignment
- Collected 1.5 million instruction data instances in English and Chinese
- Supervised Fine-Tuning (SFT):
- 7B model: 4 epochs
- 67B model: 2 epochs (to avoid overfitting)
- Learning rates: 1e-5 (7B) and 5e-6 (67B)
- Monitored benchmark accuracy and repetition ratio
- Direct Preference Optimization (DPO):
- Constructed preference data for helpfulness and harmlessness
- Trained for one epoch with learning rate 5e-6 and batch size 512
- Used learning rate warmup and cosine learning rate scheduler
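A hedged PyTorch sketch of the standard DPO objective (Rafailov et al., 2023) on summed per-response log-probabilities; the β value and the random tensors standing in for model outputs are illustrative, and the source does not state DeepSeek's settings beyond those listed above.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO loss on per-response log-probabilities (beta is illustrative)."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Encourage the policy to widen the chosen-vs-rejected margin
    # relative to the frozen reference model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage: random log-probabilities standing in for model outputs.
batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch), torch.randn(batch))
print(loss.item())
```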
5. Evaluation
5.1 Public Benchmark Evaluation
- Evaluated on various English and Chinese benchmarks
- Used perplexity-based and generation-based evaluation methods
- DeepSeek models showed performance comparable to LLaMA-2 on English benchmarks
- DeepSeek 67B achieved better results on MATH, GSM8K, HumanEval, MBPP, BBH, and the Chinese benchmarks (a likelihood-scoring sketch for the perplexity-based benchmarks follows)
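For the perplexity-based benchmarks, each candidate answer is scored by the model's likelihood of its continuation; the sketch below shows one common way to do this with Hugging Face transformers. The checkpoint id and the example question are placeholders, and length normalization and tokenization edge cases are omitted for brevity.

```python
# Sketch of likelihood-based multiple-choice scoring (checkpoint id and the
# example question are placeholders; length normalization is omitted).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-llm-7b-base"       # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

question = "Q: Which gas do plants absorb for photosynthesis?\nA:"
options = [" Oxygen", " Carbon dioxide", " Nitrogen", " Hydrogen"]

def option_logprob(prompt: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    # Token at position i+1 is predicted by the logits at position i.
    return sum(logprobs[i, full_ids[0, i + 1]].item()
               for i in range(prompt_len - 1, full_ids.shape[1] - 1))

best = max(options, key=lambda o: option_logprob(question, o))
print(best)
```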
5.2 Open-Ended Evaluation
- Chinese: Used the AlignBench test set
- DeepSeek 67B Chat outperformed ChatGPT and other baseline models
- DPO model showed improvement across almost all metrics
- English: Used MT-Bench benchmark
- DeepSeek LLM 67B Chat outperformed other open-source models
- Achieved score comparable to GPT-3.5-turbo
- DPO stage further improved average score to 8.76
5.3 Held-Out Evaluation
- LeetCode: Used problems from Weekly Contest (July 2023 to Nov 2023)
- Hungarian National High-School Exam: 33 problems, human-annotated
- Instruction Following Evaluation: Used the instruction-following evaluation dataset released by Google
- Results showed significant performance gap between large and small models on held-out datasets
5.4 Safety Evaluation
- Established 20-person expert team for safety content classification
- Constructed 2400 safety test questions
- Evaluated DeepSeek 67B Chat model
- Model exhibited good safety performance across safety test categories
- Used "Do-Not-Answer" dataset for additional evaluation
- DeepSeek 67B Chat achieved a score of 97.8, higher than both ChatGPT and GPT-4
5.5 Discussion
- Staged Fine-Tuning: Implemented a two-stage fine-tuning process for the 7B model (all data first, then conversational data)
- Multiple-Choice Questions: Tested adding 20 million Chinese multiple-choice questions
- Instruction Data in Pre-Training: Integrated 5 million instruction instances into the final 10% of pre-training
- System Prompt: Observed improved results with system prompt for 67B model
6. Conclusion, Limitation, and Future Work
- Introduced DeepSeek LLMs trained on 2 trillion tokens in English and Chinese
- Calibrated scaling laws and proposed new optimal model/data scaling-up allocation strategy
- Provided comprehensive evaluation of models
- Acknowledged limitations: lack of ongoing knowledge updates, potential for non-factual information, and hallucinations
- Future work:
- Release technical reports on code intelligence and Mixture-of-Experts (MoE)
- Construct larger and improved dataset for next version
- Study ways to deliver helpful, honest, and safe models
- Initial experiments show reinforcement learning could boost complex reasoning capability