DeepSeek-Coder: When the Large Language Model Meets Programming
Abstract
- Open-source code models with sizes from 1.3B to 33B
- Trained from scratch on 2 trillion tokens
- Pre-trained on high-quality project-level code corpus
- Employs a fill-in-the-blank task with a 16K context window
- Achieves state-of-the-art performance among open-source code models
- Surpasses existing closed-source models like Codex and GPT-3.5
- Under permissive license for research and commercial use
1. Introduction
- Transformation of software development by large language models
- Challenge: Performance gap between open-source and closed-source models
- DeepSeek-Coder series introduced to address this challenge
- Models trained from scratch on 2 trillion tokens from 87 programming languages
- Pre-training data organized at repository level
- Incorporation of Fill-In-Middle (FIM) approach
- Context length extended to 16K
- Comprehensive experiments conducted on public code-related benchmarks
- DeepSeek-Coder-Base 33B outperforms existing open-source models
- DeepSeek-Coder-Instruct 33B surpasses OpenAI GPT-3.5 Turbo in most benchmarks
Main Contributions
- Introduction of DeepSeek-Coder-Base and DeepSeek-Coder-Instruct
- Incorporation of repository-level data construction in pre-training
- Analysis of the impact of FIM training strategies on code models
- Extensive evaluations against various benchmarks
2. Data Collection
Training Dataset Composition
- 87% source code
- 10% English code-related natural language corpus
- 3% code-unrelated Chinese natural language corpus
Data Creation Procedure
- Data crawling
- Rule-based filtering
- Dependency parsing
- Repository-level deduplication
- Quality screening
2.1. GitHub Data Crawling and Filtering
- Collection of public repositories created before February 2023
- Retention of 87 programming languages
- Application of filtering rules similar to StarCoder project
- Reduction of data to 32.8% of original size
2.2. Dependency Parsing
- Consideration of dependencies between different files in a project
- Use of a topological sort algorithm for dependency analysis (sketched below)
- Arrangement of files based on dependencies
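A minimal sketch of dependency-ordered file arrangement, assuming dependencies have already been extracted as a file-to-file mapping (e.g. from import/include statements); the paper's actual parser and tie-breaking rules are not reproduced here.

```python
from collections import defaultdict, deque

def topological_order(files, deps):
    """Order a repository's files so each file follows the files it depends on.

    files: list of file paths in one repository.
    deps:  dict mapping each file to the set of same-repo files it imports.
    """
    indegree = {f: 0 for f in files}
    dependents = defaultdict(list)
    for f, required in deps.items():
        for r in required:
            dependents[r].append(f)
            indegree[f] += 1

    queue = deque(f for f in files if indegree[f] == 0)
    ordered = []
    while queue:
        f = queue.popleft()
        ordered.append(f)
        for d in dependents[f]:
            indegree[d] -= 1
            if indegree[d] == 0:
                queue.append(d)

    # Files caught in dependency cycles are appended in their original order.
    placed = set(ordered)
    ordered += [f for f in files if f not in placed]
    return ordered

# utils.py is placed before main.py in the concatenated training sample.
print(topological_order(["main.py", "utils.py"],
                        {"main.py": {"utils.py"}, "utils.py": set()}))
```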
2.3. Repo-Level Deduplication
- Near-deduplication performed at repository level
- Concatenated code from repository level treated as a single sample
- Ensures integrity of repository structure
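A minimal sketch of near-deduplication at repository granularity. For clarity it uses exact Jaccard similarity over word shingles; a production pipeline would typically use MinHash/LSH, and the 0.85 threshold is an assumed value, not the paper's.

```python
import re

def shingles(text, n=5):
    """Lowercased word n-grams used as the unit of similarity."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / max(len(a | b), 1)

def near_dedup_repos(repos, threshold=0.85):
    """repos: list of strings, each the concatenation of one repository's files.

    A repository is kept only if it is not near-identical to one already kept,
    so whole repositories survive or are dropped as a unit.
    """
    kept, kept_sigs = [], []
    for repo in repos:
        sig = shingles(repo)
        if all(jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(repo)
            kept_sigs.append(sig)
    return kept
```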
2.4. Quality Screening and Decontamination
- Use of a compiler and a quality model combined with heuristic rules
- Filtering of low-quality data (syntax errors, poor readability, low modularity)
- N-gram filtering to avoid contamination from test sets (sketched below)
- Exclusion of files containing docstrings, questions, or solutions from evaluation sets such as HumanEval, MBPP, GSM8K, and MATH
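A minimal sketch of n-gram decontamination against benchmark texts; the 10-gram granularity and whitespace tokenization are assumptions for illustration, not the paper's exact matching rules.

```python
def ngrams(text, n=10):
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(training_file, benchmark_texts, n=10):
    """Flag a training file that shares any n-gram with a benchmark problem,
    e.g. a HumanEval docstring or an MBPP question/solution."""
    file_grams = ngrams(training_file, n)
    return any(file_grams & ngrams(b, n) for b in benchmark_texts)
```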
3. Training Policy
3.1. Training Strategy
3.1.1. Next Token Prediction
- Concatenation of files to form fixed-length training entries
- Training model to predict subsequent token based on provided context
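A minimal sketch of packing tokenized, dependency-ordered files into fixed-length training entries for next-token prediction; the entry length and end-of-file token id are placeholder values.

```python
def pack_into_entries(token_streams, entry_length=4096, eos_id=2):
    """Concatenate tokenized files and cut the stream into equal-length entries.

    token_streams: list of token-id lists, one per file, in dependency order.
    The trailing partial entry is dropped for simplicity.
    """
    flat = []
    for tokens in token_streams:
        flat.extend(tokens + [eos_id])  # separate files with an end-of-file id
    return [flat[i:i + entry_length]
            for i in range(0, len(flat) - entry_length + 1, entry_length)]

# At training time, the model predicts token t+1 from tokens 0..t of each
# entry, with cross-entropy loss averaged over all positions.
```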
3.1.2. Fill-in-the-Middle (FIM)
- Random splitting of each document into three parts: prefix, middle, and suffix
- Rearrangement of the parts, joined with special sentinel tokens
- Two modes: PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle)
- Enhances the model's ability to complete code from surrounding context (infilling) rather than only left-to-right
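A minimal sketch of building a PSM-format training example; the sentinel token names are illustrative stand-ins, since the exact special tokens in DeepSeek-Coder's vocabulary are spelled differently.

```python
import random

FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def to_psm_example(code, rng=random):
    """Split a document into prefix/middle/suffix and emit a PSM-format string."""
    i, j = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    # PSM: the model sees prefix then suffix, and learns to generate the middle.
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

print(to_psm_example("def add(a, b):\n    return a + b\n", random.Random(0)))
```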
3.2. Tokenizer
- Use of the HuggingFace Tokenizers library
- Training of Byte Pair Encoding (BPE) tokenizers
- Vocabulary size of 32,000
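A minimal sketch of training a byte-level BPE tokenizer with the HuggingFace `tokenizers` library; the corpus path and special tokens are placeholders, not the exact DeepSeek-Coder configuration.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,            # matches the vocabulary size stated above
    special_tokens=["<|EOT|>"],  # illustrative; the real special-token set is larger
)
tokenizer.train(files=["corpus/code_subset.txt"], trainer=trainer)  # placeholder path
tokenizer.save("bpe-32k.json")
```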
3.3. Model Architecture
- Range of models with 1.3B, 6.7B, and 33B parameters
- Built on DeepSeek Large Language Model (LLM) framework
- Decoder-only Transformer with Rotary Position Embedding (RoPE)
- Integration of Grouped-Query Attention (GQA) in the 33B model (sketched below)
- Use of FlashAttention v2 for attention mechanism computation
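A minimal sketch of grouped-query attention, in which many query heads share a smaller set of key/value heads to cut KV-cache memory; causal masking, RoPE, and the 33B model's actual head counts are omitted.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    group_size = q.shape[1] // k.shape[1]
    # Repeat each KV head so every query head in its group can attend to it.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(1, 32, 16, 64)   # 32 query heads (illustrative)
k = torch.randn(1, 8, 16, 64)    # 8 shared key/value heads -> group size 4
v = torch.randn(1, 8, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 32, 16, 64])
```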
3.4. Optimization
- Use of AdamW optimizer
- Adaptation of batch sizes and learning rates
- Three-stage learning rate scheduling policy
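A minimal sketch of a warmup-plus-three-stage step-decay schedule of the kind described above; the peak learning rate, warmup length, stage boundaries, and per-stage decay factor are assumed values, not the paper's exact settings.

```python
import math

def three_stage_lr(step, total_steps, peak_lr=3e-4, warmup_steps=2000):
    """Linear warmup, then three constant stages with step decay between them.

    All hyperparameters here are illustrative assumptions.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay = math.sqrt(0.1)              # assumed per-stage decay factor
    if step < 0.8 * total_steps:        # assumed stage boundaries
        return peak_lr
    if step < 0.9 * total_steps:
        return peak_lr * decay
    return peak_lr * decay ** 2

# The returned value is fed to AdamW via the optimizer's param-group learning rate.
```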
3.5. Environments
- Use of HAI-LLM framework
- Incorporation of tensor parallelism, ZeRO data parallelism, and pipeline parallelism
- Experiments conducted on clusters with NVIDIA A100 and H800 GPUs
3.6. Long Context
- Reconfiguration of RoPE parameters to extend default context window
- Linear scaling strategy applied to RoPE (scaling factor raised from 1 to 4, base frequency from 10,000 to 100,000; see sketch below)
- Theoretical processing of up to 64K tokens in context
- Reliable outputs within 16K token range
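A minimal sketch of linearly scaled RoPE frequencies for context extension; the base of 100,000 and scaling factor of 4 follow the figures cited above, but other details (further-training schedule, exact application to queries and keys) are not reproduced.

```python
import torch

def scaled_rope_angles(head_dim, positions, base=100_000.0, scaling_factor=4.0):
    """Return RoPE rotation angles with linear position scaling."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    scaled_positions = positions.float() / scaling_factor    # linear scaling
    return torch.outer(scaled_positions, inv_freq)           # (seq, head_dim // 2)

angles = scaled_rope_angles(128, torch.arange(16_384))
cos, sin = angles.cos(), angles.sin()   # used to rotate query/key vectors
```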
3.7. Instruction Tuning
- Development of DeepSeek-Coder-Instruct through fine-tuning
- Use of high-quality instruction data structured in the Alpaca instruction format
- Unique delimiter token <|EOT|> for dialogue turn demarcation
- Training with a cosine learning-rate schedule
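A minimal sketch of Alpaca-style dialogue formatting with <|EOT|> closing each model turn; the exact system prompt and spacing used by DeepSeek-Coder-Instruct may differ.

```python
def format_dialogue(turns):
    """turns: list of (instruction, response) pairs for one conversation."""
    parts = []
    for instruction, response in turns:
        parts.append(f"### Instruction:\n{instruction}\n")
        parts.append(f"### Response:\n{response}\n<|EOT|>\n")
    return "".join(parts)

print(format_dialogue([
    ("Write a Python function that reverses a string.",
     "def reverse(s):\n    return s[::-1]"),
]))
```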
4. Experimental Results
4.1. Code Generation
- Evaluation on HumanEval and MBPP benchmarks
- Expansion of HumanEval to seven additional programming languages
- DeepSeek-Coder-Base 33B achieves state-of-the-art performance
- DeepSeek-Coder-Base 6.7B performs on par with 34B parameter CodeLlama
- DeepSeek-Coder-Instruct 33B outperforms OpenAI GPT-3.5 Turbo
4.2. Fill-in-the-Middle Code Completion
- Evaluation on Single-Line Infilling benchmarks
- DeepSeek-Coder outperforms larger counterparts
- Correlation between model size and performance observed
- Recommendation of DeepSeek-Coder-Base 6.7B for code completion tools
4.3. Cross-File Code Completion
- Evaluation using CrossCodeEval dataset
- DeepSeek-Coder consistently outperforms other models
- Effectiveness of repository-level pre-training demonstrated
4.4. Program-based Math Reasoning
- Evaluation using the Program-Aided Math Reasoning (PAL) method, where the model writes programs that are executed to produce answers (sketched below)
- Assessment across seven distinct benchmarks
- DeepSeek-Coder models achieve remarkable performance
- Potential for complex mathematical computations and problem-solving demonstrated
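A minimal sketch of the program-aided approach: the model writes a Python program for a word problem and the answer comes from executing it. `generate` is a placeholder for a call to the code model, and real pipelines sandbox the execution step.

```python
def solve_with_pal(question, generate):
    prompt = (
        "# Write a Python function solution() that returns the numeric answer.\n"
        f"# Problem: {question}\n"
    )
    program = generate(prompt)      # model-written program (placeholder call)
    namespace = {}
    exec(program, namespace)        # execute it; sandbox this in real use
    return namespace["solution"]()

# Stand-in "model" returning a fixed program, just to show the control flow:
fake_generate = lambda _prompt: "def solution():\n    return (16 - 3 - 4) * 2"
print(solve_with_pal("A toy arithmetic word problem.", fake_generate))  # 18
```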
5. Continue Pre-Training From General LLM
- Additional pre-training from DeepSeek-LLM-7B Base on 2 trillion tokens
- Resulting in DeepSeek-Coder-v1.5 7B
- Pre-training data drawn from a mixture of source code and natural language sources
- Comparison between DeepSeek-Coder-v1.5 7B and DeepSeek-Coder 6.7B
- Evaluation across programming, math reasoning, and natural language tasks
- DeepSeek-Coder-Base-v1.5 shows improvements in most tasks, especially in math reasoning and natural language processing
6. Conclusion
- Introduction of DeepSeek-Coder series with 1.3B, 6.7B, and 33B parameters
- Trained on meticulously curated project-level code corpus
- Use of "fill-in-the-blank" pre-training objective
- Extension of context window to 16,384 tokens
- DeepSeek-Coder-Base 33B surpasses existing open-source code models
- DeepSeek-Coder-Base 6.7B performs on par with 34B parameter CodeLlama
- DeepSeek-Coder-Instruct 33B outperforms OpenAI GPT-3.5 Turbo in coding-related tasks
- Development of DeepSeek-Coder-v1.5 with improved natural language understanding
- Future commitment to develop and share more powerful code-focused LLMs