DeepSeek-Coder: When the Large Language Model Meets Programming
Abstract
- Open-source code models with sizes from 1.3B to 33B
- Trained from scratch on 2 trillion tokens
- Pre-trained on high-quality project-level code corpus
- Employs a fill-in-the-blank task with a 16K context window
- Achieves state-of-the-art performance among open-source code models
- Surpasses existing closed-source models like Codex and GPT-3.5
- Under permissive license for research and commercial use
1. Introduction
- Transformation of software development by large language models
- Challenge: Performance gap between open-source and closed-source models
- DeepSeek-Coder series introduced to address this challenge
- Models trained from scratch on 2 trillion tokens from 87 programming languages
- Pre-training data organized at repository level
- Incorporation of Fill-In-Middle (FIM) approach
- Context length extended to 16K
- Comprehensive experiments conducted on public code-related benchmarks
- DeepSeek-Coder-Base 33B outperforms existing open-source models
- DeepSeek-Coder-Instruct 33B surpasses OpenAI GPT-3.5 Turbo in most benchmarks
Main Contributions
- Introduction of DeepSeek-Coder-Base and DeepSeek-Coder-Instruct
- Incorporation of repository-level data construction in pre-training
- Analysis of the impact of FIM training strategies on code models
- Extensive evaluations against various benchmarks
2. Data Collection
Training Dataset Composition
- 87% source code
- 10% English code-related natural language corpus
- 3% code-unrelated Chinese natural language corpus
Data Creation Procedure
- Data crawling
- Rule-based filtering
- Dependency parsing
- Repository-level deduplication
- Quality screening
2.1. GitHub Data Crawling and Filtering
- Collection of public repositories created before February 2023
- Retention of 87 programming languages
- Application of filtering rules similar to StarCoder project
- Reduction of data to 32.8% of original size
2.2. Dependency Parsing
- Consideration of dependencies between different files in a project
- Use of a topological sort algorithm for dependency analysis (sketched below)
- Arrangement of files based on dependencies
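A minimal sketch of dependency-ordered file arrangement, assuming dependencies have already been extracted as a file-to-file mapping (e.g. from import/include statements); the paper's actual parser and tie-breaking rules are not reproduced here.

```python
from collections import defaultdict, deque

def topological_order(files, deps):
    """Order a repository's files so each file follows the files it depends on.

    files: list of file paths in one repository.
    deps:  dict mapping each file to the set of same-repo files it imports.
    """
    indegree = {f: 0 for f in files}
    dependents = defaultdict(list)
    for f, required in deps.items():
        for r in required:
            dependents[r].append(f)
            indegree[f] += 1

    queue = deque(f for f in files if indegree[f] == 0)
    ordered = []
    while queue:
        f = queue.popleft()
        ordered.append(f)
        for d in dependents[f]:
            indegree[d] -= 1
            if indegree[d] == 0:
                queue.append(d)

    # Files caught in dependency cycles are appended in their original order.
    placed = set(ordered)
    ordered += [f for f in files if f not in placed]
    return ordered

# utils.py is placed before main.py in the concatenated training sample.
print(topological_order(["main.py", "utils.py"],
                        {"main.py": {"utils.py"}, "utils.py": set()}))
```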
2.3. Repo-Level Deduplication
- Near-deduplication performed at repository level
- Concatenated code from repository level treated as a single sample
- Ensures integrity of repository structure
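A minimal sketch of near-deduplication at repository granularity. For clarity it uses exact Jaccard similarity over word shingles; a production pipeline would typically use MinHash/LSH, and the 0.85 threshold is an assumed value, not the paper's.

```python
import re

def shingles(text, n=5):
    """Lowercased word n-grams used as the unit of similarity."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / max(len(a | b), 1)

def near_dedup_repos(repos, threshold=0.85):
    """repos: list of strings, each the concatenation of one repository's files.

    A repository is kept only if it is not near-identical to one already kept,
    so whole repositories survive or are dropped as a unit.
    """
    kept, kept_sigs = [], []
    for repo in repos:
        sig = shingles(repo)
        if all(jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(repo)
            kept_sigs.append(sig)
    return kept
```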
2.4. Quality Screening and Decontamination
- Use of a compiler and a quality model combined with heuristic rules
- Filtering of low-quality data (syntax errors, poor readability, low modularity)
- N-gram filtering to avoid contamination from test sets (sketched below)
- Exclusion of files containing docstrings, questions, or solutions from evaluation sets such as HumanEval, MBPP, GSM8K, and MATH
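A minimal sketch of n-gram decontamination against benchmark texts; the 10-gram granularity and whitespace tokenization are assumptions for illustration, not the paper's exact matching rules.

```python
def ngrams(text, n=10):
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(training_file, benchmark_texts, n=10):
    """Flag a training file that shares any n-gram with a benchmark problem,
    e.g. a HumanEval docstring or an MBPP question/solution."""
    file_grams = ngrams(training_file, n)
    return any(file_grams & ngrams(b, n) for b in benchmark_texts)
```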
3. Training Policy
3.1. Training Strategy
3.1.1. Next Token Prediction
- Concatenation of files to form fixed-length training entries
- Training model to predict subsequent token based on provided context
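A minimal sketch of packing tokenized, dependency-ordered files into fixed-length training entries for next-token prediction; the entry length and end-of-file token id are placeholder values.

```python
def pack_into_entries(token_streams, entry_length=4096, eos_id=2):
    """Concatenate tokenized files and cut the stream into equal-length entries.

    token_streams: list of token-id lists, one per file, in dependency order.
    The trailing partial entry is dropped for simplicity.
    """
    flat = []
    for tokens in token_streams:
        flat.extend(tokens + [eos_id])  # separate files with an end-of-file id
    return [flat[i:i + entry_length]
            for i in range(0, len(flat) - entry_length + 1, entry_length)]

# At training time, the model predicts token t+1 from tokens 0..t of each
# entry, with cross-entropy loss averaged over all positions.
```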
3.1.2. Fill-in-the-Middle (FIM)
- Random splitting of each document into three parts: prefix, middle, and suffix
- Rearrangement of the parts, joined with special sentinel tokens
- Two modes: PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle)
- Enhances the model's ability to complete code from surrounding context (infilling) rather than only left-to-right
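A minimal sketch of building a PSM-format training example; the sentinel token names are illustrative stand-ins, since the exact special tokens in DeepSeek-Coder's vocabulary are spelled differently.

```python
import random

FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def to_psm_example(code, rng=random):
    """Split a document into prefix/middle/suffix and emit a PSM-format string."""
    i, j = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    # PSM: the model sees prefix then suffix, and learns to generate the middle.
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

print(to_psm_example("def add(a, b):\n    return a + b\n", random.Random(0)))
```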
3.2. Tokenizer
- Use of the HuggingFace Tokenizers library
- Training of Byte Pair Encoding (BPE) tokenizers
- Vocabulary size of 32,000
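A minimal sketch of training a byte-level BPE tokenizer with the HuggingFace `tokenizers` library; the corpus path and special tokens are placeholders, not the exact DeepSeek-Coder configuration.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,            # matches the vocabulary size stated above
    special_tokens=["<|EOT|>"],  # illustrative; the real special-token set is larger
)
tokenizer.train(files=["corpus/code_subset.txt"], trainer=trainer)  # placeholder path
tokenizer.save("bpe-32k.json")
```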
3.3. Model Architecture
- Range of models with 1.3B, 6.7B, and 33B parameters
- Built on DeepSeek Large Language Model (LLM) framework
- Decoder-only Transformer with Rotary Position Embedding (RoPE)
- Integration of Grouped-Query Attention (GQA) in the 33B model (sketched below)
- Use of FlashAttention v2 for attention mechanism computation
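A minimal sketch of grouped-query attention, in which many query heads share a smaller set of key/value heads to cut KV-cache memory; causal masking, RoPE, and the 33B model's actual head counts are omitted.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    group_size = q.shape[1] // k.shape[1]
    # Repeat each KV head so every query head in its group can attend to it.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(1, 32, 16, 64)   # 32 query heads (illustrative)
k = torch.randn(1, 8, 16, 64)    # 8 shared key/value heads -> group size 4
v = torch.randn(1, 8, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 32, 16, 64])
```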
3.4. Optimization
- Use of AdamW optimizer
- Adaptation of batch sizes and learning rates
- Three-stage learning rate scheduling policy
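A minimal sketch of a warmup-plus-three-stage step-decay schedule of the kind described above; the peak learning rate, warmup length, stage boundaries, and per-stage decay factor are assumed values, not the paper's exact settings.

```python
import math

def three_stage_lr(step, total_steps, peak_lr=3e-4, warmup_steps=2000):
    """Linear warmup, then three constant stages with step decay between them.

    All hyperparameters here are illustrative assumptions.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay = math.sqrt(0.1)              # assumed per-stage decay factor
    if step < 0.8 * total_steps:        # assumed stage boundaries
        return peak_lr
    if step < 0.9 * total_steps:
        return peak_lr * decay
    return peak_lr * decay ** 2

# The returned value is fed to AdamW via the optimizer's param-group learning rate.
```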
3.5. Environments
- Use of HAI-LLM framework
- Incorporation of tensor parallelism, ZeRO data parallelism, and pipeline parallelism
- Experiments conducted on clusters with NVIDIA A100 and H800 GPUs
3.6. Long Context
- Reconfiguration of RoPE parameters to extend default context window
- Linear scaling strategy applied to RoPE (scaling factor raised from 1 to 4, base frequency from 10,000 to 100,000; see sketch below)
- Theoretical processing of up to 64K tokens in context
- Reliable outputs within 16K token range
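A minimal sketch of linearly scaled RoPE frequencies for context extension; the base of 100,000 and scaling factor of 4 follow the figures cited above, but other details (further-training schedule, exact application to queries and keys) are not reproduced.

```python
import torch

def scaled_rope_angles(head_dim, positions, base=100_000.0, scaling_factor=4.0):
    """Return RoPE rotation angles with linear position scaling."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    scaled_positions = positions.float() / scaling_factor    # linear scaling
    return torch.outer(scaled_positions, inv_freq)           # (seq, head_dim // 2)

angles = scaled_rope_angles(128, torch.arange(16_384))
cos, sin = angles.cos(), angles.sin()   # used to rotate query/key vectors
```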
3.7. Instruction Tuning
- Development of DeepSeek-Coder-Instruct through fine-tuning
- Use of high-quality instruction data structured in the Alpaca instruction format
- Unique delimiter token <|EOT|> for dialogue turn demarcation
- Training with a cosine learning-rate schedule
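A minimal sketch of Alpaca-style dialogue formatting with <|EOT|> closing each model turn; the exact system prompt and spacing used by DeepSeek-Coder-Instruct may differ.

```python
def format_dialogue(turns):
    """turns: list of (instruction, response) pairs for one conversation."""
    parts = []
    for instruction, response in turns:
        parts.append(f"### Instruction:\n{instruction}\n")
        parts.append(f"### Response:\n{response}\n<|EOT|>\n")
    return "".join(parts)

print(format_dialogue([
    ("Write a Python function that reverses a string.",
     "def reverse(s):\n    return s[::-1]"),
]))
```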
4. Experimental Results
4.1. Code Generation
- Evaluation on HumanEval and MBPP benchmarks
- Expansion of HumanEval to seven additional programming languages
- DeepSeek-Coder-Base 33B achieves state-of-the-art performance
- DeepSeek-Coder-Base 6.7B performs on par with 34B parameter CodeLlama
- DeepSeek-Coder-Instruct 33B outperforms OpenAI GPT-3.5 Turbo
4.2. Fill-in-the-Middle Code Completion
- Evaluation on Single-Line Infilling benchmarks
- DeepSeek-Coder outperforms larger counterparts
- Correlation between model size and performance observed
- Recommendation of DeepSeek-Coder-Base 6.7B for code completion tools
4.3. Cross-File Code Completion
- Evaluation using CrossCodeEval dataset
- DeepSeek-Coder consistently outperforms other models
- Effectiveness of repository-level pre-training demonstrated
4.4. Program-based Math Reasoning
- Evaluation using the Program-Aided Math Reasoning (PAL) method, where the model writes programs that are executed to produce answers (sketched below)
- Assessment across seven distinct benchmarks
- DeepSeek-Coder models achieve remarkable performance
- Potential for complex mathematical computations and problem-solving demonstrated
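A minimal sketch of the program-aided approach: the model writes a Python program for a word problem and the answer comes from executing it. `generate` is a placeholder for a call to the code model, and real pipelines sandbox the execution step.

```python
def solve_with_pal(question, generate):
    prompt = (
        "# Write a Python function solution() that returns the numeric answer.\n"
        f"# Problem: {question}\n"
    )
    program = generate(prompt)      # model-written program (placeholder call)
    namespace = {}
    exec(program, namespace)        # execute it; sandbox this in real use
    return namespace["solution"]()

# Stand-in "model" returning a fixed program, just to show the control flow:
fake_generate = lambda _prompt: "def solution():\n    return (16 - 3 - 4) * 2"
print(solve_with_pal("A toy arithmetic word problem.", fake_generate))  # 18
```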
5. Continue Pre-Training From General LLM
- Additional pre-training from DeepSeek-LLM-7B Base on 2 trillion tokens
- Resulting in DeepSeek-Coder-v1.5 7B
- Pre-training data drawn from a mixture of source code and natural language sources
- Comparison between DeepSeek-Coder-v1.5 7B and DeepSeek-Coder 6.7B
- Evaluation across programming, math reasoning, and natural language tasks
- DeepSeek-Coder-Base-v1.5 shows improvements in most tasks, especially in math reasoning and natural language processing
6. Conclusion
- Introduction of DeepSeek-Coder series with 1.3B, 6.7B, and 33B parameters
- Trained on meticulously curated project-level code corpus
- Use of "fill-in-the-blank" pre-training objective
- Extension of context window to 16,384 tokens
- DeepSeek-Coder-Base 33B surpasses existing open-source code models
- DeepSeek-Coder-Base 6.7B performs on par with 34B parameter CodeLlama
- DeepSeek-Coder-Instruct 33B outperforms OpenAI GPT-3.5 Turbo in coding-related tasks
- Development of DeepSeek-Coder-v1.5 with improved natural language understanding
- Future commitment to develop and share more powerful code-focused LLMs