DeepSeek-Coder: When the Large Language Model Meets Programming

Abstract

  • Open-source code models with sizes from 1.3B to 33B
  • Trained from scratch on 2 trillion tokens
  • Pre-trained on high-quality project-level code corpus
  • Employs a fill-in-the-blank task with a 16K context window
  • Achieves state-of-the-art performance among open-source code models
  • Surpasses existing closed-source models like Codex and GPT-3.5
  • Under permissive license for research and commercial use

1. Introduction

  • Transformation of software development by large language models
  • Challenge: Performance gap between open-source and closed-source models
  • DeepSeek-Coder series introduced to address this challenge
  • Models trained from scratch on 2 trillion tokens from 87 programming languages
  • Pre-training data organized at repository level
  • Incorporation of the Fill-in-the-Middle (FIM) approach
  • Context length extended to 16K
  • Comprehensive experiments conducted on public code-related benchmarks
  • DeepSeek-Coder-Base 33B outperforms existing open-source models
  • DeepSeek-Coder-Instruct 33B surpasses OpenAI GPT-3.5 Turbo in most benchmarks

Main Contributions

  1. Introduction of DeepSeek-Coder-Base and DeepSeek-Coder-Instruct
  2. Incorporation of repository-level data construction in pre-training
  3. Analysis of the impact of FIM training strategies on code models
  4. Extensive evaluations against various benchmarks

2. Data Collection

Training Dataset Composition

  • 87% source code
  • 10% English code-related natural language corpus
  • 3% code-unrelated Chinese natural language corpus

Data Creation Procedure

  1. Data crawling
  2. Rule-based filtering
  3. Dependency parsing
  4. Repository-level deduplication
  5. Quality screening

2.1. GitHub Data Crawling and Filtering

  • Collection of public repositories created before February 2023
  • Retention of 87 programming languages
  • Application of filtering rules similar to those of the StarCoder project (see the sketch after this list)
  • Reduction of data to 32.8% of original size
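
The exact filter thresholds are not restated in this summary; below is a minimal Python sketch of StarCoder-style rule-based filtering, where `keep_file` and all threshold values are illustrative assumptions rather than the project's published settings.

```python
def keep_file(text: str, max_avg_line_len: int = 100, max_line_len: int = 1000,
              min_alpha_frac: float = 0.25) -> bool:
    """Heuristic file filter in the spirit of StarCoder-style rules.

    All thresholds are illustrative assumptions, not the values used in the paper.
    """
    lines = text.splitlines()
    if not lines:
        return False
    avg_len = sum(len(l) for l in lines) / len(lines)
    if avg_len > max_avg_line_len:                  # drop files with very long average lines
        return False
    if max(len(l) for l in lines) > max_line_len:   # drop minified / generated files
        return False
    alpha_frac = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_frac < min_alpha_frac:                 # drop files dominated by data or symbols
        return False
    return True
```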

2.2. Dependency Parsing

  • Consideration of dependencies between different files in a project
  • Use of a topological sort algorithm for dependency analysis (see the sketch after this list)
  • Arrangement of files based on dependencies
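
As a rough illustration of the dependency-ordering step (not the paper's actual parser), the sketch below topologically sorts a repository's files so that each file appears after the files it depends on; building the dependency map itself, e.g. by scanning import statements, is assumed to happen upstream.

```python
from collections import defaultdict, deque

def topo_order(files: list[str], deps: dict[str, set[str]]) -> list[str]:
    """Order files so every file comes after the files it depends on.

    `deps[f]` is the set of files that `f` imports/includes; producing this
    graph (e.g. via import-statement scanning) is assumed to happen upstream.
    """
    indegree = {f: 0 for f in files}
    dependents = defaultdict(list)
    for f, ds in deps.items():
        for d in ds:
            if d in indegree:
                dependents[d].append(f)
                indegree[f] += 1
    queue = deque(f for f in files if indegree[f] == 0)
    order = []
    while queue:
        f = queue.popleft()
        order.append(f)
        for g in dependents[f]:
            indegree[g] -= 1
            if indegree[g] == 0:
                queue.append(g)
    # Files left over belong to dependency cycles; keep them in original order.
    seen = set(order)
    order.extend(f for f in files if f not in seen)
    return order

# Example: main.py imports utils.py, so utils.py is placed first.
print(topo_order(["main.py", "utils.py"], {"main.py": {"utils.py"}}))
# -> ['utils.py', 'main.py']
```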

2.3. Repo-Level Deduplication

  • Near-deduplication performed at the repository level (see the sketch after this list)
  • Concatenated code of an entire repository treated as a single sample
  • Ensures integrity of repository structure
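
The paper describes near-deduplication at the repository level without restating the algorithm; the sketch below uses exact n-gram Jaccard similarity for clarity, with the 0.85 threshold and the `near_dedup` helper as assumptions (a production pipeline would approximate this, e.g. with MinHash/LSH).

```python
def repo_shingles(concatenated_repo: str, n: int = 5) -> set[tuple[str, ...]]:
    """Whitespace-token n-gram shingles of one repository, with all of its
    files concatenated into a single string (the unit of deduplication)."""
    toks = concatenated_repo.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / max(len(a | b), 1)

def near_dedup(repos: dict[str, str], threshold: float = 0.85) -> list[str]:
    """Keep one representative of each group of near-duplicate repositories.

    Exact pairwise Jaccard is used here for clarity; the threshold is an
    illustrative assumption, not the paper's setting."""
    kept: list[str] = []
    kept_shingles: list[set] = []
    for name, text in repos.items():
        sh = repo_shingles(text)
        if all(jaccard(sh, other) < threshold for other in kept_shingles):
            kept.append(name)
            kept_shingles.append(sh)
    return kept
```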

2.4. Quality Screening and Decontamination

  • Use of compiler and quality model with heuristic rules
  • Filtering of low-quality data (syntax errors, poor readability, low modularity)
  • N-gram filtering process to avoid contamination from test sets (see the sketch after this list)
  • Exclusion of files containing docstrings, questions, and solutions from sources such as HumanEval, MBPP, GSM8K, and MATH
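
A minimal sketch of an n-gram decontamination filter of the kind described above; the 10-gram size, the exact-match fallback for short snippets, and the `is_contaminated` helper are illustrative assumptions.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(doc: str, test_snippets: list[str], n: int = 10) -> bool:
    """Flag a training document that shares any word n-gram with a test snippet.

    n = 10 is an illustrative choice; snippets too short for n-grams fall back
    to exact containment.
    """
    doc_grams = ngrams(doc, n)
    for snippet in test_snippets:
        toks = snippet.split()
        if len(toks) >= n:
            if ngrams(snippet, n) & doc_grams:
                return True
        elif snippet in doc:
            return True
    return False
```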

3. Training Policy

3.1. Training Strategy

3.1.1. Next Token Prediction

  • Concatenation of files to form fixed-length training entries (see the sketch after this list)
  • Model trained to predict the next token given the preceding context
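
A minimal sketch of packing tokenized files into fixed-length next-token-prediction entries; the 16K sequence length follows the context window quoted in the paper, while `pack_entries` and the EOS id are illustrative assumptions.

```python
from itertools import chain

def pack_entries(tokenized_files: list[list[int]], seq_len: int = 16384,
                 eos_id: int = 0) -> list[list[int]]:
    """Concatenate tokenized files (each terminated by EOS) into one stream and
    cut it into fixed-length entries for next-token prediction."""
    stream = list(chain.from_iterable(f + [eos_id] for f in tokenized_files))
    return [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]

# During training, labels are the inputs shifted left by one position:
#   inputs  = entry[:-1]
#   targets = entry[1:]
```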

3.1.2. Fill-in-the-Middle (FIM)

  • Random division of text into three parts
  • Reordering of the parts, joined with special sentinel tokens
  • Two modes: PSM (Prefix-Suffix-Middle) and SPM (Suffix-Prefix-Middle); see the sketch after this list
  • Enhances model's capability to handle various structural arrangements in code
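
A minimal sketch of the FIM transformation under the PSM and SPM orderings; the sentinel strings, the 50% FIM rate, and the character-level split are placeholders, not DeepSeek-Coder's exact tokens or settings.

```python
import random

def apply_fim(doc: str, fim_rate: float = 0.5, spm: bool = False) -> str:
    """Randomly split a document into (prefix, middle, suffix) and re-order it
    with sentinel tokens for fill-in-the-middle training.

    The sentinel strings and the 50% FIM rate are illustrative placeholders.
    """
    if len(doc) < 2 or random.random() > fim_rate:
        return doc                       # keep the document as plain next-token data
    a, b = sorted(random.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:a], doc[a:b], doc[b:]
    if spm:                              # SPM: suffix first, then prefix, then middle
        return f"<FIM_SUFFIX>{suffix}<FIM_PREFIX>{prefix}<FIM_MIDDLE>{middle}"
    # PSM: prefix, then suffix, then the middle the model must reconstruct
    return f"<FIM_PREFIX>{prefix}<FIM_SUFFIX>{suffix}<FIM_MIDDLE>{middle}"
```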

3.2. Tokenizer

  • Use of HuggingFace Tokenizer library
  • Training of Byte Pair Encoding (BPE) tokenizers (see the sketch after this list)
  • Vocabulary size of 32,000
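
A minimal sketch of training a byte-level BPE tokenizer with the HuggingFace `tokenizers` library at a 32,000-token vocabulary; the special tokens and the corpus iterator are placeholders, not DeepSeek-Coder's actual configuration.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE tokenizer with a 32,000-token vocabulary.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["<pad>", "<bos>", "<eos>", "<|EOT|>"],  # placeholder special tokens
)

def corpus_iterator():
    # Yield raw text documents sampled from the pre-training corpus (stub).
    yield "def add(a, b):\n    return a + b\n"

tokenizer.train_from_iterator(corpus_iterator(), trainer=trainer)
tokenizer.save("bpe_tokenizer_sketch.json")
```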

3.3. Model Architecture

  • Range of models with 1.3B, 6.7B, and 33B parameters
  • Built on DeepSeek Large Language Model (LLM) framework
  • Decoder-only Transformer with Rotary Position Embedding (RoPE)
  • Integration of Grouped-Query Attention (GQA) in the 33B model (see the sketch after this list)
  • Use of FlashAttention v2 for attention mechanism computation
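
To make the GQA bullet concrete, here is a minimal grouped-query attention sketch in PyTorch, in which several query heads share each key/value head; the projection weights, head counts, and the use of `scaled_dot_product_attention` are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads: int, n_kv_heads: int):
    """Minimal grouped-query attention: n_heads query heads share n_kv_heads
    key/value heads (n_heads must be a multiple of n_kv_heads).

    RoPE and projection biases are omitted to keep the sketch short.
    """
    B, T, D = x.shape
    head_dim = D // n_heads
    q = (x @ wq).view(B, T, n_heads, head_dim).transpose(1, 2)       # (B, H, T, d)
    k = (x @ wk).view(B, T, n_kv_heads, head_dim).transpose(1, 2)    # (B, Hkv, T, d)
    v = (x @ wv).view(B, T, n_kv_heads, head_dim).transpose(1, 2)
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)   # broadcast each KV head to its query group
    v = v.repeat_interleave(group, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(B, T, D)
```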

3.4. Optimization

  • Use of AdamW optimizer
  • Batch sizes and learning rates adapted to model scale following DeepSeek LLM
  • Three-stage learning-rate scheduling policy (see the sketch after this list)
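
A minimal sketch of a warm-up plus three-stage step-decay schedule of the kind described above; the warm-up length, the 80%/90% stage boundaries, and the sqrt(1/10) decay factor are assumptions for illustration, not the paper's exact values.

```python
import math

def three_stage_lr(step: int, total_steps: int, peak_lr: float,
                   warmup_steps: int = 2000) -> float:
    """Linear warm-up to the peak rate, then two step-downs at fixed fractions
    of training. Boundaries and decay factor are illustrative assumptions."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay = math.sqrt(0.1)
    progress = step / total_steps
    if progress < 0.8:
        return peak_lr                 # stage 1
    if progress < 0.9:
        return peak_lr * decay         # stage 2
    return peak_lr * decay * decay     # stage 3
```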

3.5. Environments

  • Use of HAI-LLM framework
  • Incorporation of tensor parallelism, ZeRO data parallelism, and PipeDream pipeline parallelism
  • Experiments conducted on clusters with NVIDIA A100 and H800 GPUs

3.6. Long Context

  • Reconfiguration of RoPE parameters to extend default context window
  • Linear scaling strategy applied (scaling factor raised from 1 to 4, RoPE base frequency from 10,000 to 100,000); see the sketch after this list
  • Theoretical processing of up to 64K tokens in context
  • Reliable outputs within 16K token range
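
A minimal sketch of RoPE with linear position scaling for long context; the scaling factor of 4 and the 100,000 base frequency follow the reconfiguration noted above, while the helper itself is an illustration rather than the training code.

```python
import torch

def rope_table(head_dim: int, max_pos: int, base: float = 100_000.0,
               scaling_factor: float = 4.0) -> torch.Tensor:
    """Cosine/sine table for rotary embeddings with linear position scaling.

    Positions are divided by `scaling_factor` so a longer sequence maps back
    onto the position range seen during pre-training.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_pos).float() / scaling_factor   # linear scaling
    angles = torch.outer(positions, inv_freq)                    # (max_pos, head_dim/2)
    return torch.cat([angles.cos(), angles.sin()], dim=-1)       # (max_pos, head_dim)
```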

3.7. Instruction Tuning

  • Development of DeepSeek-Coder-Instruct through fine-tuning
  • Use of high-quality instruction data structured in the Alpaca Instruction format (see the sketch after this list)
  • Unique delimiter token <|EOT|> for dialogue turn demarcation
  • Training with a cosine learning-rate schedule (100 warm-up steps, initial learning rate 1e-5)
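
A minimal sketch of an Alpaca-style dialogue layout terminated by the <|EOT|> delimiter; the header wording and the omission of the system prompt are assumptions, not the exact template used by DeepSeek-Coder-Instruct.

```python
def build_prompt(turns: list[tuple[str, str]], system: str = "") -> str:
    """Format a multi-turn dialogue in an Alpaca-style layout, closing each
    assistant response with the <|EOT|> delimiter.

    The header wording and system prompt handling are illustrative assumptions.
    """
    parts = [system] if system else []
    for user_msg, assistant_msg in turns:
        parts.append(f"### Instruction:\n{user_msg}")
        parts.append(f"### Response:\n{assistant_msg}\n<|EOT|>")
    return "\n".join(parts)

print(build_prompt([("Write a Python function that reverses a string.",
                     "def reverse(s):\n    return s[::-1]")]))
```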

4. Experimental Results

4.1. Code Generation

  • Evaluation on HumanEval and MBPP benchmarks
  • Expansion of HumanEval to seven additional programming languages
  • DeepSeek-Coder-Base 33B achieves state-of-the-art performance
  • DeepSeek-Coder-Base 6.7B performs on par with 34B parameter CodeLlama
  • DeepSeek-Coder-Instruct 33B outperforms OpenAI GPT-3.5 Turbo

4.2. Fill-in-the-Middle Code Completion

  • Evaluation on Single-Line Infilling benchmarks
  • DeepSeek-Coder outperforms larger counterparts
  • Correlation between model size and performance observed
  • Recommendation of DeepSeek-Coder-Base 6.7B for code completion tools

4.3. Cross-File Code Completion

  • Evaluation using CrossCodeEval dataset
  • DeepSeek-Coder consistently outperforms other models
  • Effectiveness of repository-level pre-training demonstrated

4.4. Program-based Math Reasoning

  • Evaluation following the Program-Aided Language model (PAL) approach to math reasoning
  • Assessment across seven distinct benchmarks
  • DeepSeek-Coder models achieve remarkable performance
  • Potential for complex mathematical computations and problem-solving demonstrated

5. Continue Pre-Training From General LLM

  • Additional pre-training from DeepSeek-LLM-7B Base on 2 trillion tokens
  • Resulting in DeepSeek-Coder-v1.5 7B
  • Pre-training corpus spanning source code, code-related and math-related natural language, and general bilingual natural language
  • Comparison between DeepSeek-Coder-v1.5 7B and DeepSeek-Coder 6.7B
  • Evaluation across programming, math reasoning, and natural language tasks
  • DeepSeek-Coder-Base-v1.5 shows improvements in most tasks, especially in math reasoning and natural language processing

6. Conclusion

  • Introduction of DeepSeek-Coder series with 1.3B, 6.7B, and 33B parameters
  • Trained on meticulously curated project-level code corpus
  • Use of "fill-in-the-blank" pre-training objective
  • Extension of context window to 16,384 tokens
  • DeepSeek-Coder-Base 33B surpasses existing open-source code models
  • DeepSeek-Coder-Base 6.7B performs on par with 34B parameter CodeLlama
  • DeepSeek-Coder-Instruct 33B outperforms OpenAI GPT-3.5 Turbo in coding-related tasks
  • Development of DeepSeek-Coder-v1.5 with improved natural language understanding
  • Future commitment to develop and share more powerful code-focused LLMs