About Me

Introduction

I completed my Ph.D. as a joint student of Sun Yat-sen University and Microsoft Research Asia, under the supervision of Prof. Jian Yin and Dr. Ming Zhou. I am currently a researcher at DeepSeek.

My research focuses on natural language processing and code intelligence, with the goal of enabling computers to intelligently process, understand, and generate both natural languages and programming languages. My long-term research goal is to develop artificial general intelligence that revolutionizes the way computers interact with humans and handle complex tasks.

My research areas currently include: (1) Large Language Models; (2) Code Intelligence.

Education

  • Sun Yat-sen University
  • Ph.D. in Computer Science and Technology, from August 2018 to June 2023.
    Joint Ph.D. Program with Microsoft Research Asia

  • Sun Yat-sen University
  • B.S. in Computer Science and Technology, from August 2014 to June 2018.

Experience

  • AI Researcher at DeepSeek
  • Working on code intelligence and mathematical reasoning, leading projects such as DeepSeek-Coder, DeepSeekMath, DeepSeek-Prover, and DeepSeek-Coder-V2, from July 2024 to Present.

  • Research Intern at Microsoft Research Asia
  • Mentored by Dr. Nan Duan in the Natural Language Computing Group, from May 2020 to May 2024.

  • Research Intern at Microsoft Research Asia
  • Mentored by Dr. Duyu Tang in the Natural Language Computing Group, from July 2017 to May 2020.

Publications

    Below you can find highlighted publications and the full list of my publications.

    DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

    Arxiv 2024
    TLDR: DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6 trillion tokens, which substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2 while maintaining comparable performance in general language tasks.

    DeepSeekMath: Pushing the limits of mathematical reasoning in open language models

    Arxiv 2024
    TLDR: DeepSeekMath 7B, built on DeepSeek-Coder-Base-v1.5 7B with 120B math tokens, scores 51.7% on the MATH benchmark without external tools. It excels due to a refined data selection pipeline and Group Relative Policy Optimization (GRPO).

    DeepSeek-Coder: When the Large Language Model Meets Programming--The Rise of Code Intelligence

    Arxiv 2024
    TLDR: This work introduces the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens, which not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models such as Codex and GPT-3.5.

    GraphCodeBERT: Pre-training Code Representations with Data Flow

    ICLR 2021
    TLDR: We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code, i.e., data flow, during pre-training.

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    Findings of EMNLP 2020
    TLDR: This work develops CodeBERT with a Transformer-based neural architecture and trains it with a hybrid objective function that incorporates the pre-training task of replaced token detection, i.e., detecting plausible alternatives sampled from generators.

    2024

    RLCoder: Reinforcement Learning for Repository-Level Code Completion

    ICSE 2024 PDF Code
    Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, and Zibin Zheng
    TLDR: RLCoder is a novel reinforcement learning framework that enables the retriever to learn to retrieve useful content for code completion without labeled data, and introduces a stop-signal mechanism.

    SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization

    SANER 2024 PDF Code
    Yanlin Wang, Yanxian Huang, Daya Guo, Hongyu Zhang, and Zibin Zheng
    TLDR: The SparseCoder model, an identifier-aware sparse Transformer for effectively handling long code sequences, employs a sliding-window mechanism for self-attention to model short-term dependencies and leverages the structural information of code to capture long-term dependencies among source code identifiers.

    DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

    Arxiv 2024 PDF Code
    Qihao Zhu*, Daya Guo*, Zhihong Shao*, Dejian Yang*, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao, Xuan Lu, Qinyu Chen, Yaohui Wang, Chengqi Deng, Jiashi Li, Chenggang Zhao, Chong Ruan, Fuli Luo, Wenfeng Liang (*Core Contributions)
    TLDR: DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6 trillion tokens, which substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2 while maintaining comparable performance in general language tasks.

    DeepSeekMath: Pushing the limits of mathematical reasoning in open language models

    Arxiv 2024 PDF Code
    Zhihong Shao*, Peiyi Wang*, Qihao Zhu*, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo* (*Core Contributions)
    TLDR: DeepSeekMath 7B, built on DeepSeek-Coder-Base-v1.5 7B with 120B math tokens, scores 51.7% on the MATH benchmark without external tools. It excels due to a refined data selection pipeline and Group Relative Policy Optimization (GRPO).

    DeepSeek-Coder: When the Large Language Model Meets Programming--The Rise of Code Intelligence

    Arxiv 2024 PDF Code
    Daya Guo*, Qihao Zhu*, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y.K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang (*Core Contributions)
    TLDR: This work introduces the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens, which not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models such as Codex and GPT-3.5.

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Arxiv 2024 PDF Code
    DeepSeek-AI Team
    TLDR: DeepSeek-V2 is a 236B parameter MoE model with 21B active per token and a 128K context length. It features Multi-head Latent Attention (MLA) for efficient inference and DeepSeekMoE for economical training. Compared to DeepSeek 67B, it reduces training costs by 42.5%, KV cache by 93.3%, and boosts throughput by 5.76 times.

    DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

    Arxiv 2024 PDF
    Huajian Xin, Daya Guo, Zhihong Shao, Z.Z. Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, Xiaodan Liang
    TLDR: This work introduces an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems, demonstrating the potential of large-scale synthetic data for enhancing theorem-proving capabilities in LLMs.

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    Arxiv 2024 PDF Code
    DeepSeek-AI Team
    TLDR: DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning, and open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

    2023

    LongCoder: A Long-Range Pre-trained Language Model for Code Completion

    ICML 2023 PDF Code
    Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley
    TLDR: The paper introduces LongCoder, a sparse Transformer model for code completion that handles long code inputs.

    Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data

    EMNLP 2023 PDF Code
    Canwen Xu, Daya Guo*, Nan Duan, Julian McAuley (*Equal Contributions)
    TLDR: This work proposes a pipeline that automatically generates a high-quality multi-turn chat corpus by having ChatGPT engage in a conversation with itself, and employs parameter-efficient tuning to enhance LLaMA, an open-source large language model.

    Noisy pair corrector for dense retrieval

    Findings of EMNLP 2023 PDF
    Hang Zhang, Yeyun Gong, Xingwei He, Dayiheng Liu, Daya Guo, Jiancheng Lv, Jian Guo
    TLDR: This paper proposes Noisy Pair Corrector (NPC), a novel approach for training an effective dense retrieval model under mismatched-pair noise, which consists of a detection module and a correction module.

    2022

    Learning to Complete Code with Sketches

    ICLR 2022 PDF
    Daya Guo, Alexey Svyatkovskiy, Jian Yin, Nan Duan, Marc Brockschmidt, Miltiadis Allamanis
    TLDR: We automatically generate code sketches, placing holes where ambiguity prevents us from predicting terminal tokens.

    UniXcoder: Unified Cross-Modal Pre-training for Code Representation

    ACL 2022 PDF Code
    Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, Jian Yin
    TLDR: In this work, we present UniXcoder, a unified cross-modal pre-trained model for programming languages to support both code-related understanding and generation tasks.

    ReACC: A Retrieval-Augmented Code Completion Framework

    ACL 2022 PDF Code
    Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, Alexey Svyatkovskiy
    TLDR: In this work, we propose ReACC, a retrieval-augmented code completion framework that utilizes external context for the code completion task by retrieving semantically and lexically similar codes from existing codebase.

    CodeReviewer: Pre-Training for Automating Code Review Activities

    FSE 2022 PDF
    Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan
    TLDR: In this work, we focus on pre-training techniques and introduce a pre-trained model CodeReviewer for automating code review activities.

    LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval

    Findings of ACL 2022 PDF Code
    Canwen Xu, Daya Guo*, Nan Duan, Julian McAuley (*Equal Contributions)
    TLDR: In this paper, we propose LaPraDoR, a pretrained dual-tower dense retriever that does not require any supervised data for zero-shot text retrieval.

    Soft-Labeled Contrastive Pre-training for Function-level Code Representation

    Findings of EMNLP 2022 PDF
    Xiaonan Li, Daya Guo, Yeyun Gong, Yun Lin, Yelong Shen, Xipeng Qiu, Daxin Jiang, Weizhu Chen and Nan Duan
    TLDR: In this paper, we present SCodeR to learn function-level code representation with soft-labeled contrastive pre-training.

    AR-LSAT: Investigating Analytical Reasoning of Text

    Findings of NAACL 2022 PDF Code and Dataset
    Wanjun Zhong, Siyuan Wang, Duyu Tang, Zenan Xu, Daya Guo, Yining Chen, Jiahai Wang, Jian Yin, Ming Zhou and Nan Duan.
    TLDR: This paper studies the challenge of analytical reasoning of text and introduces a new dataset consisting of questions from the Law School Admission Test from 1991 to 2016, and designs two different baselines which struggle to solve this task.

    2021

    GraphCodeBERT: Pre-training Code Representations with Data Flow

    ICLR 2021 PDF Code
    Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang and Ming Zhou
    TLDR: We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code, i.e., data flow, during pre-training.

    CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

    NeurIPS 2021 Datasets and Benchmarks Track PDF Dataset
    Shuai Lu, Daya Guo*, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu (*Equal Contributions)
    TLDR: This paper introduces CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation that includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison.

    Multi-modal Representation Learning for Video Advertisement Content Structuring

    ACM Multimedia 2021 PDF
    Daya Guo, Zhaoyang Zeng
    TLDR: In this paper, we propose a multi-modal encoder to learn multi-modal representation from video advertisements by interacting between video-audio and text. The framework achieves the 1st place on the task of Multi-modal Ads Video Understanding in ACM Multimedia 2021 Grand Challenge.

    Syntax-Enhanced Pre-trained Model

    ACL 2021 PDF Dataset
    Zenan Xu, Daya Guo, Duyu Tang, Qinliang Su, Linjun Shou, Ming Gong, Wanjun Zhong, Xiaojun Quan, Nan Duan and Daxin Jiang
    TLDR: We present a model that utilizes the syntax of text, i.e. dependency tree, in both pre-training and fine-tuning stages.

    2020

    Evidence-Aware Inferential Text Generation with Vector Quantised Variational AutoEncoder

    ACL 2020 PDF Code Slide Video
    Daya Guo, Duyu Tang, Nan Duan, Jian Yin, Daxin Jiang and Ming Zhou.
    TLDR: An approach equipped with a Vector Quantised-Variational Autoencoder that automatically finds evidence for an event from a large text corpus, and leverages the evidence to guide the generation of inferential texts, which provides state-of-the-art performance on both Event2Mind and ATOMIC datasets.

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    Findings of EMNLP 2020 PDF Code
    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou.
    TLDR: This work develops CodeBERT with a Transformer-based neural architecture and trains it with a hybrid objective function that incorporates the pre-training task of replaced token detection, i.e., detecting plausible alternatives sampled from generators.

    Graph-Based Reasoning over Heterogeneous External Knowledge for Commonsense Question Answering

    AAAI 2020 PDF Code
    Shangwen Lv, Daya Guo*, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu (*Equal Contributions)
    TLDR: This work proposes to automatically extract evidence from heterogeneous knowledge sources, and answer questions based on the extracted evidence, and achieves the state-of-the-art accuracy on the CommonsenseQA leaderboard.

    CodeBLEU: A Method for Automatic Evaluation of Code Synthesis

    Arxiv 2020 PDF
    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Ming Zhou, Ambrosio Blanco, Shuai Ma
    TLDR: This work introduces a new automatic evaluation metric, dubbed CodeBLEU, which absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data flow, achieving a better correlation with programmer-assigned scores than BLEU and accuracy.

    Inferential Text Generation with Multiple Knowledge Sources and Meta-Learning

    Arxiv 2020 PDF
    Daya Guo, Akari Asai, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Jian Yin, Ming Zhou
    TLDR: This work uses not only structured commonsense knowledge bases but also natural language snippets from search-engine results, both incorporated into a generative base model via a key-value memory network, and introduces a meta-learning-based multi-task learning algorithm.

    2019

    Coupling Retrieval and Meta-Learning for Context-Dependent Semantic Parsing

    ACL 2019 PDF Slide
    Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin
    TLDR: An approach to incorporate retrieved datapoints as supporting evidence for context-dependent semantic parsing, such as generating source code conditioned on the class environment, and shows that both the context-aware retriever and the meta-learning strategy improve accuracy.

    Multi-Task Learning for Conversational Question Answering over a Large-Scale Knowledge Base

    EMNLP 2019 PDF Code Video
    Tao Shen, Xiubo Geng, Tao Qin, Daya Guo, Duyu Tang, Nan Duan, Guodong Long, and Daxin Jiang
    TLDR: This work proposes an innovative multi-task learning framework where a pointer-equipped semantic parsing model is designed to resolve coreference in conversations, and naturally empower joint learning with a novel type-aware entity detection model.

    Multi-modal Representation Learning for Short Video Understanding and Recommendation

    ICMEW 2019 PDF Code
    Daya Guo, Jiangshui Hong, Binli Luo, Qirui Yan, and Zhangming Niu
    TLDR: A multi-modal representation learning method to improve the performance of recommender systems, together with a novel Key-Value Memory that maps dense real values into vectors and captures richer semantics in a nonlinear manner.

    2018

    Dialog-to-Action: Conversational Question Answering Over a Large-Scale Knowledge Base

    NeurIPS 2018 PDF Code
    Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin
    TLDR: An approach to map utterances in conversation to logical forms, which are executed on a large-scale knowledge base, showing that the semantic parsing-based approach outperforms a memory-network-based encoder-decoder model by a large margin.

    Question Generation from SQL Queries Improves Neural Semantic Parsing

    EMNLP 2018 PDF Code
    Daya Guo, Yibo Sun, Duyu Tang, Nan Duan, Jian Yin, Hong Chi, James Cao, Peng Chen, and Ming Zhou
    TLDR: This work conducts a study on WikiSQL, the largest hand-annotated semantic parsing dataset to date, and demonstrates that question generation is an effective method that enables learning a state-of-the-art neural-network-based semantic parser with only thirty percent of the supervised training data.

    Awards

    Data Mining Competition Awards

  • 1st Place Award of 2022 ATEC
  • 1st Place Award of 2022 WeChat Big Data Challenge [news]
  • 1st Place Award of 2021 ATEC
  • 2nd Place Award of 2021 Tencent QQ Browser Competition & ACM CIKM 2021 AnalytiCup [report]
  • 1st Place Award of 2021 Tencent Advertising Algorithm Competition & ACM Multimedia Grand Challenge [news]
  • 1st Place Award of 2020 Tencent Advertising Algorithm Competition [code] [news]
  • 1st Place Award of 2019 Tencent Advertising Algorithm Competition [code] [news]
  • 2nd Place Award of 2018 CCF Big Data and Computing Competition [code]
  • 6th Place Award of ICME 2019 & ByteDance Grand Challenge [code] [paper]

    Academic Competition Awards

  • Meritorious Winner of 2017 Mathematical Contest in Modeling
  • Second Prize of 2017 Guangdong Collegiate Programming Contest

    Scholarships

  • Microsoft Research Asia (MSRA) Fellowship, 2020, awarded to 12 outstanding Ph.D. students in the Asia-Pacific region [news]
  • SenseTime Scholarship, 2017, awarded to 24 outstanding undergraduate students in China
  • National Scholarship, 2015, 2016, and 2020, Sun Yat-sen University