About Me

Introduction

I completed my Ph.D. as a joint student of Sun Yat-sen University and Microsoft Research Asia, under the supervision of Prof. Jian Yin and Dr. Ming Zhou. I am currently a researcher at DeepSeek.

My research focuses on natural language processing and code intelligence, with the goal of enabling computers to intelligently process, understand, and generate both natural languages and programming languages. My long-term research goal is to develop artificial general intelligence that revolutionizes the way computers interact with humans and handle complex tasks.

My research areas currently include: (1) Large Language Models; (2) Code Intelligence.

Education

  • Sun Yat-sen University
  • Ph.D. in Computer Science and Technology, from August 2018 to June 2023.
    Joint Ph.D. Program with Microsoft Research Asia

  • Sun Yat-sen University
  • B.S. in Computer Science and Technology, from August 2014 to June 2018.

Experience

  • AI Researcher at DeepSeek
  • Working on code intelligence and mathematical reasoning, leading projects such as DeepSeek-Coder, DeepSeekMath, DeepSeek-Prover, and DeepSeek-Coder-V2, from July 2024 to Present.

  • Research Intern at Microsoft Research Asia
  • Mentored by Dr. Nan Duan in the Natural Language Computing Group, from May 2020 to May 2024.

  • Research Intern at Microsoft Research Asia
  • Mentored by Dr. Duyu Tang in the Natural Language Computing Group, from July 2017 to May 2020.

Publications

    Below you can find highlighted publications and the full list of my publications.

    DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

    Arxiv 2024
    TLDR: DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6 trillion tokens, which substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2 while maintaining comparable performance in general language tasks.

    DeepSeekMath: Pushing the limits of mathematical reasoning in open language models

    Arxiv 2024
    TLDR: DeepSeekMath 7B, built on DeepSeek-Coder-Base-v1.5 7B with 120B math tokens, scores 51.7% on the MATH benchmark without external tools. It excels due to a refined data selection pipeline and Group Relative Policy Optimization (GRPO).

    DeepSeek-Coder: When the Large Language Model Meets Programming--The Rise of Code Intelligence

    Arxiv 2024
    TLDR: This work introduces the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens, which not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models such as Codex and GPT-3.5.

    GraphCodeBERT: Pre-training Code Representations with Data Flow

    ICLR 2021
    TLDR: We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code, i.e., data flow, during pre-training.

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    Findings of EMNLP 2020
    TLDR: This work develops CodeBERT with a Transformer-based neural architecture and trains it with a hybrid objective function that incorporates the pre-training task of replaced token detection, i.e., detecting plausible alternatives sampled from generators.

    2024

    RLCoder: Reinforcement Learning for Repository-Level Code Completion

    ICSE 2024 PDF Code
    Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, and Zibin Zheng
    TLDR: RLCoder is a novel reinforcement learning framework that enables the retriever to learn to retrieve useful content for code completion without labeled data, and introduces a stop-signal mechanism.

    SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization

    SANER 2024 PDF Code
    Yanlin Wang, Yanxian Huang, Daya Guo, Hongyu Zhang, and Zibin Zheng
    TLDR: The SparseCoder model, an identifier-aware sparse Transformer for effectively handling long code sequences, employs a sliding-window mechanism for self-attention to model short-term dependencies and leverages the structural information of code to capture long-term dependencies among source code identifiers.

    DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

    Arxiv 2024 PDF Code
    Qihao Zhu*, Daya Guo*, Zhihong Shao*, Dejian Yang*, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao, Xuan Lu, Qinyu Chen, Yaohui Wang, Chengqi Deng, Jiashi Li, Chenggang Zhao, Chong Ruan, Fuli Luo, Wenfeng Liang (*Core Contributions)
    TLDR: DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6 trillion tokens, which substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2 while maintaining comparable performance in general language tasks.

    DeepSeekMath: Pushing the limits of mathematical reasoning in open language models

    Arxiv 2024 PDF Code
    Zhihong Shao*, Peiyi Wang*, Qihao Zhu*, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo* (*Core Contributions)
    TLDR: DeepSeekMath 7B, built on DeepSeek-Coder-Base-v1.5 7B with 120B math tokens, scores 51.7% on the MATH benchmark without external tools. It excels due to a refined data selection pipeline and Group Relative Policy Optimization (GRPO).

    DeepSeek-Coder: When the Large Language Model Meets Programming--The Rise of Code Intelligence

    Arxiv 2024 PDF Code
    Daya Guo*, Qihao Zhu*, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y.K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang (*Core Contributions)
    TLDR: This work introduces the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens, which not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models such as Codex and GPT-3.5.

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    Arxiv 2024 PDF Code
    DeepSeek-AI Team
    TLDR: DeepSeek-V2 is a 236B parameter MoE model with 21B active per token and a 128K context length. It features Multi-head Latent Attention (MLA) for efficient inference and DeepSeekMoE for economical training. Compared to DeepSeek 67B, it reduces training costs by 42.5%, KV cache by 93.3%, and boosts throughput by 5.76 times.

    DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

    Arxiv 2024 PDF
    Huajian Xin, Daya Guo, Zhihong Shao, Z.Z. Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, Xiaodan Liang
    TLDR: This work introduces an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems, demonstrating the potential of large-scale synthetic data for enhancing theorem-proving capabilities in LLMs.

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    Arxiv 2024 PDF Code
    DeepSeek-AI Team
    TLDR: DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning, and open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.

    2023

    LongCoder: A Long-Range Pre-trained Language Model for Code Completion

    ICML 2023 PDF Code
    Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley
    TLDR: The paper introduces LongCoder, a sparse Transformer model for code completion that handles long code inputs.

    Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data

    EMNLP 2023 PDF Code
    Canwen Xu, Daya Guo*, Nan Duan, Julian McAuley (*Equal Contributions)
    TLDR: This work proposes a pipeline that automatically generates a high-quality multi-turn chat corpus by having ChatGPT engage in a conversation with itself, and employs parameter-efficient tuning to enhance LLaMA, an open-source large language model.

    Noisy pair corrector for dense retrieval

    Findings of EMNLP 2023 PDF
    Hang Zhang, Yeyun Gong, Xingwei He, Dayiheng Liu, Daya Guo, Jiancheng Lv, Jian Guo
    TLDR: This paper proposes Noisy Pair Corrector (NPC), a novel approach for training an effective dense retrieval model under mismatched-pair noise, which consists of a detection module and a correction module.

    2022

    Learning to Complete Code with Sketches

    ICLR 2022 PDF
    Daya Guo, Alexey Svyatkovskiy, Jian Yin, Nan Duan, Marc Brockschmidt, Miltiadis Allamanis
    TLDR: We automatically generate code sketches, placing holes where ambiguity prevents us from predicting terminal tokens.

    UniXcoder: Unified Cross-Modal Pre-training for Code Representation

    ACL 2022 PDF Code
    Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, Jian Yin
    TLDR: In this work, we present UniXcoder, a unified cross-modal pre-trained model for programming languages to support both code-related understanding and generation tasks.

    ReACC: A Retrieval-Augmented Code Completion Framework

    ACL 2022 PDF Code
    Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, Alexey Svyatkovskiy
    TLDR: In this work, we propose ReACC, a retrieval-augmented code completion framework that utilizes external context for the code completion task by retrieving semantically and lexically similar codes from existing codebase.

    CodeReviewer: Pre-Training for Automating Code Review Activities

    FSE 2022 PDF
    Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan
    TLDR: In this work, we focus on pre-training techniques and introduce a pre-trained model CodeReviewer for automating code review activities.

    LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval

    Findings of ACL 2022 PDF Code
    Canwen Xu, Daya Guo*, Nan Duan, Julian McAuley (*Equal Contributions)
    TLDR: In this paper, we propose LaPraDoR, a pretrained dual-tower dense retriever that does not require any supervised data for zero-shot text retrieval.

    Soft-Labeled Contrastive Pre-training for Function-level Code Representation

    Findings of EMNLP 2022 PDF
    Xiaonan Li, Daya Guo, Yeyun Gong, Yun Lin, Yelong Shen, Xipeng Qiu, Daxin Jiang, Weizhu Chen and Nan Duan
    TLDR: In this paper, we present SCodeR to learn function-level code representation with soft-labeled contrastive pre-training.

    AR-LSAT: Investigating Analytical Reasoning of Text

    Findings of NAACL 2022 PDF Code and Dataset
    Wanjun Zhong, Siyuan Wang, Duyu Tang, Zenan Xu, Daya Guo, Yining Chen, Jiahai Wang, Jian Yin, Ming Zhou and Nan Duan.
    TLDR: This paper studies the challenge of analytical reasoning of text and introduces a new dataset consisting of questions from the Law School Admission Test from 1991 to 2016, and designs two different baselines which struggle to solve this task.

    2021

    GraphCodeBERT: Pre-training Code Representations with Data Flow

    ICLR 2021 PDF Code
    Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang and Ming Zhou
    TLDR: We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code, i.e., data flow, during pre-training.

    CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

    NeurIPS 2021 Datasets and Benchmarks Track PDF Dataset
    Shuai Lu, Daya Guo*, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu (*Equal Contributions)
    TLDR: This paper introduces CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation that includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison.

    Multi-modal Representation Learning for Video Advertisement Content Structuring

    ACM Multimedia 2021 PDF
    Daya Guo, Zhaoyang Zeng
    TLDR: In this paper, we propose a multi-modal encoder to learn multi-modal representation from video advertisements by interacting between video-audio and text. The framework achieves the 1st place on the task of Multi-modal Ads Video Understanding in ACM Multimedia 2021 Grand Challenge.

    Syntax-Enhanced Pre-trained Model

    ACL 2021 PDF Dataset
    Zenan Xu, Daya Guo, Duyu Tang, Qinliang Su, Linjun Shou, Ming Gong, Wanjun Zhong, Xiaojun Quan, Nan Duan and Daxin Jiang
    TLDR: We present a model that utilizes the syntax of text, i.e. dependency tree, in both pre-training and fine-tuning stages.

    2020

    Evidence-Aware Inferential Text Generation with Vector Quantised Variational AutoEncoder

    ACL 2020 PDF Code Slide Video
    Daya Guo, Duyu Tang, Nan Duan, Jian Yin, Daxin Jiang and Ming Zhou.
    TLDR: An approach equipped with a Vector Quantised-Variational Autoencoder that automatically finds evidence for an event from a large text corpus, and leverages the evidence to guide the generation of inferential texts, which provides state-of-the-art performance on both Event2Mind and ATOMIC datasets.

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    Findings of EMNLP 2020 PDF Code
    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou.
    TLDR: This work develops CodeBERT with a Transformer-based neural architecture and trains it with a hybrid objective function that incorporates the pre-training task of replaced token detection, i.e., detecting plausible alternatives sampled from generators.

    Graph-Based Reasoning over Heterogeneous External Knowledge for Commonsense Question Answering

    AAAI 2020 PDF Code
    Shangwen Lv, Daya Guo*, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu (*Equal Contributions)
    TLDR: This work proposes to automatically extract evidence from heterogeneous knowledge sources, and answer questions based on the extracted evidence, and achieves the state-of-the-art accuracy on the CommonsenseQA leaderboard.

    CodeBLEU: A Method for Automatic Evaluation of Code Synthesis

    Arxiv 2020 PDF
    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Ming Zhou, Ambrosio Blanco, Shuai Ma
    TLDR: This work introduces a new automatic evaluation metric, dubbed CodeBLEU, which absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data flow, achieving a better correlation with programmer-assigned scores than BLEU and accuracy.

    Inferential Text Generation with Multiple Knowledge Sources and Meta-Learning

    Arxiv 2020 PDF
    Daya Guo, Akari Asai, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Jian Yin, Ming Zhou
    TLDR: This work uses not only structured commonsense knowledge bases but also natural language snippets from search-engine results, both incorporated into a generative base model via a key-value memory network, and introduces a meta-learning-based multi-task learning algorithm.

    2019

    Coupling Retrieval and Meta-Learning for Context-Dependent Semantic Parsing

    ACL 2019 PDF Slide
    Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin
    TLDR: An approach to incorporate retrieved datapoints as supporting evidence for context-dependent semantic parsing, such as generating source code conditioned on the class environment, and shows that both the context-aware retriever and the meta-learning strategy improve accuracy.

    Multi-Task Learning for Conversational Question Answering over a Large-Scale Knowledge Base

    EMNLP 2019 PDF Code Video
    Tao Shen, Xiubo Geng, Tao Qin, Daya Guo, Duyu Tang, Nan Duan, Guodong Long, and Daxin Jiang
    TLDR: This work proposes an innovative multi-task learning framework where a pointer-equipped semantic parsing model is designed to resolve coreference in conversations, and naturally empower joint learning with a novel type-aware entity detection model.

    Multi-modal Representation Learning for Short Video Understanding and Recommendation

    ICMEW 2019 PDF Code
    Daya Guo, Jiangshui Hong, Binli Luo, Qirui Yan, and Zhangming Niu
    TLDR: A multi-modal representation learning method to improve the performance of recommender systems, together with a novel Key-Value Memory that maps dense real values into vectors and captures richer semantics in a nonlinear manner.

    2018

    Dialog-to-Action: Conversational Question Answering Over a Large-Scale Knowledge Base

    NeurIPS 2018 PDF Code
    Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin
    TLDR: An approach to map utterances in conversation to logical forms, which are executed on a large-scale knowledge base, showing that the semantic parsing-based approach outperforms a memory-network-based encoder-decoder model by a large margin.

    Question Generation from SQL Queries Improves Neural Semantic Parsing

    EMNLP 2018 PDF Code
    Daya Guo, Yibo Sun, Duyu Tang, Nan Duan, Jian Yin, Hong Chi, James Cao, Peng Chen, and Ming Zhou
    TLDR: This work conducts a study on WikiSQL, the largest hand-annotated semantic parsing dataset to date, and demonstrates that question generation is an effective method that enables learning a state-of-the-art neural-network-based semantic parser with only thirty percent of the supervised training data.

    Awards

    Data Mining Competition Awards

  • 1st Place Award of 2022 ATEC
  • 1st Place Award of 2022 WeChat Big Data Challenge [news]
  • 1st Place Award of 2021 ATEC
  • 2nd Place Award of 2021 Tencent QQ Browser Competition & ACM CIKM 2021 AnalytiCup [report]
  • 1st Place Award of 2021 Tencent Advertising Algorithm Competition & ACM Multimedia Grand Challenge [news]
  • 1st Place Award of 2020 Tencent Advertising Algorithm Competition [code] [news]
  • 1st Place Award of 2019 Tencent Advertising Algorithm Competition [code] [news]
  • 2nd Place Award of 2018 CCF Big Data and Computing Competition [code]
  • 6th Place Award of ICME 2019 & ByteDance Grand Challenge [code] [paper]

    Academic Competition Awards

  • Meritorious Winner of 2017 Mathematical Contest in Modeling
  • Second Prize of 2017 Guangdong Collegiate Programming Contest

    Scholarships

  • Microsoft Research Asia (MSRA) Fellowship, 2020, awarded to 12 outstanding Ph.D. students in the Asia-Pacific region [news]
  • SenseTime Scholarship, 2017, awarded to 24 outstanding undergraduate students in China
  • National Scholarship, 2015, 2016, and 2020, Sun Yat-sen University