About Me

Introduction

I am a Ph.D. student in the joint program between Sun Yat-sen University and Microsoft Research Asia, supervised by Prof. Jian Yin and Dr. Ming Zhou.

My research mainly focuses on natural language processing and code intelligence, enabling computers to intelligently process, understand, and generate both natural language and programming languages. My long-term research goal is to develop artificial general intelligence that revolutionizes the way computers interact with humans and handle complex tasks.

My current research areas include: (1) Large Language Models; (2) Code Intelligence.

Education

  • Sun Yat-sen University
    Ph.D. in Computer Science and Technology, August 2018 to June 2023 (expected).
    Joint Ph.D. Program with Microsoft Research Asia.

  • Sun Yat-sen University
    B.S. in Computer Science and Technology, August 2014 to June 2018.

Experience

  • Research Intern at Microsoft Research Asia
    Mentored by Dr. Nan Duan in the Natural Language Computing Group, May 2020 to present.

  • Research Intern at Microsoft Research Asia
    Mentored by Dr. Duyu Tang in the Natural Language Computing Group, July 2017 to May 2020.

Projects

    DeepSeek-Coder

    DeepSeek-Coder comprises a series of code language models trained on 87% code and 13% natural language in English and Chinese, with each model pre-trained on 2T tokens. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, to support project-level code completion and infilling. For coding capabilities, DeepSeek-Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks. A minimal usage sketch follows the feature list below.

  • Massive Training Data: Trained on 2T tokens.
  • Highly Flexible & Scalable: Offered in model sizes of 1B, 7B, and 33B.
  • Superior Model Performance: State-of-the-art performance among publicly available code models.
  • Advanced Code Completion Capabilities: A window size of 16K, supporting project-level code completion and infilling tasks.
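
    The snippet below is a minimal sketch of running code completion with a released DeepSeek-Coder checkpoint through the Hugging Face Transformers library. The repository id "deepseek-ai/deepseek-coder-1.3b-base", the prompt, and the decoding settings are illustrative assumptions; substitute whichever checkpoint and configuration you actually use.

      # Minimal sketch: code completion with a DeepSeek-Coder checkpoint.
      # The Hub repo id below is an assumption; any released DeepSeek-Coder base model
      # should work the same way through the standard Transformers API.
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_id = "deepseek-ai/deepseek-coder-1.3b-base"  # assumed checkpoint name
      tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
      model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

      # Ask the model to complete a function body from a short prompt.
      prompt = "# check whether a number is prime\ndef is_prime(n):\n"
      inputs = tokenizer(prompt, return_tensors="pt")
      outputs = model.generate(**inputs, max_new_tokens=64)
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))
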
Publications

    Below you can find highlighted publications and the full list of my publications.

    2023

    LongCoder: A Long-Range Pre-trained Language Model for Code Completion

    ICML 2023 PDF Code
    Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley
    TLDR: The paper introduces LongCoder, a sparse Transformer model for code completion that handles long code inputs.

    Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data

    Findings of EMNLP 2023 PDF Code
    Canwen Xu, Daya Guo*, Nan Duan, Julian McAuley (*Equal Contributions)
    TLDR: A pipeline is proposed that can automatically generate a high-quality multi-turn chat corpus by leveraging ChatGPT to engage in a conversation with itself and employs parameter-efficient tuning to enhance LLaMA, an open-source large language model.

    2022

    Learning to Complete Code with Sketches

    ICLR 2022 PDF
    Daya Guo, Alexey Svyatkovskiy, Jian Yin, Nan Duan, Marc Brockschmidt, Miltiadis Allamanis
    TLDR: Automatically generate (code) sketches, placing holes where ambiguity prevents us from predicting terminal tokens.

    UniXcoder: Unified Cross-Modal Pre-training for Code Representation

    ACL 2022 PDF Code
    Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, Jian Yin
    TLDR: In this work, we present UniXcoder, a unified cross-modal pre-trained model for programming languages to support both code-related understanding and generation tasks.

    ReACC: A Retrieval-Augmented Code Completion Framework

    ACL 2022 PDF Code
    Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, Alexey Svyatkovskiy
    TLDR: In this work, we propose ReACC, a retrieval-augmented code completion framework that utilizes external context for the code completion task by retrieving semantically and lexically similar codes from existing codebase.

    CodeReviewer: Pre-Training for Automating Code Review Activities

    FSE 2022 PDF
    Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan
    TLDR: In this work, we focus on pre-training techniques and introduce a pre-trained model CodeReviewer for automating code review activities.

    LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval

    Findings of ACL 2022 PDF Code
    Canwen Xu, Daya Guo*, Nan Duan, Julian McAuley (*Equal Contributions)
    TLDR: In this paper, we propose LaPraDoR, a pretrained dual-tower dense retriever that does not require any supervised data for zero-shot text retrieval.

    Soft-Labeled Contrastive Pre-training for Function-level Code Representation

    Findings of EMNLP 2022 PDF
    Xiaonan Li, Daya Guo, Yeyun Gong, Yun Lin, Yelong Shen, Xipeng Qiu, Daxin Jiang, Weizhu Chen and Nan Duan
    TLDR: In this paper, we present SCodeR to learn function-level code representation with soft-labeled contrastive pre-training.

    AR-LSAT: Investigating Analytical Reasoning of Text

    Findings of NAACL 2022 PDF Code and Dataset
    Wanjun Zhong, Siyuan Wang, Duyu Tang, Zenan Xu, Daya Guo, Yining Chen, Jiahai Wang, Jian Yin, Ming Zhou and Nan Duan.
    TLDR: This paper studies the challenge of analytical reasoning of text and introduces a new dataset consisting of questions from the Law School Admission Test from 1991 to 2016, and designs two different baselines which struggle to solve this task.

    2021

    GraphCodeBERT: Pre-training Code Representations with Data Flow

    ICLR 2021 PDF Code
    Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang and Ming Zhou
    TLDR: We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code, i.e. data flow, for pretraining.
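
    As a quick illustration of how a pre-trained code model like this can be used downstream, the sketch below extracts a function-level code representation with the Hugging Face Transformers library. The checkpoint name "microsoft/graphcodebert-base" and the use of the first token's hidden state as the code embedding are assumptions for illustration, not the paper's exact evaluation setup.

      # Sketch: obtain a code embedding from a GraphCodeBERT-style encoder.
      # The Hub repo id and the pooling choice below are illustrative assumptions.
      import torch
      from transformers import AutoModel, AutoTokenizer

      model_id = "microsoft/graphcodebert-base"  # assumed checkpoint name
      tokenizer = AutoTokenizer.from_pretrained(model_id)
      model = AutoModel.from_pretrained(model_id)

      code = "def max(a, b): return a if a > b else b"
      inputs = tokenizer(code, return_tensors="pt")
      with torch.no_grad():
          # Use the first token's hidden state as a simple code representation.
          embedding = model(**inputs).last_hidden_state[:, 0]
      print(embedding.shape)  # e.g. torch.Size([1, 768])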

    CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

    NeurIPS 2021 Datasets and Benchmarks Track PDF Dataset
    Shuai Lu, Daya Guo*, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu (*Equal Contributions)
    TLDR: This paper introduces CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation that includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison.

    Multi-modal Representation Learning for Video Advertisement Content Structuring

    ACM Multimedia 2021 PDF
    Daya Guo, Zhaoyang Zeng
    TLDR: In this paper, we propose a multi-modal encoder to learn multi-modal representation from video advertisements by interacting between video-audio and text. The framework achieves the 1st place on the task of Multi-modal Ads Video Understanding in ACM Multimedia 2021 Grand Challenge.

    Syntax-Enhanced Pre-trained Model

    ACL 2021 PDF Dataset
    Zenan Xu, Daya Guo, Duyu Tang, Qinliang Su, Linjun Shou, Ming Gong, Wanjun Zhong, Xiaojun Quan, Nan Duan and Daxin Jiang
    TLDR: We present a model that utilizes the syntax of text, i.e. dependency tree, in both pre-training and fine-tuning stages.

    2020

    Evidence-Aware Inferential Text Generation with Vector Quantised Variational AutoEncoder

    ACL 2020 PDF Code Slide Video
    Daya Guo, Duyu Tang, Nan Duan, Jian Yin, Daxin Jiang and Ming Zhou.
    TLDR: An approach equipped with a Vector Quantised-Variational Autoencoder that automatically finds evidence for an event from a large text corpus, and leverages the evidence to guide the generation of inferential texts, which provides state-of-the-art performance on both Event2Mind and ATOMIC datasets.

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    Findings of EMNLP 2020 PDF Code
    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou.
    TLDR: This work develops CodeBERT with Transformer-based neural architecture, and trains it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators.

    Graph-Based Reasoning over Heterogeneous External Knowledge for Commonsense Question Answering

    AAAI 2020 PDF Code
    Shangwen Lv, Daya Guo*, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu (*Equal Contributions)
    TLDR: This work proposes to automatically extract evidence from heterogeneous knowledge sources, and answer questions based on the extracted evidence, and achieves the state-of-the-art accuracy on the CommonsenseQA leaderboard.

    CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

    Arxiv 2020 PDF
    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Ming Zhou, Ambrosio Blanco, Shuai Ma
    TLDR: This work introduces a new automatic evaluation metric, dubbed CodeBLEU, which absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data flow, and achieves a better correlation with programmer-assigned scores compared with BLEU and accuracy.

    Inferential Text Generation with Multiple Knowledge Sources and Meta-Learning

    Arxiv 2020 PDF
    Daya Guo, Akari Asai, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Jian Yin, Ming Zhou
    TLDR: This work uses not only structured commonsense knowledge bases, but also natural language snippets from search-engine results incorporated into a generative base model via key-value memory network and introduces a meta-learning based multi-task learning algorithm.

    2019

    Coupling Retrieval and Meta-Learning for Context-Dependent Semantic Parsing

    ACL 2019 PDF Slide
    Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin
    TLDR: An approach to incorporate retrieved datapoints as supporting evidence for context-dependent semantic parsing, such as generating source code conditioned on the class environment, and shows that both the context-aware retriever and the meta-learning strategy improve accuracy.

    Multi-Task Learning for Conversational Question Answering over a Large-Scale Knowledge Base

    EMNLP 2019 PDF Code Video
    Tao Shen, Xiubo Geng, Tao Qin, Daya Guo, Duyu Tang, Nan Duan, Guodong Long, and Daxin Jiang
    TLDR: This work proposes an innovative multi-task learning framework where a pointer-equipped semantic parsing model is designed to resolve coreference in conversations, and naturally empower joint learning with a novel type-aware entity detection model.

    Multi-modal Representation Learning for Short Video Understanding and Recommendation

    ICMEW 2019 PDF Code
    Daya Guo, Jiangshui Hong, Binli Luo, Qirui Yan, and Zhangming Niu
    TLDR: A multi-modal representation learning method that improves the performance of recommender systems, together with a novel Key-Value Memory that maps dense real values into vectors to capture richer semantics in a nonlinear manner.

    2018

    Dialog-to-Action: Conversational Question Answering Over a Large-Scale Knowledge Base

    NeurIPS 2018 PDF Code
    Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin
    TLDR: An approach to map utterances in conversation to logical forms, which will be executed on a large-scale knowledge base, and shows that the semantic parsing-based approach outperforms a memory network based encoder-decoder model by a huge margin.

    Question Generation from SQL Queries Improves Neural Semantic Parsing

    EMNLP 2018 PDF Code
    Daya Guo, Yibo Sun, Duyu Tang, Nan Duan, Jian Yin, Hong Chi, James Cao, Peng Chen, and Ming Zhou
    TLDR: This work conducts a study on WikiSQL, the largest hand-annotated semantic parsing dataset to date, and demonstrates that question generation is an effective method that empowers us to learn a state-of-the-art neural network based semantic parser with thirty percent of the supervised training data.

Awards

    Data Mining Competition Awards

  • 1st Place Award of 2022 Wechat Big Data Challenge [news]
  • 1st Place Award of 2021 ATEC
  • 2nd Place Award of 2021 Tencent QQ Browser Competition & ACM CIKM 2021 AnalytiCup [report]
  • 1st Place Award of 2021 Tencent Advertising Algorithm Competition & ACM Multimedia Grand Challenge [news]
  • 1st Place Award of 2020 Tencent Advertising Algorithm Competition [code] [news]
  • 1st Place Award of 2019 Tencent Advertising Algorithm Competition [code] [news]
  • 2nd Place Award of 2018 CCF Big Data and Computing Competition [code]
  • 6th Place Award of ICME 2019 & ByteDance Grand Challenge [code] [paper]

    Academic Competition Awards

  • Meritorious Winner of 2017 Mathematical Contest in Modeling
  • Second Prize of 2017 Guangdong Collegiate Programming Contest

    Scholarships

  • Microsoft Research Asia (MSRA) Fellowship, 2020, awarded to 12 outstanding Ph.D. students in the Asia-Pacific region [news]
  • SenseTime Scholarship, 2017, awarded to 24 outstanding undergraduate students in China
  • National Scholarship, Sun Yat-sen University, 2015, 2016, and 2020