Introduction
I am a joint PhD student, supervised by Prof. Jian Yin and Dr. Ming Zhou,
between Sun Yat-sen University and Microsoft Research Asia.
My research mainly focuses on natural language processing and code intelligence to enable computers to intelligently process, understand and generate both natural language and programming language. The long-term research goal is to develop artificial general intelligence to revolutionize the way computers interact with humans and handle complex tasks.
My research areas currently include: (1) Large Language Models; (2) Code Intelligence.
Education
Sun Yat-sen University
PhD in Computer Science and Technology, from August 2018 to June 2023 (Expected).
Joint Ph.D. Program with Microsoft Research Asia
Sun Yat-sen University
B.S. in Computer Science and Technology, from August 2014 to June 2018.
Experience
Research Intern at Microsoft Research Asia
Mentored by Dr. Nan Duan in the Natural Language Computing Group, from May 2020 to Present.
Research Intern at Microsoft Research Asia
Mentored by Dr. Duyu Tang in the Natural Language Computing Group, from July 2017 to May 2020.
DeepSeek-Coder
DeepSeek-Coder comprises a series of code language models trained on a mix of 87% code and 13% natural language in English and Chinese, with each model pre-trained on 2T tokens. We provide the code model in various sizes, ranging from 1B to 33B. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, to support project-level code completion and infilling. In coding capability, DeepSeek-Coder achieves state-of-the-art performance among open-source code models across multiple programming languages and various benchmarks.
Massive Training Data: Trained on 2T tokens.
Highly Flexible & Scalable: Offered in model sizes of 1B, 7B, and 33B.
Superior Model Performance: State-of-the-art performance among publicly available code models.
Advanced Code Completion Capabilities: A window size of 16K, supporting project-level code completion and infilling tasks.
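The fill-in-the-blank task mentioned above can be illustrated with a small sketch. It is not the actual DeepSeek-Coder data pipeline: the sentinel strings `FIM_BEGIN`, `FIM_HOLE` and `FIM_END`, the character-level span selection, and the function name `make_infilling_example` are all assumptions made for illustration. The idea is only that a span is cut out of a source file and the model is asked to reproduce it given the surrounding prefix and suffix.

```python
import random
from typing import Tuple

# Hypothetical sentinel strings, chosen for illustration only; the special
# tokens used by the real tokenizer may differ.
FIM_BEGIN, FIM_HOLE, FIM_END = "<FIM_BEGIN>", "<FIM_HOLE>", "<FIM_END>"


def make_infilling_example(code: str, rng: random.Random) -> Tuple[str, str]:
    """Cut a random non-empty span out of `code`.

    Returns (prompt, target): the prompt holds the prefix and suffix around a
    hole marker, and the target is the removed span the model should generate.
    """
    if len(code) < 2:
        raise ValueError("need at least two characters to cut a span")
    # Pick the span boundaries at character level (an assumption; a real
    # pipeline would more likely work on tokens or lines).
    start = rng.randrange(0, len(code) - 1)
    end = rng.randrange(start + 1, len(code) + 1)
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    prompt = f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"
    return prompt, middle


if __name__ == "__main__":
    source = "def add(a, b):\n    return a + b\n"
    prompt, target = make_infilling_example(source, random.Random(0))
    print(repr(prompt))
    print(repr(target))
```

Putting the removed span back into the hole always restores the original file, which is what makes such examples usable for both training and evaluating infilling.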
Publications
Below you can find highlighted publications and the full list of my publications.
2023
LongCoder: A Long-Range Pre-trained Language Model for Code Completion
ICML 2023
PDF
Code
Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley
TLDR: The paper introduces LongCoder, a sparse Transformer model for code completion that handles long code inputs.
Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data
Findings of EMNLP 2023
PDF
Code
Canwen Xu, Daya Guo*, Nan Duan, Julian McAuley (*Equal Contributions)
TLDR: A pipeline is proposed that can automatically generate a high-quality multi-turn chat corpus by leveraging ChatGPT to engage in a conversation with itself and employs parameter-efficient tuning to enhance LLaMA, an open-source large language model.
2022
Learning to Complete Code with Sketches
ICLR 2022
PDF
Daya Guo, Alexey Svyatkovskiy, Jian Yin, Nan Duan, Marc Brockschmidt, Miltiadis Allamanis
TLDR: Automatically generate (code) sketches, placing holes where ambiguity prevents us predicting terminal tokens.
UniXcoder: Unified Cross-Modal Pre-training for Code Representation
ACL 2022
PDF
Code
Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, Jian Yin
TLDR: In this work, we present UniXcoder, a unified cross-modal pre-trained model for programming languages to support both code-related understanding and generation tasks.
ReACC: A Retrieval-Augmented Code Completion Framework
ACL 2022
PDF
Code
Shuai Lu, Nan Duan, Hojae Han, Daya Guo, seung-won hwang, Alexey Svyatkovskiy
TLDR: In this work, we propose ReACC, a retrieval-augmented code completion framework that utilizes external context for the code completion task by retrieving semantically and lexically similar code from an existing codebase.
CodeReviewer: Pre-Training for Automating Code Review Activities
FSE 2022
PDF
Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan
TLDR: In this work, we focus on pre-training techniques and introduce a pre-trained model CodeReviewer for automating code review activities.
LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval
Findings of ACL 2022
PDF
Code
Canwen Xu, Daya Guo*, Nan Duan, Julian McAuley (*Equal Contributions)
TLDR: In this paper, we propose LaPraDoR, a pretrained dual-tower dense retriever that does not require any supervised data for zero-shot text retrieval.
Soft-Labeled Contrastive Pre-training for Function-level Code Representation
Findings of EMNLP 2022
PDF
Xiaonan Li, Daya Guo, Yeyun Gong, Yun Lin, Yelong Shen, Xipeng Qiu, Daxin Jiang, Weizhu Chen and Nan Duan
TLDR: In this paper, we present SCodeR to learn function-level code representation with soft-labeled contrastive pre-training.
AR-LSAT: Investigating Analytical Reasoning of Text
Findings of NAACL 2022
PDF
Code and Dataset
Wanjun Zhong, Siyuan Wang, Duyu Tang, Zenan Xu, Daya Guo, Yining Chen, Jiahai Wang, Jian Yin, Ming Zhou and Nan Duan.
TLDR: This paper studies the challenge of analytical reasoning of text and introduces a new dataset consisting of questions from the Law School Admission Test from 1991 to 2016, and designs two different baselines which struggle to solve this task.
2021
GraphCodeBERT: Pre-training Code Representations with Data Flow
ICLR 2021
PDF
Code
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang and Ming Zhou
TLDR: We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code, i.e. data flow, for pretraining.
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
NeurIPS 2021 Datasets and Benchmarks Track
PDF
Dataset
Shuai Lu, Daya Guo*, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu (*Equal Contributions)
TLDR: This paper introduces CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation that includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison.
Multi-modal Representation Learning for Video Advertisement Content Structuring
ACM Multimedia 2021
PDF
Daya Guo, Zhaoyang Zeng
TLDR: In this paper, we propose a multi-modal encoder to learn multi-modal representation from video advertisements by interacting between video-audio and text. The framework achieves the 1st place on the task of Multi-modal Ads Video Understanding in ACM Multimedia 2021 Grand Challenge.
Syntax-Enhanced Pre-trained Model
ACL 2021
PDF
Dataset
Zenan Xu, Daya Guo, Duyu Tang, Qinliang Su, Linjun Shou, Ming Gong, Wanjun Zhong, Xiaojun Quan, Nan Duan and Daxin Jiang
TLDR: We present a model that utilizes the syntax of text, i.e. dependency tree, in both pre-training and fine-tuning stages.
2020
Evidence-Aware Inferential Text Generation with Vector Quantised Variational AutoEncoder
ACL 2020
PDF
Code
Slide
Video
Daya Guo, Duyu Tang, Nan Duan, Jian Yin, Daxin Jiang and Ming Zhou.
TLDR: An approach equipped with a Vector Quantised-Variational Autoencoder that automatically finds evidence for an event from a large text corpus, and leverages the evidence to guide the generation of inferential texts, which provides state-of-the-art performance on both Event2Mind and ATOMIC datasets.
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
Findings of EMNLP 2020
PDF
Code
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou.
TLDR: This work develops CodeBERT with Transformer-based neural architecture, and trains it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators.
Graph-Based Reasoning over Heterogeneous External Knowledge for Commonsense Question Answering
AAAI 2020
PDF
Code
Shangwen Lv, Daya Guo*, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu (*Equal Contributions)
TLDR: This work proposes to automatically extract evidence from heterogeneous knowledge sources, and answer questions based on the extracted evidence, and achieves the state-of-the-art accuracy on the CommonsenseQA leaderboard.
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
arXiv 2020
PDF
Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Ming Zhou, Ambrosio Blanco, Shuai Ma
TLDR: This work introduces a new automatic evaluation metric, dubbed CodeBLEU, which absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow, and can achieve a better correlation with programmer-assigned scores compared with BLEU and accuracy.
Inferential Text Generation with Multiple Knowledge Sources and Meta-Learning
arXiv 2020
PDF
Daya Guo, Akari Asai, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Jian Yin, Ming Zhou
TLDR: This work uses not only structured commonsense knowledge bases, but also natural language snippets from search-engine results incorporated into a generative base model via key-value memory network and introduces a meta-learning based multi-task learning algorithm.
2019
Coupling Retrieval and Meta-Learning for Context-Dependent Semantic Parsing
ACL 2019
PDF
Slide
Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin
TLDR: An approach to incorporate retrieved datapoints as supporting evidence for context-dependent semantic parsing, such as generating source code conditioned on the class environment, and shows that both the context-aware retriever and the meta-learning strategy improve accuracy.
Multi-Task Learning for Conversational Question Answering over a Large-Scale Knowledge Base
EMNLP 2019
PDF
Code
Video
Tao Shen, Xiubo Geng, Tao Qin, Daya Guo, Duyu Tang, Nan Duan, Guodong Long, and Daxin Jiang
TLDR: This work proposes an innovative multi-task learning framework where a pointer-equipped semantic parsing model is designed to resolve coreference in conversations, and naturally empower joint learning with a novel type-aware entity detection model.
Multi-modal Representation Learning for Short Video Understanding and Recommendation
ICMEW 2019
PDF
Code
Daya Guo, Jiangshui Hong, Binli Luo, Qirui Yan, and Zhangming Niu
TLDR: A multi-modal representation learning method to improve the performance of recommender systems and a novel Key-Value Memory to map dense real values into vectors, which can capture richer semantics in a nonlinear manner.
2018
Dialog-to-Action: Conversational Question Answering Over a Large-Scale Knowledge Base
NeurIPS 2018
PDF
Code
Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin
TLDR: An approach to map utterances in conversation to logical forms, which will be executed on a large-scale knowledge base, and shows that the semantic parsing-based approach outperforms a memory network based encoder-decoder model by a huge margin.
Question Generation from SQL Queries Improves Neural Semantic Parsing
EMNLP 2018
PDF
Code
Daya Guo, Yibo Sun, Duyu Tang, Nan Duan, Jian Yin, Hong Chi, James Cao, Peng Chen, and Ming Zhou
TLDR: This work conducts a study on WikiSQL, the largest hand-annotated semantic parsing dataset to date, and demonstrates that question generation is an effective method that empowers us to learn a state-of-the-art neural network based semantic parser with thirty percent of the supervised training data.
Data Mining Competition Awards
1st Place Award of 2022 WeChat Big Data Challenge [news]
1st Place Award of 2021 ATEC
2nd Place Award of 2021 Tencent QQ Browser Competition & ACM CIKM 2021 AnalyticCup [report]
1st Place Award of 2021 Tencent Advertising Algorithm Competition & ACM Multimedia Grand Challenge [news]
1st Place Award of 2020 Tencent Advertising Algorithm Competition [code] [news]
1st Place Award of 2019 Tencent Advertising Algorithm Competition [code] [news]
2nd Place Award of 2018 CCF Big Data and Computing Competition [code]
6th Place Award of ICME 2019 & ByteDance Grand Challenge [code] [paper]
Academic Competition Awards
Meritorious Winner of 2017 Mathematical Contest in Modeling
Second Prize of 2017 Guangdong Collegiate Programming Contest
Scholarship
Microsoft Research Asia (MSRA) Fellowship, 2020 [news]
12 outstanding Ph.D. students in the Asia-Pacific region
SenseTime Scholarship, 2017
24 outstanding undergraduate students in China
National Scholarship
2015, 2016 and 2020 in Sun Yat-sen University