2024
RLCoder: Reinforcement Learning for Repository-Level Code Completion
ICSE 2024
PDF
Code
Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, and Zibin Zheng
TLDR: RLCoder is a novel reinforcement learning framework that enables the retriever to learn to retrieve useful content for code completion without labeled data, and it introduces a stop-signal mechanism.
SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization
SANER 2024
PDF
Code
Yanlin Wang, Yanxian Huang, Daya Guo, Hongyu Zhang, and Zibin Zheng
TLDR: SparseCoder is an identifier-aware sparse Transformer for effectively handling long code sequences; it employs a sliding-window mechanism for self-attention to model short-term dependencies and leverages the structural information of code to capture long-term dependencies among source-code identifiers.
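A minimal sketch of the sliding-window self-attention mask described in this TLDR, assuming a symmetric window; the identifier-aware global attention that SparseCoder adds on top is omitted, and the sizes are illustrative only:

import numpy as np

def sliding_window_mask(seq_len, window):
    # True where |i - j| <= window, i.e. the positions each token may attend to;
    # attention cost then grows linearly with file length instead of quadratically.
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

print(sliding_window_mask(seq_len=6, window=1).astype(int))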
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
Arxiv 2024
PDF
Code
Qihao Zhu*, Daya Guo*, Zhihong Shao*, Dejian Yang*, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao, Xuan Lu, Qinyu Chen, Yaohui Wang, Chengqi Deng, Jiashi Li, Chenggang Zhao, Chong Ruan, Fuli Luo, Wenfeng Liang (*Core Contributions)
TLDR: DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6 trillion tokens, which substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2 while maintaining comparable performance on general language tasks.
DeepSeekMath: Pushing the limits of mathematical reasoning in open language models
Arxiv 2024
PDF
Code
Zhihong Shao*, Peiyi Wang*, Qihao Zhu*, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo* (*Core Contributions)
TLDR: DeepSeekMath 7B, built on DeepSeek-Coder-Base-v1.5 7B with 120B math tokens, scores 51.7% on the MATH benchmark without external tools. It excels due to a refined data selection pipeline and Group Relative Policy Optimization (GRPO).
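A minimal sketch of the group-relative advantage idea behind GRPO, assuming binary rewards and illustrative names; the full method plugs these advantages into a clipped policy-gradient objective with a KL penalty to a reference model:

import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each sampled completion's reward against its own group,
    # replacing the separate learned value (critic) network used in PPO.
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four completions sampled for the same math problem, scored 0/1.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))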
DeepSeek-Coder: When the Large Language Model Meets Programming--The Rise of Code Intelligence
Arxiv 2024
PDF
Code
Daya Guo*, Qihao Zhu*, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y.K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang (*Core Contributions)
TLDR: This work introduces the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens, which not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models such as Codex and GPT-3.5.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Arxiv 2024
PDF
Code
DeepSeek-AI Team
TLDR: DeepSeek-V2 is a 236B-parameter MoE model with 21B parameters activated per token and a 128K context length. It features Multi-head Latent Attention (MLA) for efficient inference and DeepSeekMoE for economical training. Compared to DeepSeek 67B, it reduces training costs by 42.5%, shrinks the KV cache by 93.3%, and boosts maximum generation throughput to 5.76 times.
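A minimal sketch of generic top-k expert routing, the mechanism that keeps the number of activated parameters far below the total; the softmax gating and sizes here are illustrative and not DeepSeek-V2's exact routing, which additionally relies on fine-grained and shared experts and balance losses:

import numpy as np

def route(gate_logits, k=2):
    # Select the top-k experts for one token and renormalize their gate
    # weights with a softmax over the selected logits only.
    top = np.argsort(gate_logits)[::-1][:k]
    gates = np.exp(gate_logits[top] - gate_logits[top].max())
    return top, gates / gates.sum()

experts, weights = route(np.random.randn(8), k=2)  # 8 experts, 2 activated
print(experts, weights)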
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
Arxiv 2024
PDF
Huajian Xin, Daya Guo, Zhihong Shao, Z.Z. Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, Xiaodan Liang
TLDR: This work introduces an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems, demonstrating the potential of large-scale synthetic data to enhance the theorem-proving capabilities of LLMs.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
Arxiv 2024
PDF
Code
DeepSeek-AI Team
TLDR: DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning, and open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.
2023
LongCoder: A Long-Range Pre-trained Language Model for Code Completion
ICML 2023
PDF
Code
Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley
TLDR: This paper introduces LongCoder, a sparse Transformer for code completion over long code inputs; it applies a sliding-window mechanism for local self-attention and introduces globally accessible bridge and memory tokens to capture long-range context.
Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data
EMNLP 2023
PDF
Code
Canwen Xu, Daya Guo*, Nan Duan, Julian McAuley (*Equal Contributions)
TLDR: This work proposes a pipeline that automatically generates a high-quality multi-turn chat corpus by having ChatGPT engage in a conversation with itself, and employs parameter-efficient tuning to enhance LLaMA, an open-source large language model.
Noisy pair corrector for dense retrieval
Findings of EMNLP 2023
PDF
Hang Zhang, Yeyun Gong, Xingwei He, Dayiheng Liu, Daya Guo, Jiancheng Lv, Jian Guo
TLDR: This paper studies how to train an effective dense retrieval model with mismatched-pair noise and proposes Noisy Pair Corrector (NPC), which consists of a detection module and a correction module.
2022
Learning to Complete Code with Sketches
ICLR 2022
PDF
Daya Guo, Alexey Svyatkovskiy, Jian Yin, Nan Duan, Marc Brockschmidt, Miltiadis Allamanis
TLDR: This work automatically generates code sketches, placing holes where ambiguity prevents confident prediction of terminal tokens.
UniXcoder: Unified Cross-Modal Pre-training for Code Representation
ACL 2022
PDF
Code
Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, Jian Yin
TLDR: In this work, we present UniXcoder, a unified cross-modal pre-trained model for programming languages to support both code-related understanding and generation tasks.
ReACC: A Retrieval-Augmented Code Completion Framework
ACL 2022
PDF
Code
Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, Alexey Svyatkovskiy
TLDR: In this work, we propose ReACC, a retrieval-augmented code completion framework that utilizes external context for the code completion task by retrieving semantically and lexically similar code from an existing codebase.
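A minimal sketch of the retrieve-then-complete idea: fetch code similar to the unfinished snippet and prepend it as extra context for a generator. The TF-IDF retriever and the toy codebase below are illustrative stand-ins, not the hybrid retriever used in the paper:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

codebase = [
    "def read_json(path):\n    import json\n    with open(path) as f:\n        return json.load(f)",
    "def rows_to_csv(rows, path):\n    import csv\n    ...",
]

def retrieve(query, corpus, k=1):
    # Rank corpus entries by lexical similarity to the unfinished code.
    vec = TfidfVectorizer()
    mat = vec.fit_transform(corpus + [query])
    sims = cosine_similarity(mat[-1], mat[:-1]).ravel()
    return [corpus[i] for i in sims.argsort()[::-1][:k]]

unfinished = "def load_config(path):\n    # read a json config file"
prompt = "\n\n".join(retrieve(unfinished, codebase)) + "\n\n" + unfinished
# `prompt` would then be passed to a code language model to complete `unfinished`.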
CodeReviewer: Pre-Training for Automating Code Review Activities
FSE 2022
PDF
Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan
TLDR: In this work, we focus on pre-training techniques and introduce a pre-trained model CodeReviewer for automating code review activities.
LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval
Findings of ACL 2022
PDF
Code
Canwen Xu, Daya Guo*, Nan Duan, Julian McAuley (*Equal Contributions)
TLDR: In this paper, we propose LaPraDoR, a pretrained dual-tower dense retriever that does not require any supervised data for zero-shot text retrieval.
Soft-Labeled Contrastive Pre-training for Function-level Code Representation
Findings of EMNLP 2022
PDF
Xiaonan Li, Daya Guo, Yeyun Gong, Yun Lin, Yelong Shen, Xipeng Qiu, Daxin Jiang, Weizhu Chen and Nan Duan
TLDR: In this paper, we present SCodeR to learn function-level code representation with soft-labeled contrastive pre-training.
AR-LSAT: Investigating Analytical Reasoning of Text
Findings of NAACL 2022
PDF
Code and Dataset
Wanjun Zhong, Siyuan Wang, Duyu Tang, Zenan Xu, Daya Guo, Yining Chen, Jiahai Wang, Jian Yin, Ming Zhou and Nan Duan.
TLDR: This paper studies the challenge of analytical reasoning over text, introduces a new dataset consisting of questions from the Law School Admission Test from 1991 to 2016, and designs two different baselines, both of which struggle to solve the task.
2021
GraphCodeBERT: Pre-training Code Representations with Data Flow
ICLR 2021
PDF
Code
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang and Ming Zhou
TLDR: We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code, i.e., data flow, during pre-training.
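A toy illustration of the data-flow structure used here: edges recording where each variable's value comes from. GraphCodeBERT builds such graphs with a multi-language parser and uses them in structure-aware pre-training objectives such as edge prediction and node alignment; this ast-based sketch only handles simple Python assignments:

import ast

def value_comes_from(source):
    # Collect (target, source) edges for simple assignments: the value of
    # `target` comes from each variable read on the right-hand side.
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
            reads = [n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)]
            edges += [(t, r) for t in targets for r in reads]
    return edges

print(value_comes_from("x = a + b\ny = x * 2"))  # [('x', 'a'), ('x', 'b'), ('y', 'x')]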
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
NeurIPS 2021 Datasets and Benchmarks Track
PDF
Dataset
Shuai Lu, Daya Guo*, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu (*Equal Contributions)
TLDR: This paper introduces CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation that includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison.
Multi-modal Representation Learning for Video Advertisement Content Structuring
ACM Multimedia 2021
PDF
Daya Guo, Zhaoyang Zeng
TLDR: In this paper, we propose a multi-modal encoder that learns multi-modal representations from video advertisements by interacting between video-audio and text. The framework achieved 1st place on the Multi-modal Ads Video Understanding task in the ACM Multimedia 2021 Grand Challenge.
Syntax-Enhanced Pre-trained Model
ACL 2021
PDF
Dataset
Zenan Xu, Daya Guo, Duyu Tang, Qinliang Su, Linjun Shou, Ming Gong, Wanjun Zhong, Xiaojun Quan, Nan Duan and Daxin Jiang
TLDR: We present a model that utilizes the syntax of text, i.e., dependency trees, in both the pre-training and fine-tuning stages.
2020
Evidence-Aware Inferential Text Generation with Vector Quantised Variational AutoEncoder
ACL 2020
PDF
Code
Slide
Video
Daya Guo, Duyu Tang, Nan Duan, Jian Yin, Daxin Jiang and Ming Zhou.
TLDR: This work presents an approach equipped with a Vector Quantised-Variational AutoEncoder that automatically finds evidence for an event from a large text corpus and leverages the evidence to guide the generation of inferential texts, achieving state-of-the-art performance on both the Event2Mind and ATOMIC datasets.
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
Findings of EMNLP 2020
PDF
Code
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou.
TLDR: This work develops CodeBERT with a Transformer-based neural architecture and trains it with a hybrid objective function that incorporates the pre-training task of replaced token detection, i.e., detecting plausible alternatives sampled from generators.
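A minimal sketch of the replaced-token-detection objective mentioned in this TLDR, with a random sampler standing in for the small generator model; names, probabilities, and the toy example are illustrative, not the actual CodeBERT training code:

import random

def corrupt(tokens, vocab, replace_prob=0.15):
    # Replace some tokens with sampled alternatives and record which positions
    # were replaced (1) versus kept original (0); a discriminator is then
    # trained to predict these labels from the corrupted sequence.
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < replace_prob:
            proposal = random.choice(vocab)  # stand-in for a learned generator
            corrupted.append(proposal)
            labels.append(int(proposal != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

tokens = "def add ( a , b ) : return a + b".split()
print(corrupt(tokens, vocab=tokens))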
Graph-Based Reasoning over Heterogeneous External Knowledge for Commonsense Question Answering
AAAI 2020
PDF
Code
Shangwen Lv, Daya Guo*, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, and Songlin Hu (*Equal Contributions)
TLDR: This work proposes to automatically extract evidence from heterogeneous knowledge sources and answer questions based on the extracted evidence, achieving state-of-the-art accuracy on the CommonsenseQA leaderboard.
CodeBLEU: A Method for Automatic Evaluation of Code Synthesis
Arxiv 2020
PDF
Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Ming Zhou, Ambrosio Blanco, Shuai Ma
TLDR: This work introduces a new automatic evaluation metric, dubbed CodeBLEU, which absorbs the strength of BLEU in n-gram matching and further injects code syntax via abstract syntax trees (AST) and code semantics via data flow, achieving better correlation with programmer-assigned scores than BLEU and accuracy.
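A minimal sketch of how the final CodeBLEU score combines its four components; the component scores below are placeholders, and the equal weights follow the commonly used default:

def code_bleu(bleu, weighted_ngram, ast_match, dataflow_match,
              alpha=0.25, beta=0.25, gamma=0.25, delta=0.25):
    # Weighted sum of n-gram match (BLEU), keyword-weighted n-gram match,
    # syntactic AST match, and semantic data-flow match.
    return (alpha * bleu + beta * weighted_ngram
            + gamma * ast_match + delta * dataflow_match)

print(code_bleu(bleu=0.42, weighted_ngram=0.47, ast_match=0.61, dataflow_match=0.55))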
Inferential Text Generation with Multiple Knowledge Sources and Meta-Learning
Arxiv 2020
PDF
Daya Guo, Akari Asai, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Jian Yin, Ming Zhou
TLDR: This work uses not only structured commonsense knowledge bases but also natural-language snippets from search-engine results, incorporating them into a generative base model via a key-value memory network, and introduces a meta-learning based multi-task learning algorithm.
2019
Coupling Retrieval and Meta-Learning for Context-Dependent Semantic Parsing
ACL 2019
PDF
Slide
Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin
TLDR: This work incorporates retrieved data points as supporting evidence for context-dependent semantic parsing, such as generating source code conditioned on the class environment, and shows that both the context-aware retriever and the meta-learning strategy improve accuracy.
Multi-Task Learning for Conversational Question Answering over a Large-Scale Knowledge Base
EMNLP 2019
PDF
Code
Video
Tao Shen, Xiubo Geng, Tao Qin, Daya Guo, Duyu Tang, Nan Duan, Guodong Long, and Daxin Jiang
TLDR: This work proposes a multi-task learning framework in which a pointer-equipped semantic parsing model resolves coreference in conversations and naturally enables joint learning with a novel type-aware entity detection model.
Multi-modal Representation Learning for Short Video Understanding and Recommendation
ICMEW 2019
PDF
Code
Daya Guo, Jiangshui Hong, Binli Luo, Qirui Yan, and Zhangming Niu
TLDR: This work proposes a multi-modal representation learning method to improve the performance of recommender systems, together with a novel key-value memory that maps dense real values into vectors to capture richer semantics in a nonlinear manner.
2018
Dialog-to-Action: Conversational Question Answering Over a Large-Scale Knowledge Base
NeurIPS 2018
PDF
Code
Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin
TLDR: This work maps utterances in a conversation to logical forms that are executed on a large-scale knowledge base, and shows that the semantic parsing-based approach outperforms a memory-network-based encoder-decoder model by a large margin.
Question Generation from SQL Queries Improves Neural Semantic Parsing
EMNLP 2018
PDF
Code
Daya Guo, Yibo Sun, Duyu Tang, Nan Duan, Jian Yin, Hong Chi, James Cao, Peng Chen, and Ming Zhou
TLDR: This work conducts a study on WikiSQL, the largest hand-annotated semantic parsing dataset to date, and demonstrates that question generation is an effective method that makes it possible to learn a state-of-the-art neural semantic parser with only thirty percent of the supervised training data.