University of Notre Dame

Neural Models of Automated Documentation Generation for Source Code

Thesis posted on 2022-04-08, authored by Alexander LeClair

Source code summarization is the task of writing a short natural language summary for a section of source code. The most common targets for summarization are a program's subroutines; for example, a Java method is typically summarized by its accompanying JavaDoc description. Summaries help programmers understand code more quickly. The backbone of current code summarization techniques is the attentional encoder-decoder neural architecture, such as a seq2seq model. The research frontier lies in improving these models through more comprehensive representations of source code in the encoder.
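
To make the architecture concrete, the sketch below (in PyTorch) shows a minimal attentional encoder-decoder over code tokens. It is illustrative only: the GRU encoder and decoder, the dot-product attention, and all vocabulary sizes and dimensions are assumptions chosen for exposition, not the models studied in the dissertation.

    # Minimal sketch of an attentional encoder-decoder (seq2seq) for code
    # summarization. Illustrative only: dimensions, vocabularies, and the
    # dot-product attention are assumptions, not the dissertation's models.
    import torch
    import torch.nn as nn

    class AttnSeq2Seq(nn.Module):
        def __init__(self, code_vocab=10000, summ_vocab=5000, dim=256):
            super().__init__()
            self.code_emb = nn.Embedding(code_vocab, dim)
            self.summ_emb = nn.Embedding(summ_vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(2 * dim, summ_vocab)  # context + decoder state

        def forward(self, code_tokens, summ_tokens):
            enc_states, h = self.encoder(self.code_emb(code_tokens))
            dec_states, _ = self.decoder(self.summ_emb(summ_tokens), h)
            # Dot-product attention: each decoder state attends over all
            # encoder states, yielding one context vector per output position.
            scores = torch.bmm(dec_states, enc_states.transpose(1, 2))
            attn = torch.softmax(scores, dim=-1)
            context = torch.bmm(attn, enc_states)
            return self.out(torch.cat([dec_states, context], dim=-1))

    model = AttnSeq2Seq()
    code = torch.randint(0, 10000, (2, 50))  # batch of tokenized methods
    summ = torch.randint(0, 5000, (2, 12))   # batch of summary prefixes
    logits = model(code, summ)               # shape (2, 12, summ_vocab)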

In this dissertation, I cover my work in developing new techniques and insights for source code summarization. Specifically, I look at how we can achieve the best results using only the information provided by the method itself. In my first project, I look at how code can be categorized into topics using pre-trained word embeddings and neural networks. We found that including natural language in the pre-training of word embeddings improves the model's ability to assign methods to the correct project category, and that general English-language word embeddings, such as those trained on Wikipedia, do not perform as well as embeddings trained on source code. In my second project, we built on this work to generate full summaries of source code using sequence-to-sequence models. We found that including the AST as a separate input allows the model to learn structural information about the code, improving summary quality. In my third project, I look at how we can better represent the AST to a neural model. We developed a graph neural network based approach that encodes the AST as a graph instead of a sequence, and we found that this encoding produced better representations of the source code and improved overall summary generation. Lastly, I study how combinations of models trained on different sets of information from source code perform. In these studies, I explore two common model ensembling techniques and analyze how each model's internal representation of source code differs, and how those differences may affect the summary output.
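
As a rough illustration of the graph-based AST encoding described in the third project, the sketch below applies a single mean-aggregation graph-convolution layer to AST node embeddings. The specific layer, dimensions, and toy AST are assumptions made for exposition, not the dissertation's actual model.

    # Minimal sketch of encoding an AST as a graph rather than a sequence.
    # One mean-aggregation graph-convolution layer is shown; the real
    # dissertation models and hyperparameters differ (this is illustrative).
    import torch
    import torch.nn as nn

    class ASTGraphLayer(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            self.lin_self = nn.Linear(dim, dim)
            self.lin_neigh = nn.Linear(dim, dim)

        def forward(self, nodes, adj):
            # nodes: (N, dim) AST node embeddings; adj: (N, N) 0/1 edges.
            deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
            neigh = adj @ nodes / deg  # mean over each node's AST neighbors
            return torch.relu(self.lin_self(nodes) + self.lin_neigh(neigh))

    # Toy AST for "a + b": node 0 is the '+' operator, nodes 1 and 2 its children.
    dim = 256
    nodes = torch.randn(3, dim)
    adj = torch.tensor([[0., 1., 1.],
                        [1., 0., 0.],
                        [1., 0., 0.]])
    layer = ASTGraphLayer(dim)
    encoded = layer(nodes, adj)  # structure-aware node representations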

History

Date Modified

2022-05-10

Defense Date

2022-02-21

CIP Code

  • 40.0501

Research Director(s)

Collin McMillan

Committee Members

David Chiang, Jane Cleland-Huang, Lingfei Wu

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Alternate Identifier

1314917022

Library Record

6209900

OCLC Number

1314917022

Program Name

  • Computer Science and Engineering
