University of Notre Dame

Neural Models of Automated Documentation Generation for Source Code

Thesis posted on 2022-04-08, authored by Alexander LeClair

Source code summarization is the task of writing a short natural language summary for a section of source code. The most common targets for summarization are a program's subroutines; for example, a Java method is typically summarized by its accompanying JavaDoc description. Summaries help programmers understand code more quickly. The backbone of current code summarization techniques is the attentional encoder-decoder neural architecture, such as a seq2seq model. The research frontier lies in improving these models through more comprehensive representations of source code in the encoder.
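
To make the architecture concrete, the sketch below (in PyTorch) shows a minimal attentional encoder-decoder over code tokens. It is illustrative only: the GRU encoder and decoder, the dot-product attention, and all vocabulary sizes and dimensions are assumptions chosen for exposition, not the models studied in the dissertation.

    # Minimal sketch of an attentional encoder-decoder (seq2seq) for code
    # summarization. Illustrative only: dimensions, vocabularies, and the
    # dot-product attention are assumptions, not the dissertation's models.
    import torch
    import torch.nn as nn

    class AttnSeq2Seq(nn.Module):
        def __init__(self, code_vocab=10000, summ_vocab=5000, dim=256):
            super().__init__()
            self.code_emb = nn.Embedding(code_vocab, dim)
            self.summ_emb = nn.Embedding(summ_vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(2 * dim, summ_vocab)  # context + decoder state

        def forward(self, code_tokens, summ_tokens):
            enc_states, h = self.encoder(self.code_emb(code_tokens))
            dec_states, _ = self.decoder(self.summ_emb(summ_tokens), h)
            # Dot-product attention: each decoder state attends over all
            # encoder states, yielding one context vector per output position.
            scores = torch.bmm(dec_states, enc_states.transpose(1, 2))
            attn = torch.softmax(scores, dim=-1)
            context = torch.bmm(attn, enc_states)
            return self.out(torch.cat([dec_states, context], dim=-1))

    model = AttnSeq2Seq()
    code = torch.randint(0, 10000, (2, 50))  # batch of tokenized methods
    summ = torch.randint(0, 5000, (2, 12))   # batch of summary prefixes
    logits = model(code, summ)               # shape (2, 12, summ_vocab)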

In this dissertation, I cover my work in developing new techniques and insights for source code summarization. Specifically, I look at how we can achieve the best results using only the information provided by the method itself. In my first project, I look at how code can be categorized into topics using pre-trained word embeddings and neural networks. We found that including natural language in the pre-training of word embeddings improves the model's ability to assign methods to the correct project category, and that general English-language word embeddings, such as those trained on Wikipedia, do not perform as well as embeddings trained on source code. In my second project, we built on this work to generate full summaries of source code using sequence-to-sequence models. We found that including the AST as a separate input allows the model to learn structural information about the code, improving summary quality. In my third project, I look at how we can better represent the AST to a neural model. We developed a graph neural network based approach that encodes the AST as a graph instead of a sequence, and we found that this encoding produced better representations of the source code and improved overall summary generation. Lastly, I study how combinations of models trained on different sets of information from source code perform. In these studies, I explore two common model ensembling techniques and analyze how each model's internal representation of source code differs, and how those differences may affect the summary output.
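
As a rough illustration of the graph-based AST encoding described in the third project, the sketch below applies a single mean-aggregation graph-convolution layer to AST node embeddings. The specific layer, dimensions, and toy AST are assumptions made for exposition, not the dissertation's actual model.

    # Minimal sketch of encoding an AST as a graph rather than a sequence.
    # One mean-aggregation graph-convolution layer is shown; the real
    # dissertation models and hyperparameters differ (this is illustrative).
    import torch
    import torch.nn as nn

    class ASTGraphLayer(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            self.lin_self = nn.Linear(dim, dim)
            self.lin_neigh = nn.Linear(dim, dim)

        def forward(self, nodes, adj):
            # nodes: (N, dim) AST node embeddings; adj: (N, N) 0/1 edges.
            deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
            neigh = adj @ nodes / deg  # mean over each node's AST neighbors
            return torch.relu(self.lin_self(nodes) + self.lin_neigh(neigh))

    # Toy AST for "a + b": node 0 is the '+' operator, nodes 1 and 2 its children.
    dim = 256
    nodes = torch.randn(3, dim)
    adj = torch.tensor([[0., 1., 1.],
                        [1., 0., 0.],
                        [1., 0., 0.]])
    layer = ASTGraphLayer(dim)
    encoded = layer(nodes, adj)  # structure-aware node representations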

History

Date Modified

2022-05-10

Defense Date

2022-02-21

CIP Code

  • 40.0501

Research Director(s)

Collin McMillan

Committee Members

David Chiang, Jane Cleland-Huang, Lingfei Wu

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Alternate Identifier

1314917022

Library Record

6209900

OCLC Number

1314917022

Program Name

  • Computer Science and Engineering
