University of Notre Dame
HaqueS072022D.pdf (2.3 MB)

Improved Source Code Summarization Using Context-Aware Inputs and Semantic Similarity Based Optimization

thesis
posted on 2022-07-10, 00:00 authored by Sakib Haque

Software documentation largely consists of short, natural language summaries of the subroutines in the software. These summaries help programmers quickly understand what a subroutine does without having to read the source code themselves. The task of writing these descriptions is called "source code summarization." The state of the art in source code summarization is an encoder-decoder neural network with attention. Current research in this area focuses on improving these models on the encoder side by providing better representations of source code. In this dissertation, we present a collection of methods that continue this trend by incorporating context around the source code as well as providing better generalization for these models. First, we incorporate neighboring functions in a file to better predict unique words that do not appear in the function but do appear in the file. Next, we establish action word prediction as a novel sub-problem of source code summarization. Then, we demonstrate that semantic similarity based evaluation metrics are better correlated with human judgement than n-gram matching metrics. Finally, we study the effect of label smoothing as a regularization technique to allow these models to generalize better.

History

Date Modified

2022-08-06

Defense Date

2022-05-12

CIP Code

  • 40.0501

Research Director(s)

Collin McMillan

Committee Members

Toby Li, Tim Weninger, Lingfei Wu

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Alternate Identifier

1338306391

Library Record

6264101

OCLC Number

1338306391

Program Name

  • Computer Science and Engineering
