University of Notre Dame

Accurate Trace Link Generation for Querying Software Projects at Scale

thesis
posted on 2022-06-19, 00:00 authored by Jinfeng Lin

During software development, various software artifacts such as requirements, test cases, design definitions and source code are created by engineers. Project stakeholders can leverage software traceability to establish connections among these artifacts and then use the trace links to issue trace queries that support diverse software development activities.

However, creating and maintaining these links in large industrial projects can be challenging and arduous. Researchers have proposed automated traceability techniques to create, maintain and leverage trace links. Computationally intensive techniques, including repository mining and deep learning, have shown the capability of generating accurate trace links. Nevertheless, trusted, automated tracing at industrial scale has not yet been achieved because of practical performance and accuracy challenges.

To adopt software traceability in production, it is vital to quickly generate accurate, trustworthy trace links. Our work focuses on two important aspects of software traceability, both of which are essential for its deployment in real-world software projects. First, we improve the link quality of the automated trace model by leveraging cutting-edge Natural Language Processing (NLP) and deep learning (DL) techniques. By comprehending the semantic meaning of software artifacts, our approach can produce trace links with significantly better accuracy than classical tracing methods. Traditionally, the low availability of manually created trace links for training purposes has limited the application of deep learning trace models. However, the approach we propose in this work produces reliable links with limited labelled training data by transferring knowledge from adjacent domains. Second, we explore and enhance the performance of software traceability for both link generation and query answering. We formulate traceability computations as reusable workflows, and develop a prototype framework to execute traceability workflows on big data processing platforms. In addition, to take fuller advantage of high-performance computation platforms, we have also addressed the speed of link generation by modifying the architecture of our deep learning model. Our approach can significantly reduce time complexity while achieving approximately the same accuracy as the original model.

We also address a specific tracing challenge raised by our industry collaborators. Artifacts in their dataset intermingle English and a foreign language. To tackle this intermingled bilingual tracing problem, we propose an approach that leverages an IR model, cross-lingual word embeddings and machine translation to mitigate the semantic gap introduced by the use of multiple languages. Although our approach outperforms classical IR-based models, the generated links suffer from accuracy deficiencies. Therefore, we leverage deep learning based methods to further improve the tracing performance and to reduce the time needed to generate links for answering bilingual user queries.

Finally, we look into transfer learning solutions to further improve deep learning models on natural language tracing problems. The last part of this work adapts existing deep learning models for better tracing performance on domain-specific projects. We explored three transfer strategies to obtain domain and tracing knowledge from open platforms such as Google and GitHub. The results show that our approach not only outperforms other language model based deep learning models in resource-rich scenarios, but also tackles resource-limited scenarios that other deep learning models are incapable of handling.

History

Date Modified

2022-08-02

Defense Date

2022-06-07

CIP Code

  • 40.0501

Research Director(s)

Jane Cleland-Huang

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Alternate Identifier

1338149898

Library Record

6263345

OCLC Number

1338149898

Rights Statement

https://creativecommons.org/licenses/by-nc/3.0/

Program Name

  • Computer Science and Engineering
