University of Notre Dame

Knowledge Augmented Methods for Natural Language Processing and Beyond

Thesis, posted on 2023-07-17, authored by Wenhao Yu

The advent of pre-trained language models (PLMs) has indisputably revolutionized the field of natural language processing (NLP). Prior to their emergence, NLP research revolved predominantly around feature extraction and architecture engineering. The introduction of PLMs triggered a paradigm shift, moving the spotlight to pre-training and fine-tuning approaches and, more recently, to prompt-based methodologies. Impressively, language models pre-trained on a broad spectrum of web data have shown an exceptional capacity to internalize a wide range of parametric knowledge, including factual and commonsense knowledge.

Despite this substantial progress, PLMs are not immune to limitations. Notably, they struggle to memorize infrequent information, are susceptible to hallucination, and often suffer temporal degradation as the world changes after pre-training. Moreover, PLMs cannot capture the entirety of continuously evolving world knowledge within their fixed parameter budget. The NLP community has therefore seen a recent surge of interest in enriching language models with non-parametric knowledge, yielding state-of-the-art results across diverse benchmarks. Unlike conventional PLMs that rely solely on their parametric knowledge, these methods draw directly on relevant external non-parametric knowledge, such as documents retrieved from Wikipedia, to improve the language model's understanding of the input. The non-parametric knowledge is incorporated explicitly in a plug-and-play manner, without extensive retraining, which makes the approach highly scalable. At the same time, it is important to acknowledge that scaling model parameters has markedly improved the parametric knowledge contained in very large PLMs such as GPT-3. This offers benefits that cannot be dismissed, including inherent reasoning and deductive capabilities grounded in knowledge acquired from varied sources. In addition, language models that forego the retrieval of non-parametric knowledge are more efficient, since no retrieval step is needed at inference time.
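
As a rough illustration of the retrieval-augmented setup sketched above, and not the thesis's own implementation, the following Python snippet pairs a question with externally retrieved passages before handing everything to a language model. Both helpers are hypothetical stand-ins: retrieve_passages is a toy word-overlap ranker and generate_answer is a stub in place of a real PLM call.

```python
from typing import List


def retrieve_passages(question: str, corpus: List[str], top_k: int = 2) -> List[str]:
    """Toy lexical retriever: rank corpus passages by word overlap with the question."""
    q_tokens = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(q_tokens & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def generate_answer(prompt: str) -> str:
    """Hypothetical stand-in for a call to a pre-trained language model."""
    return f"<model output conditioned on a {len(prompt)}-character prompt>"


if __name__ == "__main__":
    corpus = [
        "Retrieval-augmented models condition generation on external documents.",
        "Parametric knowledge is stored in model weights during pre-training.",
        "Notre Dame is a university located in Indiana.",
    ]
    question = "How do retrieval-augmented models use external documents?"
    passages = retrieve_passages(question, corpus)
    # Non-parametric knowledge enters only through the prompt, so it can be
    # updated or swapped without retraining the underlying model.
    prompt = "\n".join(passages) + f"\n\nQuestion: {question}\nAnswer:"
    print(generate_answer(prompt))
```

The design point the snippet is meant to convey is the plug-and-play nature of non-parametric knowledge: the external documents are injected at inference time, so the knowledge source can change without any additional training.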

This thesis addresses these limitations of contemporary PLMs through a two-fold contribution. The first part examines issues that stem from an excessive dependence on parametric knowledge, including ineffective memorization of infrequent information, hallucination, and limited language diversity. To overcome these issues, the thesis introduces approaches that integrate non-parametric knowledge resources, such as dictionaries, knowledge graphs, and unstructured text, into language models through carefully designed pre-training objectives and fine-tuning methods. The second part tackles the problem of irrelevant retrievals during non-parametric knowledge integration. In response, the thesis introduces a pipeline that generates context documents instead of retrieving them from the web. To address low-diversity context generation and unrealistic responses, it further introduces clustering-based prompting strategies and feedback mechanisms. Moreover, the thesis provides an extensive qualitative and quantitative comparison of parametric and non-parametric knowledge, illuminating their impacts and integration across different tasks. In conclusion, the thesis paves the way for future research and contributes significantly to the advancement of this crucial field.
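
To make the generate-then-read idea above more concrete, the sketch below is a minimal, assumption-laden illustration rather than the thesis's actual pipeline: lm is a placeholder for a real pre-trained language model call, and the list of prompt styles only loosely mirrors the clustering-based prompting strategy mentioned in the abstract.

```python
from typing import List


def lm(prompt: str) -> str:
    """Hypothetical stand-in for a pre-trained language model call."""
    return f"<generated text for: {prompt[:40]}...>"


def generate_contexts(question: str, prompt_styles: List[str]) -> List[str]:
    """Generation step: produce one context document per prompt style, so the
    documents cover the question from different angles."""
    return [lm(f"{style}\nQuestion: {question}\nContext:") for style in prompt_styles]


def read_and_answer(question: str, contexts: List[str]) -> str:
    """Reader step: answer the question conditioned on all generated contexts."""
    joined = "\n\n".join(contexts)
    return lm(f"{joined}\n\nQuestion: {question}\nAnswer:")


if __name__ == "__main__":
    styles = [
        "Write a background paragraph in the style of an encyclopedia entry.",
        "Write a background paragraph in the style of a news report.",
    ]
    question = "What limits the parametric knowledge of pre-trained language models?"
    contexts = generate_contexts(question, styles)
    print(read_and_answer(question, contexts))
```

Because the context documents are generated rather than retrieved, answer quality no longer hinges on a retriever returning relevant passages, which is precisely the failure mode this part of the thesis targets.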

History

Date Modified: 2023-08-04
Defense Date: 2023-07-05

CIP Code: 40.0501
Research Director(s): Meng Jiang
Committee Members: Nitesh Chawla, David Chiang, Heng Ji, Scott Yih
Degree: Doctor of Philosophy
Degree Level: Doctoral Dissertation
Alternate Identifier: 1392285960
OCLC Number: 1392285960
Program Name: Computer Science and Engineering
