University of Notre Dame

Improving Neural Machine Translation for Low-Resource Languages

Thesis posted on 2021-07-12, authored by Toan Q. Nguyen

Neural Machine Translation (NMT), which exploits the power of continuous representations and neural networks, has become the de facto standard in both academic research and industry. Early non-attentional, recurrent neural network-based NMT already showed impressive performance, and with the invention of the attention mechanism it was pushed further to state-of-the-art results. However, like other neural networks, these NMT systems are data-hungry and require millions of training examples to achieve competitive results.

Unfortunately, only a few language pairs have this privilege. Most fall into the low-resource regime and often have only around 100k training sentence pairs or fewer; the LORELEI and IWSLT datasets are examples. This lack of data is particularly critical for early recurrent neural network-based NMT systems, which are often outperformed by traditional phrase-based machine translation (PBMT) systems in low-resource scenarios. It also means that more advanced techniques are required to train a good NMT system effectively for low-resource languages.

In this dissertation, I show that with better use of training data, better normalization, or an improved model architecture, we can in fact build a competitive NMT system with only limited data at hand. On data, I propose two simple methods to improve NMT performance by better exploiting training resources. The first makes use of the relationship between languages for better transfer learning from one language pair to another. The second is a simple data augmentation method based on concatenation, which yields on average +1 BLEU on several language pairs (a sketch follows below). On normalization, I show how simple l2-based normalization at the word embedding and hidden state levels can significantly improve translation for low-resource languages, with a particular focus on rare words (also sketched below). Finally, on model architecture, I investigate three simple yet effective changes. First, I propose a simple lexical module that alleviates the mistranslation of rare words. Second, I study the Transformer model and show how a simple rearrangement of its components can improve both training and performance for low-resource languages. Lastly, I explore untied positional attention in the Transformer and demonstrate how it can improve both performance and interpretability.
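
To make the concatenation-based augmentation concrete, the snippet below shows one plausible way to synthesize extra training examples by joining two existing sentence pairs on both the source and target sides. It is a minimal sketch: the random-sampling strategy, the whitespace separator, and the function name augment_by_concatenation are assumptions made for illustration, not the exact recipe evaluated in the dissertation.

```python
import random


def augment_by_concatenation(pairs, num_new, rng=random.Random(0)):
    """Create extra training examples by concatenating two existing
    sentence pairs (source+source, target+target).

    Sketch only: sampling strategy and separator are illustrative
    assumptions, not the dissertation's exact procedure.
    """
    augmented = []
    for _ in range(num_new):
        # Pick two distinct sentence pairs and join them in the same order
        # on the source side and on the target side.
        (src1, tgt1), (src2, tgt2) = rng.sample(pairs, 2)
        augmented.append((src1 + " " + src2, tgt1 + " " + tgt2))
    return pairs + augmented


if __name__ == "__main__":
    data = [("hello .", "bonjour ."),
            ("thank you .", "merci ."),
            ("good night .", "bonne nuit .")]
    for src, tgt in augment_by_concatenation(data, num_new=2):
        print(src, "|||", tgt)
```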
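
As a rough illustration of the l2-based normalization idea, the PyTorch sketch below scales every word embedding to a fixed l2 norm, so that rare and frequent words have embeddings of the same length. The class name NormalizedEmbedding, the default radius, and the toy sizes are illustrative assumptions; the dissertation's actual formulation (including the hidden-state variant) may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NormalizedEmbedding(nn.Module):
    """Word embedding whose output vectors are rescaled to a fixed l2 norm.

    Sketch of l2-based embedding normalization; names and the default
    radius are assumptions for illustration.
    """

    def __init__(self, vocab_size: int, dim: int, radius: float = 5.0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.radius = radius

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        vectors = self.embed(token_ids)
        # Project every embedding onto the sphere of radius `radius`.
        return self.radius * F.normalize(vectors, p=2, dim=-1)


if __name__ == "__main__":
    emb = NormalizedEmbedding(vocab_size=1000, dim=8)
    out = emb(torch.tensor([[1, 2, 3]]))
    print(out.norm(dim=-1))  # every vector has norm ~5.0
```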

History

Date Modified

2021-08-08

Defense Date

2021-06-28

CIP Code

  • 40.0501

Research Director(s)

David Chiang

Committee Members

Walter Scheirer, Tim Weninger, Kyunghyun Cho

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Alternate Identifier

1262767206

Library Record

6103411

OCLC Number

1262767206

Program Name

  • Computer Science and Engineering
