Neural Machine Translation (NMT), which exploits the power of continuous representations and neural networks, has become the de facto standard in both academic research and industry. While early non-attentional, recurrent neural network-based NMT systems already showed impressive performance, the invention of the attention mechanism pushed them further, to state-of-the-art results. However, like other neural networks, these NMT systems are data-hungry and require millions of training examples to achieve competitive results.
Unfortunately, only a few language pairs enjoy this privilege. Most fall into the low-resource domain, often with only around 100k training sentence pairs or fewer; examples include the LORELEI and IWSLT datasets. This lack of data is particularly critical for the early recurrent neural network-based NMT systems, which are often outperformed by traditional Phrase-based Machine Translation (PBMT) systems in low-resource scenarios. It also means that more advanced techniques are required to train a good NMT system for low-resource languages.
In this dissertation, I will show that with better use of training data, better normalization, or an improved model architecture, we can in fact build a competitive NMT system with only limited data at hand. On data, I propose two simple methods to improve NMT performance by better exploiting training resources. The first makes use of the relationship between languages for better transfer learning from one language pair to another. The second is a simple data augmentation via concatenation, which yields an average gain of +1 BLEU on several language pairs. On normalization, I show how simple L2-based normalization at the word embedding and hidden state levels can significantly improve translation for low-resource languages, with a particular focus on rare words. Finally, on model architecture, I investigate three simple yet effective changes. First, I propose a simple lexical module that alleviates the mistranslation of rare words. Second, I study the Transformer model and show how a simple rearrangement of its components can improve both training and performance for low-resource languages. Lastly, I explore untied positional attention in the Transformer and demonstrate how it can improve both performance and interpretability.
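The concatenation augmentation mentioned above can be sketched in a few lines: two existing training pairs (s1, t1) and (s2, t2) are joined into a new pair ("s1 s2", "t1 t2") and added to the corpus. The sketch below is only illustrative; the function name, the random-sampling scheme, and the amount of augmented data are my assumptions, not the exact recipe studied in the dissertation.

```python
import random

def concat_augment(src_sents, tgt_sents, num_aug, seed=0):
    """Augment a parallel corpus by concatenating random sentence pairs.

    Hypothetical sketch: sample two aligned pairs (s_i, t_i) and
    (s_j, t_j) and emit the concatenated pair (s_i + " " + s_j,
    t_i + " " + t_j), keeping source and target aligned.
    """
    rng = random.Random(seed)
    aug_src, aug_tgt = [], []
    for _ in range(num_aug):
        i = rng.randrange(len(src_sents))
        j = rng.randrange(len(src_sents))
        aug_src.append(src_sents[i] + " " + src_sents[j])
        aug_tgt.append(tgt_sents[i] + " " + tgt_sents[j])
    # Return the original corpus plus the synthetic pairs.
    return src_sents + aug_src, tgt_sents + aug_tgt
```

Because each synthetic pair is built from two aligned originals, the augmented corpus stays parallel, and the model additionally sees longer, multi-sentence training examples.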