Neural Machine Translation (NMT), which exploits the power of continuous representations and neural networks, has become the de facto standard in both academic research and industry. While early non-attentional, recurrent neural network-based NMT systems already showed impressive performance, the invention of the attention mechanism pushed them further, to state-of-the-art results. However, like other neural networks, these NMT systems are data-hungry and require millions of training examples to achieve competitive results.
Unfortunately, only a few language pairs enjoy this privilege. Most fall into the low-resource domain, often with only around 100k training sentence pairs or fewer; examples include the LORELEI and IWSLT datasets. This lack of data is particularly critical for the early recurrent neural network-based NMT systems, which are often outperformed by traditional Phrase-based Machine Translation (PBMT) systems in low-resource scenarios. It also means that more advanced techniques are required to train a good NMT system for low-resource languages.
In this dissertation, I will show that with better use of training data, better normalization, or an improved model architecture, we can in fact build a competitive NMT system with only limited data at hand. On data, I propose two simple methods to improve NMT performance by better exploiting training resources. The first makes use of the relationship between languages for better transfer learning from one language pair to another. The second is a simple data augmentation via concatenation, which yields an average gain of +1 BLEU on several language pairs. On normalization, I show how simple L2-based normalization at the word embedding and hidden state levels can significantly improve translation for low-resource languages, with a particular focus on rare words. Finally, on model architecture, I investigate three simple yet effective changes. First, I propose a simple lexical module that alleviates the mistranslation of rare words. Second, I study the Transformer model and show how a simple rearrangement of its components can improve both training and performance for low-resource languages. Lastly, I explore untied positional attention in the Transformer and demonstrate how it can improve both performance and interpretability.
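The concatenation augmentation mentioned above can be sketched in a few lines: two existing training pairs (s1, t1) and (s2, t2) are joined into a new pair ("s1 s2", "t1 t2") and added to the corpus. The sketch below is only illustrative; the function name, the random-sampling scheme, and the amount of augmented data are my assumptions, not the exact recipe studied in the dissertation.

```python
import random

def concat_augment(src_sents, tgt_sents, num_aug, seed=0):
    """Augment a parallel corpus by concatenating random sentence pairs.

    Hypothetical sketch: sample two aligned pairs (s_i, t_i) and
    (s_j, t_j) and emit the concatenated pair (s_i + " " + s_j,
    t_i + " " + t_j), keeping source and target aligned.
    """
    rng = random.Random(seed)
    aug_src, aug_tgt = [], []
    for _ in range(num_aug):
        i = rng.randrange(len(src_sents))
        j = rng.randrange(len(src_sents))
        aug_src.append(src_sents[i] + " " + src_sents[j])
        aug_tgt.append(tgt_sents[i] + " " + tgt_sents[j])
    # Return the original corpus plus the synthetic pairs.
    return src_sents + aug_src, tgt_sents + aug_tgt
```

Because each synthetic pair is built from two aligned originals, the augmented corpus stays parallel, and the model additionally sees longer, multi-sentence training examples.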