Machine Translation, the subﬁeld of Computer Science that focuses on translating between two human languages, has greatly beneﬁted from neural networks. However, these neural machine translation systems have complicated architectures with many hyperparameters that need to be manually chosen. Frequently, these are selected either through a grid search over values, or by using values commonplace in the literature. However, these are not theoretically justiﬁed and the same values are not optimal for all language pairs and datasets.
Fortunately, the innate structure of the problem allows for optimization of these hyperparameters during training. Traditionally, the hyperparameters of a system are chosen and then a learning algorithm optimizes all of the parameters within the model. In this work, I propose three methods to learn the optimal hyperparameters during the training of the model, allowing for one step instead of two. First, I propose using group regularizers to learn the number, and size of, the hidden neural network layers. Second, I demonstrate how to use a perceptron-like tuning method to solve known problems of undertranslation and label bias. Finally, I propose an Expectation-Maximization based method to learn the optimal vocabulary size and granularity. Using various techniques from machine learning and numerical optimization, this dissertation covers how to learn hyperparameters of a Neural Machine Translation system while training the model itself.