Linear Data Augmentation to Improve Generalization for Imbalanced Learning
thesis
posted on 2024-03-25, 01:52 authored by Damien A. Dablain
<p>Machine learning models struggle to generalize when the number of instances per class is imbalanced. Data augmentation (DA) is a leading approach to improving generalization for under-represented classes. Despite its widespread use, the mechanisms by which DA works are not clearly understood.</p><p>In this dissertation, we take a step toward understanding how DA works with imbalanced data. We begin by building three novel algorithms that incorporate data augmentation to improve generalization for under-represented classes. Based on insights gleaned from this process, we focus on the latent features learned by machine learning (ML) models as potential culprits in poor generalization. We design a suite of tools that use latent features to understand data complexity and class overlap.</p><p>We also find that certain DA methods and parametric ML classifiers (CNN, logistic regression, SVM) incorporate hidden linearity at the front end of training and during inference that may affect generalization when learning with imbalanced data. Further, we demonstrate that parametric ML models rely heavily on the magnitude of a limited number of latent features. During inference, they predict classes based on a combination of latent feature magnitudes that sum to a requisite threshold.</p>
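<p>The last point can be illustrated with a minimal sketch of a linear decision rule, in which each latent feature contributes a weighted term and the class is chosen once the sum crosses a threshold. This is not the dissertation's tooling: the function names <code>feature_contributions</code> and <code>top_k_share</code>, and all weights, feature values, and the bias, are hypothetical toy values chosen only to show how a few large-magnitude features can dominate the score.</p>

```python
import numpy as np


def feature_contributions(weights, bias, features):
    """Return per-feature terms w_i * x_i and the resulting linear score."""
    contributions = weights * features
    score = float(contributions.sum() + bias)
    return contributions, score


def top_k_share(contributions, k):
    """Fraction of total absolute contribution carried by the k largest terms."""
    magnitudes = np.sort(np.abs(contributions))[::-1]
    total = magnitudes.sum()
    return float(magnitudes[:k].sum() / total) if total > 0 else 0.0


# Hypothetical toy values, not taken from the thesis.
weights = np.array([2.0, 1.5, 0.1, -0.05, 0.02])
features = np.array([3.0, 2.0, 0.5, 1.0, 0.5])
bias = -4.0

contributions, score = feature_contributions(weights, bias, features)

# The positive class is predicted when the summed score exceeds the threshold (0 here).
predicted_positive = score > 0.0

# Share of the total contribution magnitude that comes from the two largest features.
share = top_k_share(contributions, k=2)
```

<p>In this toy setting, two of the five features account for nearly all of the score, so the prediction is effectively decided by their magnitudes alone. Whether a trained model behaves this way on real imbalanced data is an empirical question that the sketch does not answer.</p>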