University of Notre Dame

Linear Data Augmentation to Improve Generalization for Imbalanced Learning

thesis
posted on 2024-03-25, 01:52 authored by Damien A. Dablain
Machine learning models struggle to generalize when the number of instances per class is imbalanced. Data augmentation (DA) is a leading approach to improve generalization for under-represented classes. Despite its widespread use, the mechanisms by which DA works are not clearly understood.

In this dissertation, we take a step toward understanding how DA works with imbalanced data. We begin by building three novel algorithms that incorporate data augmentation to improve generalization for under-represented classes. Based on insights gleaned from this process, we focus on the latent features learned by machine learning (ML) models as potential culprits in poor generalization. We design a suite of tools, built on latent features, that can be used to understand data complexity and class overlap.

We also find that certain DA methods and parametric ML classifiers (CNN, logistic regression, SVM) incorporate hidden linearity at the front end of training and during inference that may affect generalization when learning with imbalanced data. Further, we demonstrate that parametric ML models rely heavily on the magnitude of a limited number of latent features. During inference, they predict classes based on a combination of latent feature magnitudes that sum to a requisite threshold.
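As a rough illustration of the abstract's final claim, the sketch below shows how a linear final layer might turn latent feature magnitudes into a class decision: each feature contributes its weight times its magnitude, and the class is predicted when the summed contributions reach a threshold. The feature values, weights, bias, threshold, and helper names are all hypothetical and chosen only for illustration; this is not the dissertation's code or data.

```python
# Minimal sketch of a linear decision rule over latent features.
# All numbers and function names here are illustrative assumptions.

def feature_contributions(features, weights):
    """Per-feature contribution w_i * f_i to the class score."""
    return [w * f for w, f in zip(weights, features)]


def predict(features, weights, bias=0.0, threshold=0.0):
    """Predict the class when the summed contributions reach the threshold."""
    score = sum(feature_contributions(features, weights)) + bias
    return score >= threshold, score


def top_k_share(features, weights, k):
    """Fraction of the total absolute contribution carried by the k largest terms."""
    magnitudes = sorted(
        (abs(c) for c in feature_contributions(features, weights)),
        reverse=True,
    )
    total = sum(magnitudes)
    return sum(magnitudes[:k]) / total if total else 0.0


# Hypothetical latent feature magnitudes (e.g. non-negative activations).
latent = [4.0, 0.1, 3.0, 0.2, 0.0, 0.1]
# Hypothetical final-layer weights for one class.
w = [1.0, 0.5, 1.0, -0.5, 2.0, 0.5]

label, score = predict(latent, w, bias=-5.0)
share = top_k_share(latent, w, k=2)
print(f"predicted={label} score={score:.3f} top-2 share={share:.3f}")
```

In this toy setting, two of the six features carry most of the absolute contribution to the score, which is one way to picture a model that relies on the magnitude of a limited number of latent features.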

History

Date Modified

2023-05-24

Defense Date

2023-05-02

CIP Code

  • 40.0501

Research Director(s)

Nitesh V. Chawla

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Alternate Identifier

1379845315

OCLC Number

1379845315

Additional Groups

  • Computer Science and Engineering

Program Name

  • Computer Science and Engineering
