University of Notre Dame

File(s) under embargo

Linear Data Augmentation to Improve Generalization for Imbalanced Learning

thesis
posted on 2024-03-25, 01:52 authored by Damien A. Dablain

Machine learning models struggle to generalize when classes are numerically imbalanced. Data augmentation (DA) is a leading approach to improving generalization for under-represented classes. Despite its widespread use, the mechanisms by which DA works are not clearly understood.

In this dissertation, we take a step toward understanding how DA works with imbalanced data. We begin by building three novel algorithms that incorporate data augmentation to improve generalization for under-represented classes. Based on insights gleaned from this process, we focus on the latent features learned by machine learning (ML) models as potential culprits in poor generalization. We design a suite of tools, built on latent features, that can be used to understand data complexity and class overlap.

We also find that certain DA methods and parametric ML classifiers (CNN, logistic regression, SVM) incorporate hidden linearity at the front end of training and during inference, which may affect generalization when learning with imbalanced data. Further, we demonstrate that parametric ML models rely heavily on the magnitude of a limited number of latent features. During inference, they predict classes based on a combination of latent feature magnitudes that sum to a requisite threshold.


History

Date Modified

2023-05-24

Defense Date

2023-05-02

CIP Code

  • 40.0501

Research Director(s)

Nitesh V. Chawla

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Alternate Identifier

1379845315

OCLC Number

1379845315

Program Name

  • Computer Science and Engineering
