Linear Data Augmentation to Improve Generalization for Imbalanced Learning

Dablain, Damien A.

doi:10.7274/h128nc61d17

File(s) under embargo

thesis

posted on 2024-03-25, 01:52 authored by Damien A. Dablain

Machine learning models struggle to generalize when the number of instances per class is imbalanced. Data augmentation (DA) is a leading approach to improving generalization for under-represented classes. Despite its widespread use, the mechanisms by which DA works are not clearly understood.

In this dissertation, we take a step toward understanding how DA works with imbalanced data. We begin by building three novel algorithms that incorporate data augmentation to improve generalization for under-represented classes. Based on insights gleaned from this process, we focus on the latent features learned by machine learning (ML) models as potential culprits in poor generalization. We design a suite of tools, built on latent features, that can be used to understand data complexity and class overlap.

We also find that certain DA methods and parametric ML classifiers (CNN, logistic regression, SVM) incorporate hidden linearity at the front end of training and during inference that may affect generalization when learning with imbalanced data. Further, we demonstrate that parametric ML models rely heavily on the magnitudes of a limited number of latent features. During inference, they predict classes based on a combination of latent feature magnitudes that sum to a requisite threshold.

History

Date Modified

2023-05-24

Defense Date

2023-05-02

CIP Code

40.0501

Research Director(s)

Nitesh V. Chawla

Degree

Doctor of Philosophy

Degree Level

Doctoral Dissertation

Alternate Identifier

1379845315

OCLC Number

1379845315

Program Name

Computer Science and Engineering

Keywords

Not Assigned

Licence

CC BY 4.0
