University of Notre Dame

Applications of Modern Machine Learning Approaches to Address Real World Problems

dataset
Posted on 2024-07-17, 18:50. Authored by Tian Yan.
In an era of increasingly available complex real-world data, various quantitative methods have been developed to extract valuable insights from such data. Machine learning (ML) techniques have proven instrumental in modeling these intricate data. This dissertation presents a collection of illustrative examples that showcase the efficacy of ML models in analyzing data from various domains. First, I introduce an innovative l0 regularization technique, coupled with Tucker decomposition, in the framework of tensor regression (TR) and apply it to simulated linear, binomial, and Poisson data and to a real human face image dataset for age prediction. The results suggest that TR with l0 regularization yields better predictions than other decomposition-based TR approaches, with or without regularization, while also identifying important predictors. Second, I investigate the shift in sentiment during and after the COVID-19 pandemic using college subreddit data and examine the effects of different community-level factors on sentiment. A pre-trained Robustly Optimized BERT Pretraining Approach (RoBERTa) model was used to learn text embeddings from the Reddit messages, and a graph attention network (GAT) was used to learn the relational information among posted messages. I applied model stacking to combine the prediction probabilities from RoBERTa and GAT into the final sentiment classification and used a generalized linear mixed-effects model to estimate the effects of various covariates. The odds of negative sentiment in 2020, 2021, and 2022 were statistically significantly higher than in 2019, with 2020 showing the largest increase. In-person learning, larger enrollment, being a public rather than a private school, and very high research activity also statistically significantly increased the odds of negative sentiment.
Third, my collaborators and I leverage the CodeBERT model to predict simulated running time within gem5, a simulation framework for various computer architecture configurations. We generated a dataset that contains both C code scripts and their simulated running times in gem5. We applied the CodeBERT model in three distinct ways to predict the simulated running time and achieved a mean absolute error of 0.546 in regression and an accuracy of 0.696 in classification. To our knowledge, this is the first work that uses ML models to predict gem5 simulation metrics. In summary, the findings from the three projects in this dissertation demonstrate the effectiveness of various ML techniques across different learning tasks, using real-world data of different types from various domains.

History

Date Created

2024-06-29

Date Modified

2024-07-17

Defense Date

2024-01-23

CIP Code

  • 27.9999

Research Director(s)

Fang Liu

Committee Members

  • Xiufan Yu
  • Changbo Zhu

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Language

  • English

Library Record

006603659

OCLC Number

1446452187

Publisher

University of Notre Dame

Additional Groups

  • Applied and Computational Mathematics and Statistics

Program Name

  • Applied and Computational Mathematics and Statistics
