As complex real-world data become increasingly available, a variety of quantitative methods have been developed to extract insights from them. Machine learning (ML) techniques have proven instrumental in modeling such intricate data. This dissertation presents a collection of illustrative examples that showcase the efficacy of ML models in analyzing data from various domains.
First, I introduce an innovative l0 regularization technique, coupled with Tucker decomposition, in the framework of tensor regression (TR) and apply it to simulated linear, binomial, and Poisson data and a real human face image dataset for age prediction. The results suggest improved predictions by TR with l0 regularization compared to other decomposition-based TR approaches, with or without regularization, while also being able to identify important predictors.
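For orientation, a Tucker-format tensor regression with an l0 penalty can be written roughly as follows. This is a generic sketch in assumed notation, not the dissertation's own formulation: the symbols (link function g, intercept \alpha, core tensor \mathcal{G}, factor matrices U_d, tuning parameter \lambda) are illustrative, and which parameters \theta the penalty acts on is left open because the abstract does not specify it.

```latex
% Generic sketch; all notation is assumed for illustration.
% Generalized linear model with a D-way tensor predictor X_i:
\[
g\bigl(\mathbb{E}[y_i]\bigr) = \alpha + \langle \mathcal{B}, \mathcal{X}_i \rangle,
\qquad
\mathcal{B} = \mathcal{G} \times_1 U_1 \times_2 U_2 \cdots \times_D U_D .
\]
% l0-penalized estimation: \|\theta\|_0 counts the nonzero entries of the
% penalized parameters \theta (which parameters are penalized is an assumption).
\[
\min_{\alpha,\,\mathcal{G},\,U_1,\dots,U_D}
\; -\sum_{i=1}^{n} \log p\bigl(y_i \mid \mathcal{X}_i\bigr)
\; + \; \lambda \,\|\theta\|_0 .
\]
```

The Tucker structure reduces the number of free coefficients relative to an unstructured \mathcal{B}, and zeros induced by the l0 penalty are what would allow important predictors to be identified.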
Second, I investigate the shift in sentiment during and after the COVID-19 pandemic using college subreddit data and examine the effects of different community-level factors on the sentiment. A pre-trained Robustly Optimized BERT Pretraining Approach (RoBERTa) model was used to learn text embeddings from the Reddit messages, and a graph attention network (GAT) was leveraged to learn the relational information among posted messages. I applied model stacking to combine the prediction probabilities from RoBERTa and GAT to yield the final sentiment classification and used a generalized linear mixed-effects model to estimate the effects of various covariates. I found that the odds of negative sentiment in 2020, 2021, and 2022 increased statistically significantly compared with 2019, with 2020 showing the largest increase. In-person learning, larger enrollment, being a public rather than a private school, and very high research activity also increased the odds of negative sentiment statistically significantly.
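The stacking step can be illustrated with a minimal sketch in which a logistic-regression meta-learner combines the positive-class probabilities of two base models. Everything here is assumed for illustration: the abstract does not say which meta-learner was used, the function names are hypothetical, and the toy probabilities are made up rather than taken from the RoBERTa or GAT models.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_stacker(p_text, p_graph, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-learner by batch gradient descent.

    p_text and p_graph are the base models' predicted probabilities of the
    positive class; labels are 0/1. The meta-learner choice is an assumption.
    """
    w_text = w_graph = bias = 0.0
    n = len(labels)
    for _ in range(epochs):
        g_text = g_graph = g_bias = 0.0
        for a, c, y in zip(p_text, p_graph, labels):
            err = sigmoid(w_text * a + w_graph * c + bias) - y
            g_text += err * a
            g_graph += err * c
            g_bias += err
        w_text -= lr * g_text / n
        w_graph -= lr * g_graph / n
        bias -= lr * g_bias / n
    return w_text, w_graph, bias


def predict_stacked(params, a, c):
    """Stacked probability of the positive class for one message."""
    w_text, w_graph, bias = params
    return sigmoid(w_text * a + w_graph * c + bias)


# Made-up base-model probabilities for eight messages (illustrative only).
toy_p_text = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_p_graph = [0.8, 0.6, 0.9, 0.7, 0.3, 0.4, 0.1, 0.2]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

params = fit_stacker(toy_p_text, toy_p_graph, toy_labels)
p_high = predict_stacked(params, 0.9, 0.9)
p_low = predict_stacked(params, 0.1, 0.1)
```

In practice the meta-learner would be fit on held-out base-model predictions rather than on the data the base models were trained on, so that it does not reward overfitting in either base model.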
Third, my collaborators and I leverage the CodeBERT model to predict simulated running time within gem5, a simulation framework for various computer architecture configurations. We generated a dataset that pairs C code scripts with their simulated running times in gem5. We applied the CodeBERT model in three distinct ways to predict the simulated running time and achieved a mean absolute error of 0.546 in regression and an accuracy of 0.696 in classification. To our knowledge, this is the first work that uses ML models to predict gem5 simulation metrics.
In summary, the work in this dissertation and the findings from each of the three projects demonstrate the effectiveness of various ML techniques in different learning tasks using real-world data of different types in various domains.
History
Date Created
2024-06-29
Date Modified
2024-07-17
Defense Date
2024-01-23
CIP Code
27.9999
Research Director(s)
Fang Liu
Committee Members
Xiufan Yu
Changbo Zhu
Degree
Doctor of Philosophy
Degree Level
Doctoral Dissertation
Language
English
Library Record
006603659
OCLC Number
1446452187
Publisher
University of Notre Dame
Additional Groups
Applied and Computational Mathematics and Statistics
Program Name
Applied and Computational Mathematics and Statistics