Recent advances in single-cell and multi-omics technologies have enabled high-resolution profiling of cellular states but have also introduced new computational challenges. This dissertation presents machine learning methods to improve data quality and extract insights from high-dimensional, multimodal single-cell datasets.
First, we propose Decaf K-means, a clustering algorithm that accounts for cluster-specific confounding effects, such as batch variation, directly during clustering. This approach improves clustering accuracy in both synthetic and real data.
Second, we develop scPDA, a denoising method for droplet-based single-cell protein data that eliminates the need for empty droplets or null controls. scPDA models protein-protein relationships to enhance denoising accuracy and significantly improves cell-type identification.
Third, we introduce Scouter, a model that predicts transcriptional outcomes of unseen gene perturbations. Scouter combines neural networks with large language models to generalize across perturbations, reducing prediction error by over 50% compared to existing methods.
Finally, we extend Scouter to TranScouter, which predicts transcriptional responses under new biological conditions for which no direct perturbation data are available. Using a tailored encoder-decoder architecture, TranScouter achieves accurate cross-condition predictions, paving the way for more generalizable models in perturbation biology.
History
Date Created: 2025-05-29
Date Modified: 2025-06-09
Defense Date: 2025-03-28
CIP Code: 27.9999
Research Director(s): Jun Li
Committee Members: Xiufan Yu, Tiffany Tang
Degree: Doctor of Philosophy
Degree Level: Doctoral Dissertation
Language: English
Library Record: 006714593
OCLC Number: 1522965215
Publisher: University of Notre Dame
Additional Groups: Applied and Computational Mathematics and Statistics
Program Name: Applied and Computational Mathematics and Statistics