University of Notre Dame
File(s) under permanent embargo

Machine Learning Methods for High-Dimensional Data and Multimodal Single-Cell Data

thesis
posted on 2022-07-11, 00:00 authored by Zixuan Song

Deep neural networks are known for their high prediction accuracy, but also for their black-box nature and poor interpretability. We consider the problem of variable selection in deep neural networks, that is, selecting the input variables that have significant predictive power for the output. Most existing variable selection methods for neural networks apply only to shallow networks or are computationally infeasible on large datasets; moreover, they offer no control over the quality of the selected variables. Here we propose a backward elimination procedure called SurvNet, which is based on a new measure of variable importance that applies to a wide variety of networks. More importantly, SurvNet can empirically estimate and control the false discovery rate of the selected variables. Further, SurvNet adaptively determines how many variables to eliminate at each step in order to maximize selection efficiency. The validity and efficiency of SurvNet are demonstrated on various simulated and real datasets, and its performance is compared with that of other methods. In particular, a systematic comparison with knockoff-based methods shows that, although those methods control the false discovery rate more rigorously on data with strong variable correlation, SurvNet usually has higher power.
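
The general idea of backward elimination with an empirical false discovery rate estimate can be sketched as follows. This is only an illustration under stated assumptions, not the SurvNet procedure itself: the abstract does not specify the importance measure or how the rate is estimated, so the sketch assumes permuted copies of the inputs as null variables, uses absolute least-squares coefficients in place of a network-based importance measure, and drops a fixed fraction of variables per step instead of choosing the number adaptively. The names backward_eliminate, fdr_cutoff, and drop_frac are hypothetical.

```python
import numpy as np

def backward_eliminate(X, y, fdr_cutoff=0.1, drop_frac=0.2, seed=0):
    """Illustrative backward elimination with an empirical FDR estimate."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Assumption: null variables are column-wise permutations of X,
    # so they are unrelated to y by construction.
    null = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
    Z = np.hstack([X, null])
    is_null = np.r_[np.zeros(p, dtype=bool), np.ones(p, dtype=bool)]
    keep = np.arange(2 * p)
    while True:
        n_orig = int((~is_null[keep]).sum())
        n_null = int(is_null[keep].sum())
        # With as many nulls as originals, the surviving nulls give a
        # rough estimate of the false discoveries among the originals.
        est_fdr = n_null / max(n_orig, 1)
        if est_fdr <= fdr_cutoff or n_orig == 0:
            break
        Zs = Z[:, keep]
        Zs = (Zs - Zs.mean(axis=0)) / (Zs.std(axis=0) + 1e-12)
        # Stand-in importance: absolute coefficients of a linear fit.
        coef, *_ = np.linalg.lstsq(Zs, y - y.mean(), rcond=None)
        order = np.argsort(np.abs(coef))
        # Fixed-fraction elimination (the thesis chooses this adaptively).
        n_drop = max(1, int(drop_frac * len(keep)))
        keep = keep[np.sort(order[n_drop:])]
    return keep[~is_null[keep]], est_fdr
```

Calling backward_eliminate(X, y) returns the indices of the retained original variables together with the estimated false discovery rate at the stopping point.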

Autoencoders are a type of deep neural network usually applied in unsupervised learning for dimensionality reduction. Because they are built from multiple layers, autoencoders are also poorly interpretable: the low-dimensional representations they generate from the input variables are difficult to interpret. Hence, we propose SurvEncoder, an unsupervised variable selection method for autoencoders. It defines an easily computable and widely applicable variable importance measure based on the encodings of autoencoders. By adopting the selection procedure of SurvNet and modifying it for unsupervised settings, SurvEncoder provides false discovery rate (FDR) control as well. On various unlabeled datasets, SurvEncoder effectively selects the variables that contribute significantly to dimension reduction and help reveal data structure.

The advent of single-cell multimodal sequencing technologies enables measurement of multiple modalities in individual cells and opens new avenues for characterizing cell types and states. It also necessitates the development of novel analytical and computational methods for integrating multiple measurements, which is challenging because of the heterogeneity of the different data types. We propose VIMMO, an unsupervised method for integrating multimodal omics data. It defines an integrative cell-cell distance measure and quantifies the utility of each modality for distance integration. Whereas WNN, a popular benchmark integrative method, assigns cell-specific modality weights, VIMMO provides modality weights specific to each pair of cells. Furthermore, it offers greater flexibility because it supports user-defined modality weights for integration. The performance of VIMMO is studied on several real and simulated single-cell omics datasets and compared with that of WNN; VIMMO outperforms WNN in various scenarios.
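
To make the notion of pair-specific modality weights concrete, the following minimal sketch combines per-modality cell-cell distance matrices into a single integrated matrix. It is an assumption-laden illustration, not VIMMO's actual formula: the abstract does not state how the weights are derived or how the distances are combined, so the sketch simply takes a weighted average, assumes the input distances are already on comparable scales, and uses the hypothetical name integrate_distances.

```python
import numpy as np

def integrate_distances(dists, weights=None):
    """Weighted average of M cell-cell distance matrices of shape (n, n).

    weights may be None (equal weights), a length-M vector (one global,
    user-defined weight per modality), or an (M, n, n) array holding a
    separate weight for every pair of cells.
    """
    D = np.stack(dists).astype(float)           # shape (M, n, n)
    if weights is None:
        W = np.full(D.shape, 1.0 / D.shape[0])
    else:
        W = np.asarray(weights, dtype=float)
        if W.ndim == 1:                         # global modality weights
            W = np.broadcast_to(W[:, None, None], D.shape)
    # Normalize so the weights for each pair of cells sum to 1.
    W = W / W.sum(axis=0, keepdims=True)
    return (W * D).sum(axis=0)
```

Passing an (M, n, n) weight array corresponds to pair-specific weighting, while a length-M vector corresponds to user-defined modality weights applied uniformly to all pairs.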

History

Date Modified

2022-08-06

Defense Date

2022-06-27

CIP Code

  • 27.9999

Research Director(s)

Jun Li

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Alternate Identifier

1338309508

Library Record

6264132

OCLC Number

1338309508

Program Name

  • Applied and Computational Mathematics and Statistics
