University of Notre Dame

File(s) under permanent embargo

Modeling and Mining on High-Dimensional Biological Data

thesis
posted on 2021-04-06, 00:00 authored by Hongyu Guo

The studies of statistics and biology are interdependent and mutually reinforcing. Breakthroughs in biology create demand for novel statistical tools, and innovative data analysis methods contribute to new findings in biology. In this thesis, I contribute to this actively studied area with two new data mining methods that address problems arising from biological data. Their applications are not limited to biological data; they extend to data from other fields.

The first method concerns the task of assigning samples to pre-defined groups. A straightforward approach to this task is to train a classifier on a reference dataset with known group labels and then use it to classify the target dataset at hand. This approach has two clear disadvantages. A reference dataset is not always available, especially for newly studied problems, and even when one is, the transferability of the classifier from the reference dataset to the target dataset may be questionable if the two datasets were generated with different experimental procedures.

In this thesis, a novel semi-supervised method is introduced that assigns samples to groups based not on a reference dataset but on known information about which input features are important. Note that this information about input features is "qualitative" rather than "quantitative". Compared to previous methods, this method pays special attention to the excessive zeros observed on group-specific features, a common phenomenon in real-world data such as single-cell RNA sequencing datasets. These zeros may be true zeros arising from the internal mechanism of the object under study, or missing values introduced by imperfect data collection techniques. By allowing samples to have low values on the group-specific features and by borrowing information from other highly variable features in the dataset, this method copes with the observed excessive zeros and improves group assignment results compared to three other methods on this topic, as demonstrated on both simulated and real datasets.
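The distinction between true zeros and missing values can be made concrete with a small sketch. The abstract does not state which statistical model the thesis uses, so the zero-inflated Poisson model below is only an assumption chosen for illustration, and the function names `zero_probability` and `dropout_posterior` are hypothetical.

```python
import math


def zero_probability(dropout_rate, mean_expression):
    """P(observed count == 0) under an assumed zero-inflated Poisson model.

    A zero arises either from a technical dropout (probability dropout_rate)
    or from a genuine Poisson count of zero (probability exp(-mean)).
    """
    return dropout_rate + (1.0 - dropout_rate) * math.exp(-mean_expression)


def dropout_posterior(dropout_rate, mean_expression):
    """P(the zero is a technical dropout | a zero was observed), by Bayes' rule."""
    return dropout_rate / zero_probability(dropout_rate, mean_expression)


# For a weakly expressed feature, an observed zero is ambiguous.
low = dropout_posterior(dropout_rate=0.3, mean_expression=0.5)

# For a strongly expressed feature, an observed zero is almost surely a dropout.
high = dropout_posterior(dropout_rate=0.3, mean_expression=10.0)

print(f"P(dropout | zero), low expression:  {low:.3f}")
print(f"P(dropout | zero), high expression: {high:.3f}")
```

Under this assumed model, a zero on a feature that is normally highly expressed in a group carries little evidence against membership in that group, which is one way to see why a method should tolerate low values on group-specific features.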

The second method addresses classification when the dimension of the input varies. Off-the-shelf classifiers typically require input of a fixed dimension, that is, the number of input features is the same for all observations. This thesis considers a situation in which the input dimension differs from observation to observation. Graphlets, small connected subgraphs of a large network, are an efficient way to capture network structure. Here they are used to classify protein structures into pre-defined groups by extracting and summarizing structural features from protein structure networks built from protein 3D information. The challenge arises from the variable size of the graphlet-based measure, which is a matrix whose number of rows depends on the number of amino acids in the corresponding protein chain. New approaches are needed to use graphlet-based measures for classification, because standard classification methods cannot directly accept such variable-size matrices as input.

In this thesis, a deep neural network (DNN) classifier composed of both convolutional and recurrent layers is developed to tackle this challenge. The graphlet-based measure is also enhanced by a new way of applying graphlets to weighted protein structure networks, together with a new weight definition proposed in this study. Taken together, this approach shows dramatic improvements in performance over existing graphlet-based approaches on 36 real datasets. Even compared with the state-of-the-art approach, it almost halves the classification error. Beyond protein structure networks, this weighted-graphlet measure and DNN classifier can potentially be applied to the classification of other weighted networks in computational biology as well as in other fields.
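As a rough illustration of why a recurrent layer can accept matrices with different numbers of rows, the sketch below runs a plain tanh recurrent layer over the rows of a matrix and returns the final hidden state, whose size is independent of the number of rows. This is not the thesis's architecture, which also uses convolutional layers and is not specified in the abstract. The layer sizes, random weights, and names such as `rnn_final_state` are hypothetical, and the random matrices merely stand in for graphlet-based measures.

```python
import math
import random


def rnn_final_state(rows, w_in, w_rec, bias):
    """Run a plain tanh recurrent layer over the rows of a matrix.

    rows  : list of feature vectors, one per amino acid (any number of rows)
    w_in  : hidden_size x n_features input weights
    w_rec : hidden_size x hidden_size recurrent weights
    bias  : hidden_size bias vector
    Returns the final hidden state; its length equals len(bias),
    regardless of how many rows were consumed.
    """
    hidden = [0.0] * len(bias)
    for x in rows:
        new_hidden = []
        for j in range(len(bias)):
            total = bias[j]
            total += sum(w_in[j][k] * x[k] for k in range(len(x)))
            total += sum(w_rec[j][k] * hidden[k] for k in range(len(hidden)))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden
    return hidden


def random_matrix(n_rows, n_cols, rng):
    """Matrix of uniform random values in [-0.5, 0.5] (placeholder data)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_cols)] for _ in range(n_rows)]


rng = random.Random(0)
n_features, hidden_size = 4, 3  # hypothetical sizes
w_in = random_matrix(hidden_size, n_features, rng)
w_rec = random_matrix(hidden_size, hidden_size, rng)
bias = [0.0] * hidden_size

# Two "proteins" with different chain lengths, hence different row counts.
short_protein = random_matrix(5, n_features, rng)
long_protein = random_matrix(12, n_features, rng)

state_short = rnn_final_state(short_protein, w_in, w_rec, bias)
state_long = rnn_final_state(long_protein, w_in, w_rec, bias)

print(len(state_short), len(state_long))
```

Both inputs map to a vector of the same length, so a standard fixed-input classification layer could be placed on top of the final hidden state.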

History

Date Modified

2021-05-11

CIP Code

  • 27.9999

Research Director(s)

Jun Li

Committee Members

Stefano Castruccio, Steven Buechler

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Alternate Identifier

1250265010

Library Record

6012888

OCLC Number

1250265010

Program Name

  • Applied and Computational Mathematics and Statistics
