File(s) under permanent embargo
Modeling and Mining on High-Dimensional Biological Data
The studies of statistics and biology are interdependent and mutually reinforcing. Breakthroughs in biology create demand for novel statistical tools, and innovative data analysis methods contribute to new findings in biology. This thesis contributes to this actively studied area with two new data mining methods that address problems arising from biological data. Their applications are not limited to biological data but extend to data from other fields.
The first method relates to the task of assigning samples to pre-defined groups. A simple and straightforward approach to this task would be to train a classifier on a reference dataset with known group labels and then use this classifier to classify the target dataset at hand. However, this approach has two clear disadvantages. A reference dataset is not always available, especially for newly studied problems, and even if it is, the transferability of the classifier from the reference dataset to the target dataset may be questionable when the experimental procedures used to generate the two datasets differ.
In this thesis, a novel semi-supervised method is introduced to conduct sample group assignment, based not on a reference dataset but on known information about which input features are important. Note that this information about input features is "qualitative" rather than "quantitative". Compared to previous methods, this method pays special attention to the excessive zeros observed on group-specific features, a common phenomenon in real-world data such as single-cell RNA sequencing datasets. These zeros could be true zeros due to the internal mechanism of the object under study, or missing values introduced by imperfect data collection techniques. By designing a mechanism that allows samples to have low values on the group-specific features and borrows information from other highly variable features in the dataset, this method copes with the observed excessive zeros and improves group assignment results compared to three other methods on this topic, as demonstrated on both simulated and real datasets.
The second method deals with classification problems in which the dimension of the input varies. Off-the-shelf classifiers typically require a fixed input dimension, that is, the number of input features is the same for all observations. This thesis considers the situation in which the input dimension differs from observation to observation. Graphlets, small connected subgraphs of a large network, are an efficient way to capture network structure. Here they are used to classify protein structures into pre-defined groups by extracting and summarising structural features from protein structure networks built from protein 3D information. The challenge arises from the variable size of the graphlet-based measure, which is a matrix whose number of rows depends on the number of amino acids in the corresponding protein chain. New approaches are needed to use graphlet-based measures for classification, as standard classification methods cannot directly accept such variable-size matrices as input.
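To make the variable-size input concrete, the following is a minimal sketch of a graphlet-based node measure, not the measure used in the thesis. It assumes an unweighted, undirected network and counts only the four orbits of the 2-node and 3-node graphlets, whereas the thesis works with weighted networks and does not state here which graphlets it uses. The function name `orbit_matrix` and the toy networks are hypothetical.

```python
from itertools import combinations

def orbit_matrix(adjacency):
    """Return one row [o0, o1, o2, o3] per node, in sorted node order.

    adjacency maps each node to the set of its neighbours (undirected,
    no self-loops). Orbits counted per node v:
      o0: edges touching v (its degree)
      o1: induced 3-node paths with v at an end
      o2: induced 3-node paths with v in the middle
      o3: triangles containing v
    """
    rows = []
    for v in sorted(adjacency):
        nbrs = adjacency[v]
        degree = len(nbrs)
        # Pairs of neighbours that are themselves connected close a triangle.
        triangles = sum(1 for u, w in combinations(nbrs, 2) if w in adjacency[u])
        # The remaining neighbour pairs form an open path centred on v.
        path_middle = degree * (degree - 1) // 2 - triangles
        # Paths v-u-w where w is neither v nor a neighbour of v.
        path_end = sum(len(adjacency[u] - nbrs - {v}) for u in nbrs)
        rows.append([degree, path_end, path_middle, triangles])
    return rows

# Two toy "chains" of different length give matrices with different row counts.
short_chain = {0: {1}, 1: {0, 2}, 2: {1}}
longer_chain = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}

print(len(orbit_matrix(short_chain)), "rows x 4 columns")
print(len(orbit_matrix(longer_chain)), "rows x 4 columns")
```

Each network yields a matrix with one row per node and a fixed number of columns, so two proteins with different chain lengths produce inputs of different shapes, which is the situation described above.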
In this thesis, a deep neural network (DNN) classifier composed of both convolutional and recurrent layers is developed to tackle this challenge. The graphlet-based measure is also enhanced by a new way of applying graphlets to weighted protein structure networks, together with a new weight definition proposed in this study. Put together, this approach shows dramatic improvements in performance over existing graphlet-based approaches on 36 real datasets. Even compared with the state-of-the-art approach, it almost halves the classification error. In addition to protein structure networks, this weighted-graphlet measure and DNN classifier can potentially be applied to the classification of other weighted networks in computational biology as well as in other fields.
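The sketch below illustrates, under stated assumptions, why a convolutional layer followed by a recurrent layer can accept matrices with different numbers of rows. It is a forward pass only, in plain Python, with untrained random weights. The layer sizes, kernel width, the plain tanh recurrence, and all names (`init_params`, `classify`, and so on) are hypothetical; the abstract does not specify the actual architecture.

```python
import math
import random

def init_params(n_features, n_channels=4, width=3, hidden=5, n_classes=2, seed=0):
    """Random, untrained weights; all sizes are illustrative assumptions."""
    rng = random.Random(seed)
    def rand(*shape):
        if len(shape) == 1:
            return [rng.uniform(-0.5, 0.5) for _ in range(shape[0])]
        return [rand(*shape[1:]) for _ in range(shape[0])]
    return {
        "kernels": rand(n_channels, width, n_features),
        "conv_bias": rand(n_channels),
        "w_in": rand(hidden, n_channels),
        "w_rec": rand(hidden, hidden),
        "w_out": rand(n_classes, hidden),
        "b_out": rand(n_classes),
    }

def conv_rows(matrix, kernels, bias):
    """Slide each kernel down the rows (1D convolution) and apply ReLU."""
    width = len(kernels[0])
    steps = []
    for start in range(len(matrix) - width + 1):
        window = matrix[start:start + width]
        step = []
        for kernel, b in zip(kernels, bias):
            total = b + sum(k * x
                            for k_row, x_row in zip(kernel, window)
                            for k, x in zip(k_row, x_row))
            step.append(max(0.0, total))
        steps.append(step)
    return steps

def last_hidden_state(steps, w_in, w_rec):
    """Run a plain tanh recurrent layer and keep only the final state."""
    h = [0.0] * len(w_in)
    for x in steps:
        h = [math.tanh(sum(w * xi for w, xi in zip(w_in[j], x)) +
                       sum(w * hk for w, hk in zip(w_rec[j], h)))
             for j in range(len(w_in))]
    return h

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(matrix, p):
    """Map a (rows x n_features) matrix of any row count to class probabilities."""
    steps = conv_rows(matrix, p["kernels"], p["conv_bias"])
    h = last_hidden_state(steps, p["w_in"], p["w_rec"])
    logits = [b + sum(w * x for w, x in zip(row, h))
              for row, b in zip(p["w_out"], p["b_out"])]
    return softmax(logits)

params = init_params(n_features=4)
short_input = [[1, 1, 0, 0], [2, 0, 1, 0], [1, 1, 0, 0]]
long_input = [[float(i % 3), 1.0, 0.0, float(i % 2)] for i in range(9)]
print(len(classify(short_input, params)), len(classify(long_input, params)))
```

The convolution produces one step per window position, so its output length still varies with the input. The recurrent layer then folds that sequence into a hidden state of fixed size, which is what lets the final layer produce the same number of class probabilities for every input.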
History
Date Modified
2021-05-11
CIP Code
- 27.9999
Research Director(s)
Jun Li
Committee Members
Stefano Castruccio, Steven Buechler
Degree
- Doctor of Philosophy
Degree Level
- Doctoral Dissertation
Alternate Identifier
1250265010
Library Record
6012888
OCLC Number
1250265010
Program Name
- Applied and Computational Mathematics and Statistics