Data Mining for Biological Data Learning: Algorithm and Application

Doctoral Dissertation


Rapid advances in technology have made large amounts of experimental data on complex biological systems increasingly available. For example, microarray technology enables biologists to monitor the expression profiles of thousands of genes simultaneously, generating large volumes of gene expression data, and next-generation sequencing is producing a deluge of DNA sequence data. Life science researchers are accumulating massive data sets on the assumption that something in the data will stimulate important questions and insights. This presents both opportunities and challenges in leveraging these data efficiently and effectively for novel discovery.

Data mining, the process of analyzing data from different perspectives and summarizing them into useful information and patterns, is of immense importance in bioinformatics and in biomedical science more generally. In particular, supervised data mining has been used to great effect in numerous bioinformatics prediction problems. As more data from more varied sources accumulate every day, sophisticated computational analysis and data mining are required. A major bottleneck so far is analyzing huge, noisy, and heterogeneous data sets quickly and accurately. My PhD research focuses on applying data mining algorithms and tools to these challenging computational problems in bioinformatics.

We first present a two-stage data mining approach for pathway analysis. In the first stage, informative genes that can represent a pathway are selected using feature selection methods. In the second stage, pathways are ranked based on their representative genes using classification methods. Next, we demonstrate a machine learning framework for trait-based microbial ecology using whole genome sequence data. We use this framework to quantitatively link genotypes with functional traits. Finally, we extend this framework to handle continuous functional traits. Specifically, we use Random Forest Regression to predict continuous functional traits based solely on whole genome sequences and to identify a small set of biomarkers that are relevant to those traits. We also incorporate network analysis, which provides correlation information that further narrows down the results.
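The two-stage pathway approach described above can be illustrated with a minimal sketch. The abstract does not name the specific feature selection or classification methods, so everything here is an assumption chosen for brevity: a Welch-style t score for gene selection, a nearest-centroid classifier scored by leave-one-out accuracy for pathway ranking, and the toy gene names, pathway names, and expression values. It uses only the Python standard library and is not the dissertation's implementation.

```python
from statistics import mean, stdev

def t_score(values, labels):
    """Absolute Welch-style t statistic between class 0 and class 1 (assumed scoring rule)."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / (se + 1e-12)  # small constant avoids division by zero

def select_genes(expr, labels, genes, k):
    """Stage 1: keep the k genes of a pathway with the highest scores."""
    return sorted(genes, key=lambda g: t_score(expr[g], labels), reverse=True)[:k]

def loo_accuracy(expr, labels, genes):
    """Stage 2: leave-one-out accuracy of a nearest-centroid classifier."""
    n = len(labels)
    correct = 0
    for i in range(n):
        dist = {}
        for c in (0, 1):
            # Class centroid computed without the held-out sample i.
            idx = [j for j in range(n) if j != i and labels[j] == c]
            centroid = [mean(expr[g][j] for j in idx) for g in genes]
            dist[c] = sum((expr[g][i] - m) ** 2 for g, m in zip(genes, centroid))
        correct += min(dist, key=dist.get) == labels[i]
    return correct / n

def rank_pathways(expr, labels, pathways, k=2):
    """Rank pathways by how well their selected genes separate the two classes.

    Note: genes are selected on all samples before cross-validation, which is
    optimistic; a careful analysis would repeat the selection inside each fold.
    """
    scores = {}
    for name, genes in pathways.items():
        chosen = select_genes(expr, labels, genes, k)
        scores[name] = loo_accuracy(expr, labels, chosen)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: six samples, two phenotype classes.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "g1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.8],  # differs between classes
    "g2": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # flat
    "g3": [0.5, 0.4, 0.6, 2.5, 2.4, 2.6],  # differs between classes
    "g4": [1.0, 3.0, 2.0, 1.1, 2.9, 2.1],  # unrelated to class
    "g5": [5.0, 4.0, 6.0, 5.1, 4.1, 5.9],  # unrelated to class
}
pathways = {"pathway_A": ["g1", "g2", "g3"], "pathway_B": ["g4", "g5", "g2"]}

ranking = rank_pathways(expr, labels, pathways, k=2)
for name, acc in ranking:
    print(name, round(acc, 2))
```

In this toy setting the pathway whose genes track the phenotype ranks above the pathway made of unrelated genes. Any scoring rule or classifier could be substituted at either stage without changing the overall structure.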


Attribute Name    Values
  • etd-12052013-151616

Author Wei Zhang
Advisor Scott Emrich
Contributor Scott Emrich, Committee Member
Contributor Kevin Bowyer, Committee Member
Contributor Stuart E. Jones, Committee Member
Contributor Timothy Weninger, Committee Member
Degree Level Doctoral Dissertation
Degree Discipline Computer Science and Engineering
Degree Name Doctor of Philosophy
Defense Date
  • 2013-12-02

Submission Date 2013-12-05
  • United States of America

  • bioinformatics

  • genomics

  • feature selection

  • machine learning

  • data mining

  • University of Notre Dame

  • English

Record Visibility Public
Content License
  • All rights reserved


