University of Notre Dame
ZhangW122013D.pdf (2.18 MB)

Data Mining for Biological Data Learning: Algorithm and Application

Posted on 2013-12-05, 00:00; authored by Wei Zhang

Owing to rapid technological development, large amounts of experimental data on complex biological systems have become increasingly available. For example, microarray technology enables biologists to monitor the expression profiles of thousands of genes simultaneously, generating large volumes of gene expression data, and next-generation sequencing technology is leading to a deluge of DNA sequence data. Life science researchers are accumulating massive data sets on the assumption that something in the data will stimulate important questions and insights. This presents both opportunities and challenges in leveraging these data efficiently and effectively for novel discovery.

Data mining, the process of analyzing data from different perspectives and summarizing them into useful information and patterns, is of immense importance in bioinformatics and in biomedical science more generally. In particular, supervised data mining has been used to great effect in numerous bioinformatics prediction problems. As more, and more varied, sources of data accumulate every day, making use of them requires sophisticated computational analysis and data mining. One major bottleneck so far is how to analyze huge, noisy, and heterogeneous data sets quickly and precisely. My PhD research focuses on applying data mining algorithms and tools to tackle these challenging and interesting computational problems in bioinformatics.

We first present a two-stage data mining approach for pathway analysis. In the first stage, informative genes that can represent a pathway are selected using feature selection methods. In the second stage, pathways are ranked based on their representative genes using classification methods. Then, we demonstrate a machine learning framework for trait-based microbial ecology using whole-genome sequence data. We use this framework to quantitatively link genotypes with functional traits. Finally, we extend this framework to handle continuous functional traits. Specifically, we use Random Forest Regression to predict continuous functional traits based solely on whole-genome sequences and to identify a small set of biomarkers that are relevant to those traits. We also incorporate network analysis, which provides correlation information to further narrow down the results.
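
As a rough illustration of the two-stage idea described above (select representative genes per pathway, then rank pathways by how well those genes classify samples), here is a minimal sketch in plain Python. It is not the dissertation's implementation: the abstract does not name the specific feature selection or classification methods, so the mean-difference filter score, the nearest-centroid classifier, the leave-one-out evaluation, and all function names, gene names, and expression values below are assumptions chosen only to keep the example small and self-contained.

```python
from statistics import mean, pstdev


def gene_score(values, labels):
    """Filter score (assumed): |difference of class means| / overall spread."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = pstdev(values) or 1.0  # guard against a constant gene
    return abs(mean(pos) - mean(neg)) / spread


def select_genes(expr, genes, labels, k):
    """Stage 1: keep the k highest-scoring genes of one pathway."""
    ranked = sorted(genes, key=lambda g: gene_score(expr[g], labels), reverse=True)
    return ranked[:k]


def loo_accuracy(expr, genes, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier (assumed).

    Requires at least two samples per class so no centroid is ever empty.
    """
    n = len(labels)
    correct = 0
    for i in range(n):
        sample = [expr[g][i] for g in genes]
        dist = {}
        for cls in (0, 1):
            idx = [j for j in range(n) if j != i and labels[j] == cls]
            centroid = [mean([expr[g][j] for j in idx]) for g in genes]
            dist[cls] = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        predicted = min(dist, key=dist.get)
        correct += predicted == labels[i]
    return correct / n


def rank_pathways(expr, pathways, labels, k=2):
    """Stage 2: rank pathways by the accuracy of their selected genes."""
    scored = []
    for name, genes in pathways.items():
        selected = select_genes(expr, genes, labels, k)
        scored.append((name, loo_accuracy(expr, selected, labels), selected))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: six samples, labels 0/1, expression values per gene.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "g1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # differs between classes
    "g2": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # roughly constant
    "g3": [0.5, 0.4, 0.6, 2.5, 2.4, 2.6],  # differs between classes
    "g4": [1.0, 3.0, 2.0, 2.0, 1.0, 3.0],  # unrelated to the labels
    "g5": [2.0, 1.0, 3.0, 3.0, 2.0, 1.0],  # unrelated to the labels
}
pathways = {"pathway_A": ["g1", "g2", "g3"], "pathway_B": ["g4", "g5"]}

ranking = rank_pathways(expr, pathways, labels, k=2)
for name, accuracy, selected in ranking:
    print(name, accuracy, selected)
```

In this toy setting the pathway whose selected genes separate the two classes is ranked first. A real analysis would substitute the feature selection and classification methods actually used in the dissertation and would evaluate on measured expression data.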


Date Modified


Defense Date


Research Director(s)

Scott Emrich

Committee Members

Scott Emrich, Kevin Bowyer, Stuart E. Jones, Timothy Weninger


  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation


  • English

Alternate Identifier



University of Notre Dame

Program Name

  • Computer Science and Engineering
