Data Mining for Biological Data Learning: Algorithm and Application

Zhang, Wei

doi:10.7274/2514nk33x3t

ZhangW122013D.pdf (2.18 MB)

thesis

posted on 2013-12-05, 00:00 authored by Wei Zhang

Rapid advances in technology have made large amounts of experimental data on complex biological systems increasingly available. For example, microarray technology enables biologists to monitor the expression profiles of thousands of genes simultaneously, generating large volumes of gene expression data, and next-generation sequencing is producing a deluge of DNA sequence data. Life science researchers are accumulating massive data sets on the assumption that something in the data will stimulate important questions and insights. This creates both opportunities and challenges in leveraging these data efficiently and effectively for novel discovery.

Data mining, the process of analyzing data from different perspectives and summarizing them into useful information and patterns, is of immense importance in bioinformatics and in biomedical science more generally. In particular, supervised data mining has been used to great effect in numerous bioinformatics prediction problems. As more sources of data, and more varied ones, accumulate every day, making use of them requires sophisticated computational analysis and data mining. One major bottleneck so far is how to analyze huge, noisy, and heterogeneous data sets quickly and accurately. My PhD research focuses on applying data mining algorithms and tools to tackle these challenging and interesting computational problems in bioinformatics.

We first present a two-stage data mining approach for pathway analysis. In the first stage, informative genes that can represent a pathway are selected using feature selection methods. In the second stage, pathways are ranked based on their representative genes using classification methods. We then demonstrate a machine learning framework for trait-based microbial ecology using whole-genome sequence data, and use this framework to quantitatively link genotypes with functional traits. Finally, we extend this framework to handle continuous functional traits. Specifically, we use Random Forest Regression to predict continuous functional traits based solely on whole-genome sequences and to identify a small set of biomarkers relevant to those traits. We also incorporate network analysis, which provides correlation information to further narrow down the results.
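
The two-stage idea for pathway analysis can be illustrated with a minimal sketch. The abstract does not say which feature selection or classification methods the thesis uses, so everything below is an assumption made for illustration: a simple signal-to-noise score stands in for feature selection, a nearest-centroid classifier with leave-one-out cross-validation stands in for the classification step, and all function names, gene names, and expression values are hypothetical toy choices.

```python
from statistics import mean, pstdev


def gene_score(values, labels):
    """Signal-to-noise score: class-mean gap divided by summed class spreads."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    return abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-9)


def select_genes(expr, labels, genes, k):
    """Stage 1 (assumed method): keep the k best-separating genes of a pathway."""
    ranked = sorted(genes, key=lambda g: gene_score(expr[g], labels), reverse=True)
    return ranked[:k]


def centroid_predict(expr, labels, genes, train_idx, test_idx):
    """Assign the test sample to the class with the nearest training centroid."""
    dist = {}
    for c in (0, 1):
        members = [i for i in train_idx if labels[i] == c]
        centroid = [mean(expr[g][i] for i in members) for g in genes]
        sample = [expr[g][test_idx] for g in genes]
        dist[c] = sum((s - m) ** 2 for s, m in zip(sample, centroid))
    return min(dist, key=dist.get)


def pathway_accuracy(expr, labels, genes, k):
    """Stage 2 (assumed method): leave-one-out accuracy of the pathway's genes.

    Genes are re-selected inside every fold, so the held-out sample never
    influences which genes represent the pathway.
    """
    n = len(labels)
    correct = 0
    for held in range(n):
        train = [i for i in range(n) if i != held]
        train_expr = {g: [expr[g][i] for i in train] for g in genes}
        train_labels = [labels[i] for i in train]
        chosen = select_genes(train_expr, train_labels, genes, k)
        correct += centroid_predict(expr, labels, chosen, train, held) == labels[held]
    return correct / n


def rank_pathways(expr, labels, pathways, k=2):
    """Rank pathways by how well their representative genes classify samples."""
    scores = {name: pathway_accuracy(expr, labels, genes, k)
              for name, genes in pathways.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: six samples, two phenotype classes.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "g1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # separates the classes
    "g2": [0.5, 0.4, 0.6, 2.0, 2.2, 1.9],  # separates the classes
    "g3": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],  # unrelated to the classes
    "g4": [5.0, 5.1, 4.9, 5.0, 5.1, 4.9],  # unrelated to the classes
    "g5": [3.0, 3.1, 3.0, 3.1, 3.0, 3.1],  # unrelated to the classes
}
pathways = {"signal": ["g1", "g2", "g3"], "noise": ["g3", "g4", "g5"]}

ranking = rank_pathways(expr, labels, pathways, k=2)
for name, accuracy in ranking:
    print(f"{name}: leave-one-out accuracy {accuracy:.2f}")
```

In this toy setting the pathway containing the two class-separating genes ranks first. A real analysis would use measured expression data and curated pathway definitions, and could substitute any feature selection or classification method for the stand-ins used here.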

History

Date Modified

2017-06-05

Defense Date

2013-12-02

Research Director(s)

Scott Emrich

Committee Members

Scott Emrich, Kevin Bowyer, Stuart E. Jones, Timothy Weninger

Degree

Doctor of Philosophy

Degree Level

Doctoral Dissertation

Language

English

Alternate Identifier

etd-12052013-151616

Publisher

University of Notre Dame

Program Name

Computer Science and Engineering

Keywords

bioinformatics, genomics, feature selection, machine learning, data mining
