University of Notre Dame

Expediting Analysis and Improving Fidelity of Big Data Genomics

Thesis posted on 2017-07-07, authored by Olivia Choudhury

Genomics, the study of genome-derived data, has had widespread impact in applications including medicine, forensic science, human evolution, environmental science, and social science. The plummeting cost of genome sequencing over the last decade has spurred exponential growth in genomic data. The rate of data generation from these sequencing techniques has outpaced the growth in computing throughput predicted by Moore's Law, creating a major bottleneck in data processing and analysis. Emerging genomic data is also characterized by missing and erroneous values that reduce its fidelity and limit its applicability for downstream analysis. This motivates the following research questions: (i) Can we design frameworks that expedite data analysis and enable efficient utilization of computational resources? (ii) Can we develop accurate and efficient algorithms to improve data fidelity in genomic applications?

We address the first problem by developing a parallel data analysis framework that accelerates large-scale comparative genomics applications. We show that optimal data partitioning and caching significantly improve the performance of such a framework. We further construct a predictive model that estimates the runtime configurations needed to make optimal use of cloud and cluster-based resources when executing data-intensive applications.
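
As a rough illustration of the data-partitioning idea only (not the framework developed in the thesis), the sketch below splits an input set of sequences into fixed-size partitions and fans the comparison work out across worker processes. The names partition, compare_chunk, and PARTITION_SIZE, and the toy k-mer comparison standing in for a real aligner, are assumptions made for this example.

    # Minimal sketch of chunk-based partitioning for a parallel comparison task.
    # Hypothetical names and a toy k-mer overlap measure; illustration only.
    from multiprocessing import Pool

    PARTITION_SIZE = 4  # assumed tuning knob: sequences per partition

    def partition(sequences, size):
        """Split the input sequence list into fixed-size partitions."""
        return [sequences[i:i + size] for i in range(0, len(sequences), size)]

    def compare_chunk(chunk):
        """Placeholder comparison: count shared k-mers between consecutive
        sequences in the chunk (stand-in for a real pairwise aligner)."""
        k = 5
        results = []
        for a, b in zip(chunk, chunk[1:]):
            kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
            kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
            results.append(len(kmers_a & kmers_b))
        return results

    if __name__ == "__main__":
        sequences = ["ACGTACGTAC", "ACGTACGAAC", "TTGCACGTAC", "ACGTTTGCAC",
                     "GGGTACGTAC", "ACGTACGTTT", "CCGTACGTAC", "ACGAACGTAC"]
        chunks = partition(sequences, PARTITION_SIZE)
        with Pool() as pool:                 # one worker per CPU core
            per_chunk = pool.map(compare_chunk, chunks)
        print(per_chunk)

In such a setup, the partition size and worker count are exactly the kind of runtime configuration a predictive model could estimate before launching a cloud or cluster job.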

The fidelity of genomic data derived from next-generation sequencing techniques affects downstream applications such as genome-wide association studies (GWAS) and genome assembly. For imputation of missing genotype data, we design an accurate, fast, and lightweight algorithm for both model organisms (with a reference genotype panel) and non-model organisms (without a reference genotype panel). To correct erroneous long reads generated by emerging sequencing techniques, we formulate a hybrid correction algorithm that determines a correction policy based on an optimal combination of base quality and the similarity of aligned short reads. We extend the core algorithm with an iterative learning paradigm that further improves its performance.
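
To make the imputation problem concrete, here is a minimal, hypothetical sketch of similarity-based genotype imputation: a missing call is filled by a majority vote among the most similar samples at the observed markers. This is a generic nearest-neighbour scheme, not the algorithm developed in the thesis; the function names and the 0/1/2 genotype coding are assumptions.

    # Illustrative sketch only: impute a missing genotype by majority vote
    # among the k most similar samples. Not the thesis algorithm.
    from collections import Counter

    def similarity(a, b):
        """Count markers where both samples have observed, equal genotypes."""
        return sum(1 for x, y in zip(a, b)
                   if x is not None and y is not None and x == y)

    def impute(samples, target_idx, marker_idx, k=3):
        """Fill samples[target_idx][marker_idx] using the k nearest neighbours."""
        target = samples[target_idx]
        neighbours = sorted(
            (s for i, s in enumerate(samples)
             if i != target_idx and s[marker_idx] is not None),
            key=lambda s: similarity(target, s),
            reverse=True,
        )[:k]
        votes = Counter(s[marker_idx] for s in neighbours)
        return votes.most_common(1)[0][0] if votes else None

    # Genotypes coded 0/1/2 (reference, heterozygous, alternate); None = missing.
    panel = [
        [0, 1, 2, 0, 1],
        [0, 1, 2, 0, 2],
        [2, 0, 1, 1, 0],
        [0, 1, None, 0, 1],   # sample with a missing call at marker 2
    ]
    print(impute(panel, target_idx=3, marker_idx=2))   # votes of the neighbours give 2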

Our proposed data analysis framework is accessible to the scientific community and has been used to study the genomes of important plant species and malaria vector mosquitoes. The predictive models exhibit high accuracy in determining optimal parameters of operation on commercial cloud services such as Amazon EC2 and Microsoft Azure. Finally, the imputation and error correction algorithms outperform state-of-the-art alternatives when tested on real data sets of plants, malaria vector mosquitoes, and humans. Hence, in this thesis, we present novel solutions to expedite data-parallel genomic applications while optimizing cloud and cluster-based resource utilization. We also design novel, accurate, and efficient algorithms to impute missing data and correct erroneous data in emerging genomic applications.

History

Date Created

2017-07-07

Date Modified

2018-11-02

Defense Date

2017-06-15

Research Director(s)

Scott Emrich

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Program Name

  • Computer Science and Engineering
