University of Notre Dame
File(s) under permanent embargo

Robust Inference and Network Analysis for Non-Gaussian Gene-Expression Data

thesis
posted on 2017-04-13, 00:00 authored by Alicia T. Specht

This thesis develops methods for robust inference and network analysis of non-Gaussian data, focusing on the challenges specific to RNA-Sequencing (RNA-Seq) and single-cell RNA-Sequencing (scRNA-Seq) applications. It presents two methods for constructing gene coexpression networks (a robust method for RNA-Seq data and a method for estimating directed networks from scRNA-Seq data) as well as a novel method for testing differences in gene expression.

The first proposed method provides a new way of estimating the correlation of non-Gaussian data, which in turn can be used to infer gene function. The most straightforward way of constructing a coexpression network is to connect gene pairs whose expression levels are highly correlated across different experimental conditions. This correlation is usually measured by Pearson's correlation coefficient, which does not apply directly to data generated by RNA-Seq. RNA-Seq data are non-negative integer counts, which cannot be properly modeled by a Gaussian distribution. Moreover, the mean of each count is proportional to the sequencing depth of its sample, so there are no identically distributed 'replicates.' Directly normalizing the counts by the corresponding sequencing depths and then computing Pearson's correlation coefficient can be statistically inefficient. The proposed method, iCC, is a generalization of Pearson's correlation coefficient that can be applied directly to RNA-Seq data. On simulated data, it distinguishes coexpressed gene pairs from unrelated gene pairs more efficiently. On a real dataset, iCC generates a coexpression network that appears to agree more closely with experimentally validated networks than the networks produced by other methods. More generally, iCC can be used to calculate the correlation coefficient for any two series of random variables.
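To make the baseline in the paragraph above concrete, here is a minimal sketch of the naive approach it describes: divide the counts by the sequencing depths, compute Pearson's correlation, and connect gene pairs above a cutoff. This is not iCC, whose definition is not given in this abstract. The function names, the toy counts, and the 0.8 threshold are all illustrative assumptions.

```python
import numpy as np

def depth_normalized_pearson(counts, depths):
    """Pearson correlation between genes after dividing counts by depth.

    counts: array of shape (genes, samples) with raw read counts.
    depths: array of shape (samples,) with per-sample sequencing depths.
    """
    counts = np.asarray(counts, dtype=float)
    depths = np.asarray(depths, dtype=float)
    normalized = counts / depths  # broadcasts depths across the sample axis
    return np.corrcoef(normalized)  # rows (genes) are the variables

def coexpression_edges(corr, threshold=0.8):
    """Return gene pairs (i, j), i < j, with |correlation| >= threshold."""
    n = corr.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(corr[i, j]) >= threshold]

# Toy data (hypothetical): 3 genes, 4 samples, two of them sequenced twice as deep.
counts = np.array([
    [10, 40, 30, 80],
    [20, 80, 60, 160],
    [50, 10, 90, 20],
])
depths = np.array([1.0, 2.0, 1.0, 2.0])

corr = depth_normalized_pearson(counts, depths)
edges = coexpression_edges(corr, threshold=0.8)
print(edges)
```

In this toy example genes 0 and 1 are proportional after normalization, so they form the only edge. The sketch treats every normalized value as equally reliable regardless of depth, which is one plausible reading of the inefficiency the abstract mentions.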

The second proposed method constructs gene coexpression networks from single-cell RNA-Sequencing data. The algorithm, called LEAP (Lag-based Expression Associations for Pseudotime-series data), uses the estimated pseudotime of the cells to find gene coexpression that involves a time delay, building on traditional time-series analysis techniques. Conventional correlation-based gene coexpression networks (GCNs) describe only simultaneous coexpression. By using the time information that is available at virtually no extra cost in scRNA-Seq data, LEAP is able to capture associations that would otherwise be hidden by time lags. The asymmetric associations detected by LEAP are more likely to reflect regulatory relationships, because they describe which gene follows another in expression. Applied to a real dataset, LEAP not only identifies more true relationships than a traditional correlation-based network, but also captures directed, and thereby regulatory, relationships.
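The idea of a lag-based association over pseudotime can be sketched as follows. This is an illustrative simplification, not the LEAP implementation: it orders cells by pseudotime, correlates one gene with a delayed copy of another for each lag up to a maximum, and keeps the strongest value. The function names, the synthetic sine-wave genes, and the choice of `max_lag` are assumptions made for the example.

```python
import numpy as np

def max_lagged_correlation(x, y, max_lag):
    """Strongest Pearson correlation between x[t] and y[t + lag], lag = 0..max_lag.

    Returns (correlation, lag). Constant series are not handled in this sketch.
    """
    n = len(x)
    best_corr, best_lag = 0.0, 0
    for lag in range(max_lag + 1):
        r = np.corrcoef(x[: n - lag], y[lag:])[0, 1]
        if abs(r) > abs(best_corr):
            best_corr, best_lag = r, lag
    return best_corr, best_lag

def lagged_association_matrix(expr, pseudotime, max_lag):
    """Asymmetric matrix: entry (i, j) scores gene i leading gene j."""
    order = np.argsort(pseudotime)            # sort cells along pseudotime
    ordered = np.asarray(expr, dtype=float)[:, order]
    g = ordered.shape[0]
    assoc = np.zeros((g, g))
    for i in range(g):
        for j in range(g):
            if i != j:
                assoc[i, j], _ = max_lagged_correlation(ordered[i], ordered[j], max_lag)
    return assoc

# Synthetic example: gene B repeats gene A's pattern three cells later.
t = np.arange(40)
gene_a = np.sin(2 * np.pi * t / 20)
gene_b = np.roll(gene_a, 3)                   # exact delay because the period divides 40

# Shuffle the cells to mimic unordered single-cell measurements.
perm = np.random.default_rng(0).permutation(40)
expr = np.vstack([gene_a, gene_b])[:, perm]
pseudotime = t[perm]

assoc = lagged_association_matrix(expr, pseudotime, max_lag=3)
order = np.argsort(pseudotime)
_, lead_lag = max_lagged_correlation(expr[0, order], expr[1, order], max_lag=3)
print(assoc, lead_lag)
```

The resulting matrix is asymmetric: the score for A leading B is close to 1 at a lag of three cells, while the score for B leading A is noticeably lower, which is the kind of directional signal the paragraph above describes.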

Finally, the third method is a new way of detecting differentially expressed (DE) genes, which show different average expression levels in different sample groups and thus can be important biological markers. Many methods have been proposed for detecting DE genes, and they are generally very successful, but they need to be further tailored and improved for cancer data. Tumor samples often show highly diverse expression, with some samples appearing as extreme outliers, and this diversity is much larger than in the control group. The proposed method, DiPhiSeq, detects not only genes whose average expression differs between groups, but also genes whose diversity of expression differs between groups. These 'differentially dispersed' genes can be important clinical markers. DiPhiSeq applies a redescending penalty to the quasi-likelihood function, which makes it robust against outliers and other noise. Simulations and real data analysis demonstrate that DiPhiSeq outperforms existing methods in the presence of outliers and identifies unique sets of genes.
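To illustrate what "redescending" means, the sketch below uses Tukey's biweight, a standard redescending function whose influence falls to exactly zero for large residuals. The abstract does not say which penalty DiPhiSeq uses or how it enters the quasi-likelihood, so this example is an assumption chosen for illustration: it estimates a simple robust mean by iteratively reweighted averaging rather than fitting any gene-expression model. The tuning constant 4.685 is the conventional default for the biweight, and the data values are made up.

```python
import numpy as np

def tukey_biweight_psi(r, c=4.685):
    """Tukey biweight influence function: zero for |r| > c (redescending)."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= c, r * (1 - (r / c) ** 2) ** 2, 0.0)

def robust_location(x, c=4.685, n_iter=50):
    """Robust mean via iteratively reweighted averaging with biweight weights.

    Assumes the median absolute deviation of x is nonzero.
    """
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    scale = 1.4826 * np.median(np.abs(x - mu))  # MAD scaled for Gaussian consistency
    for _ in range(n_iter):
        r = (x - mu) / scale
        # Weight is psi(r) / r: it shrinks smoothly to zero and stays there beyond c.
        w = np.where(np.abs(r) <= c, (1 - (r / c) ** 2) ** 2, 0.0)
        mu = np.sum(w * x) / np.sum(w)
    return mu

# Hypothetical expression values with one extreme outlier, as in a tumor group.
values = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 250.0])

plain_mean = np.mean(values)
robust_mean = robust_location(values)
print(plain_mean, robust_mean)
```

The outlier pulls the ordinary mean far from the bulk of the data, while the robust estimate stays near 10 because the outlier receives zero weight. A non-redescending robust function such as Huber's would still give the outlier a bounded but nonzero influence.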


History

Date Created

2017-04-13

Date Modified

2018-10-30

Defense Date

2017-04-06

Research Director(s)

Jun Li

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Program Name

  • Applied and Computational Mathematics and Statistics
