University of Notre Dame

File(s) under permanent embargo

Machine Learning Methods for Single-Cell RNA-Sequencing Data Analysis

thesis
posted on 2021-04-12, 00:00 authored by Chuanqi Wang

The development of high-throughput RNA-sequencing (RNA-seq) techniques at the single-cell level enables the measurement of gene expression in a large number of individual cells simultaneously. Computational methods previously developed for regular (bulk) RNA-seq data analysis often have insufficient power on single-cell data because of its high sparsity and large variance. In this dissertation, we present three methods focusing on two central problems of single-cell RNA-seq (scRNA-seq) data analysis: classification of samples and differential expression (DE) analysis of genes.

The first method is a "scale-invariant" classifier that assigns cells or samples to groups based on their gene expression. Scaling by sequencing depth is usually the first step in any analysis of RNA-seq data, but sequencing depth can be difficult to estimate accurately, especially for single-cell data, and inaccurate estimates may compromise the validity of downstream analysis. We eliminate the need for scaling entirely and enable direct analysis of the original count data by developing a scale-invariant classifier, which gives the same result under different (arbitrary) estimates of sequencing depth. This deep-neural-network-based classifier also achieves higher classification accuracy than previous classifiers that require scaling.
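The scale-invariance property described above can be illustrated with a toy example. This is only a minimal sketch of what "same result under arbitrary rescaling" means, not the dissertation's deep-neural-network classifier: it swaps in a nearest-profile rule based on cosine similarity, and the names `CENTROIDS` and `classify` and all numbers are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; unchanged when u is multiplied by any positive constant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical reference expression profiles for two cell types (illustration only).
CENTROIDS = {
    "type_A": [10.0, 1.0, 1.0, 5.0],
    "type_B": [1.0, 8.0, 6.0, 1.0],
}

def classify(counts):
    """Return the label whose profile is most similar to the count vector.

    Assumes the count vector is not all zeros.
    """
    return max(CENTROIDS, key=lambda label: cosine(counts, CENTROIDS[label]))

cell = [20, 3, 0, 9]                  # raw counts for one cell
rescaled = [0.37 * c for c in cell]   # same cell under an arbitrary depth estimate

print(classify(cell), classify(rescaled))
```

Both calls return the same label because multiplying the input by a positive constant changes every similarity by the same factor's cancellation in the norm, so no estimate of sequencing depth is needed for this toy rule.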

The second method is proposed for DE analysis of scRNA-seq data. DE analysis identifies genes whose expression level differs between biological groups. Most existing approaches detect differences in mean expression level between groups; our approach, named FastDE, instead detects changes in distributions. Further, FastDE is capable of handling data from multiple nominal or ordinal biological conditions, and it provides an intuitive visualization to facilitate the interpretation of DE results.
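To see why a change in distribution can be invisible to a mean-based test, consider the small sketch below. It is not FastDE; it uses the two-sample Kolmogorov-Smirnov statistic as a stand-in for a distribution-level comparison, and the data and function names are made up for illustration.

```python
def ecdf(sample, x):
    """Empirical cumulative distribution function of `sample` evaluated at x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def ks_statistic(a, b):
    """Largest absolute gap between the two empirical CDFs."""
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))

# Hypothetical expression values of one gene in two groups of cells.
group_1 = [4, 5, 5, 5, 5, 6]       # unimodal, mean 5
group_2 = [0, 0, 0, 10, 10, 10]    # bimodal (on/off), mean 5

mean_diff = sum(group_1) / len(group_1) - sum(group_2) / len(group_2)
ks = ks_statistic(group_1, group_2)

print("mean difference:", mean_diff)
print("KS statistic:", ks)
```

Here the mean difference is zero, so a test on means has nothing to detect, while the KS statistic is 0.5 and reflects the shift from a unimodal to an on/off expression pattern.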

The third method is for post-clustering DE analysis of scRNA-seq data. Unlike DE analysis of bulk RNA-seq data, where the grouping of samples is known beforehand, in scRNA-seq the grouping of samples (i.e., cells) is computationally inferred using a clustering algorithm on the same data. This 'double use' of data creates a data-snooping problem and a high risk of an overwhelming number of false positives. To lower this risk, we propose a post-clustering DE test called 'DEPOC', which takes the uncertainty of clustering into account. DEPOC is implemented as a modification of the popular edgeR software for DE analysis, which makes it easy to adopt. We study the theoretical properties of DEPOC and show that DEPOC typically gives more conservative p-values than edgeR.
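The data-snooping problem can be reproduced in a few lines. The simulation below is a rough sketch under stated assumptions, not DEPOC or edgeR: it draws one gene with no true group structure, "clusters" the cells by splitting at the median of that same gene, and compares the resulting Welch t-statistic with the one from a random split. The median split is a deliberately simplified stand-in for a clustering algorithm.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch two-sample t-statistic (unequal variances)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        var_a / len(a) + var_b / len(b)
    )

rng = random.Random(0)
# One gene measured in 200 cells, all drawn from the same distribution (no true DE).
expr = [rng.gauss(0.0, 1.0) for _ in range(200)]

# "Clustering" on the same data: split cells at the median expression value.
cut = statistics.median(expr)
low = [x for x in expr if x <= cut]
high = [x for x in expr if x > cut]
t_clustered = welch_t(high, low)

# Control: groups assigned independently of the data.
shuffled = expr[:]
rng.shuffle(shuffled)
t_random = welch_t(shuffled[:100], shuffled[100:])

print("t after clustering on the same data:", round(t_clustered, 2))
print("t after a random split:", round(t_random, 2))
```

Because the groups were defined using the very values being tested, the first statistic is far larger than anything a valid null test would produce, which is the false-positive inflation that a post-clustering test needs to correct for.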

History

Date Modified

2021-05-21

Defense Date

2021-04-01

CIP Code

  • 27.9999

Research Director(s)

Jun Li

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Alternate Identifier

1251516164

Library Record

6022950

OCLC Number

1251516164

Program Name

  • Applied and Computational Mathematics and Statistics
