Machine Learning Methods for Single-Cell RNA-Sequencing Data Analysis

Wang, Chuanqi

doi:10.7274/m326m042z2v

File(s) under permanent embargo

thesis

posted on 2021-04-12, 00:00 authored by Chuanqi Wang

The development of high-throughput RNA sequencing (RNA-seq) at the single-cell level enables gene expression to be measured in a large number of individual cells simultaneously. Computational methods previously developed for conventional (bulk) RNA-seq data often have insufficient power on single-cell data, because single-cell data are highly sparse and have large variance. In this dissertation, we present three methods that address two central problems of single-cell RNA-seq (scRNA-seq) data analysis: classification of samples and differential expression (DE) analysis of genes.

The first method is a "scale-invariant" classifier that assigns cells/samples to groups based on their gene expression. Scaling by sequencing depth is usually the first step in any analysis of RNA-seq data, but sequencing depth can be difficult to estimate accurately, especially for single-cell data, and an inaccurate estimate may compromise the validity of downstream analysis. We eliminate the need for scaling entirely and allow the original count data to be analyzed directly by developing a scale-invariant classifier, which gives the same result under different (arbitrary) estimates of sequencing depth. This deep-neural-network-based classifier also achieves higher classification accuracy than previous classifiers that require scaling.
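
The scale-invariance property itself can be illustrated with a toy example. The sketch below is not the dissertation's deep neural network; it assumes a hypothetical nearest-centroid rule based on cosine similarity, chosen only because cosine similarity is unchanged when a cell's counts are divided by any positive depth estimate. The centroids, cell counts, and depth values are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity; unchanged if u or v is multiplied by a positive constant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(counts, centroids):
    """Assign the label whose centroid is most similar to the cell (illustrative rule)."""
    return max(centroids, key=lambda label: cosine(counts, centroids[label]))

# Hypothetical expression profiles over three genes for two cell types.
centroids = {"type_A": [10.0, 1.0, 1.0], "type_B": [1.0, 1.0, 10.0]}

# Raw counts for one cell, then the same cell under three arbitrary depth estimates.
cell = [40, 3, 5]
labels = [classify([c / depth for c in cell], centroids) for depth in (0.5, 1.0, 7.3)]
print(labels)
```

Because the predicted label is the same for every depth estimate, no depth estimate is needed at all, which is the property the abstract describes.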

The second method is proposed for DE analysis of scRNA-seq data. DE analysis identifies genes whose expression levels differ between biological groups. Whereas most existing approaches detect differences in mean expression between groups, our approach, named FastDE, instead detects changes in distribution. Further, FastDE can handle data from multiple nominal or ordinal biological conditions, and it provides an intuitive visualization to facilitate the interpretation of DE results.
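
The distinction between a mean difference and a distributional difference can be seen in a small example. The abstract does not state which statistic FastDE uses, so the sketch below substitutes the two-sample Kolmogorov-Smirnov statistic as a generic stand-in; the expression values are invented for illustration.

```python
def ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two ECDFs."""
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))

# Hypothetical expression of one gene in two groups of six cells each.
group_a = [2, 2, 2, 2, 2, 2]   # uniformly moderate expression
group_b = [0, 0, 0, 4, 4, 4]   # half the cells silent, half highly expressed

mean_diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
ks = ks_statistic(group_a, group_b)
print(mean_diff, ks)
```

Both groups have the same mean, so a test on mean expression sees nothing, while the distribution-based statistic is clearly nonzero.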

The third method we present is for post-clustering DE analysis of scRNA-seq data. Unlike DE analysis of bulk RNA-seq data, where the grouping of samples is known beforehand, in scRNA-seq analysis the grouping of samples (i.e., cells) is computationally inferred using a clustering algorithm on the same data. This "double use" of the data creates a data-snooping problem and a high risk of an overwhelming number of false positives. To lower this risk, we propose a post-clustering DE test called "DEPOC", which takes into account the uncertainty of clustering. DEPOC is implemented as a modification of the popular edgeR software for DE analysis, which makes it easy to use. We study the theoretical properties of DEPOC and show that DEPOC typically gives more conservative p-values than edgeR.
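
The data-snooping problem can be reproduced in a few lines. The sketch below is not DEPOC and does not use edgeR; it assumes Gaussian expression values for a single gene, uses a median split as a crude stand-in for a clustering algorithm, and uses Welch's t statistic as a stand-in DE test. All names and settings are illustrative.

```python
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic for two samples."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

rng = random.Random(0)

# One gene, 200 cells, all drawn from the SAME distribution: no true DE exists.
expression = [rng.gauss(0.0, 1.0) for _ in range(200)]

# "Clustering" stand-in: split the cells at the median of the same gene.
cutoff = statistics.median(expression)
low = [x for x in expression if x <= cutoff]
high = [x for x in expression if x > cutoff]
t_clustered = welch_t(high, low)

# Control: groups assigned independently of the expression values.
shuffled = expression[:]
rng.shuffle(shuffled)
t_random = welch_t(shuffled[:100], shuffled[100:])

print(t_clustered, t_random)
```

Testing between groups that were defined from the same data yields a very large statistic even though every cell comes from one population, whereas the data-independent split does not. A post-clustering test has to correct for this inflation.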

History

Date Modified

2021-05-21

Defense Date

2021-04-01

CIP Code

27.9999

Research Director(s)

Jun Li

Degree

Doctor of Philosophy

Degree Level

Doctoral Dissertation

Alternate Identifier

1251516164

Library Record

6022950

OCLC Number

1251516164

Program Name

Applied and Computational Mathematics and Statistics

Keywords

Not Assigned
