University of Notre Dame

Statistical Machine Learning for Single-Cell RNA-Sequencing Data: Sparse Clustering for Cell Type Identification and Confounding Factor Removal

thesis
posted on 2018-06-29, 00:00 authored by Martin P. Barron

The advent of single cell RNA-sequencing (scRNA-seq) has opened up the gates to a plethora of new research avenues. We stand to make great leaps forward in the understanding of how multi-celled organisms function by analyzing their component parts. We can now analyze the transcriptome landscape of developing stem cells, study how cancer cells change after treatment, examine how the body reacts to disease on a cellular level and create detailed taxonomies of the cellular composition of organisms.

However, considerable challenges remain before the full potential of scRNA-seq can be realized, including the high-dimensional nature of the data, noise, confounding factors and a lack of suitable methods to adequately compare populations of cells. While there is ongoing development and refinement of sequencing procedures to reduce sources of technical noise and bias, this thesis focuses on the issues facing the computational analysis of scRNA-seq data. Data from scRNA-seq experiments typically contain 100-10,000 cells and measurements for about 20,000-50,000 genes. This creates a large \textit{p}, small \textit{n} problem, in which conventional clustering algorithms become ineffective due to the curse of dimensionality. A related issue is the proportion of noise in the data, and that the expression of many genes may be unrelated to the information we are trying to extract. Confounding factors such as the cell-cycle can affect the expression of many genes, masking subpopulations in the data. Existing clustering algorithms are unsuitable for the simultaneous analysis of more than one population at a time; that is, clusters found in separate populations must be linked after the clustering has been completed, which makes tracing subpopulations of cells through a condition change very challenging. In particular, if the transcriptome profile of a subpopulation of cells has changed due to the condition change, it is difficult to determine whether that subpopulation has died out or survived the condition change.

This work will present two methods developed to improve the analysis of scRNA-seq data. The first of these, a sparse differential clustering algorithm, SparseDC, is the first clustering algorithm which is suitable for analyzing two populations in a high dimensional setting. This algorithm clusters cells from two populations simultaneously, identifies a unique set of characteristic markers for each of the clusters, links clusters which are present in both conditions and detects features which describe how the transcriptome profile has changed with the condition change. Furthermore, SparseDC only uses a subset of the total features in its solution allowing it to avoid the curse of dimensionality and work effectively when faced with data with many features such as scRNA-seq data. SparseDC uses a modified K-means clustering algorithm with added $\ell_1$ penalties to enforce sparsity in the marker genes and drive similar clusters from different conditions together.
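To illustrate how an $\ell_1$ penalty can enforce sparsity in cluster centers, the following is a minimal sketch of a single-condition K-means variant with soft-thresholded centroids. It is not the SparseDC algorithm itself: it omits the second penalty that links clusters across two conditions, and the function names (`soft_threshold`, `sparse_kmeans`) and the parameter `lam` are hypothetical choices made for this illustration. It assumes the data are centered per gene, so that a centroid coordinate of exactly zero means the gene is not a marker for that cluster.

```python
import numpy as np


def soft_threshold(x, t):
    """Shrink each entry of x toward zero by t, setting small entries to exactly zero."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def sparse_kmeans(X, k, lam, n_iter=50, seed=0):
    """Illustrative l1-penalized K-means (an assumption-laden sketch, not SparseDC).

    X   : cells x genes array
    k   : number of clusters
    lam : l1 penalty weight on the centroid coordinates (hypothetical name)
    """
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)  # center each gene so that zero means "no marker signal"
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest centroid in squared Euclidean distance.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: minimizing sum_i (x_ij - mu_j)^2 + lam * |mu_j| over a
        # cluster of size n gives the cluster mean soft-thresholded at lam / (2 n).
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # leave an empty cluster's centroid unchanged
            centers[j] = soft_threshold(members.mean(axis=0), lam / (2 * len(members)))
    return labels, centers
```

With `lam = 0` this reduces to ordinary K-means on the centered data; increasing `lam` sets more centroid coordinates to exactly zero, so fewer genes contribute to separating the clusters.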

The second method presented is a tool for identifying the cell-cycle effect in scRNA-seq data and removing it to improve downstream analysis. This method, ccRemover, makes use of prior knowledge about which genes are related to the cell-cycle. It first captures the sources of variation in the data from genes which are not annotated to the cell-cycle, then uses the cell-cycle genes to test which sources of variation are related to the cell-cycle. These cell-cycle related signals are then removed from the data, while other biological signals of interest are preserved.
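The three steps described above (capture variation from non-cell-cycle genes, check which sources of variation load more heavily on cell-cycle genes, remove those sources) can be sketched as follows. This is a simplified illustration rather than the ccRemover implementation: the principal components, the crude ratio cutoff that stands in for a proper statistical test, and all names (`remove_flagged_components`, `is_cc`, `ratio_cutoff`, `n_comp`) are assumptions made for this example.

```python
import numpy as np


def remove_flagged_components(X, is_cc, ratio_cutoff=2.0, n_comp=5):
    """Illustrative removal of components that load heavily on cell-cycle genes.

    X            : genes x cells expression matrix
    is_cc        : boolean mask over genes, True for annotated cell-cycle genes
    ratio_cutoff : hypothetical cutoff standing in for a statistical test
    n_comp       : number of leading components to examine
    """
    Xc = X - X.mean(axis=1, keepdims=True)  # center each gene across cells
    # Step 1: sources of variation estimated from the non-cell-cycle genes only.
    _, _, Vt = np.linalg.svd(Xc[~is_cc], full_matrices=False)
    V = Vt[:n_comp].T  # cells x n_comp, orthonormal columns
    # Step 2: compare how strongly each gene group loads on each component.
    loadings = Xc @ V  # genes x n_comp
    cc_load = np.abs(loadings[is_cc]).mean(axis=0)
    ctl_load = np.abs(loadings[~is_cc]).mean(axis=0)
    flagged = cc_load > ratio_cutoff * ctl_load
    # Step 3: project the flagged components out of every gene's expression.
    Vf = V[:, flagged]
    cleaned = Xc - (Xc @ Vf) @ Vf.T
    return cleaned, flagged, V
```

Because the components are estimated from genes not annotated to the cell-cycle, a component is flagged only when the cell-cycle genes load on it more strongly than those control genes do, which is the intuition behind preserving other biological signals.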

The effectiveness of both methods is demonstrated through simulation and real data applications. These show how both methods can enhance the analysis of scRNA-seq data and help unlock its full potential. Both methods are also applicable to a range of other problems. ccRemover can be applied in any situation where an unwanted source of variation is known to have a stronger effect on a subset of features. SparseDC is applicable to any situation where the simultaneous analysis of multiple populations consisting of independent observations is undertaken, such as examining data on individuals from different cities or countries, or stock market data taken from an index at different periods of time.

History

Date Created

2018-06-29

Date Modified

2018-11-08

Defense Date

2018-06-14

Research Director(s)

Jun Li

Committee Members

Fang Liu, Steven Buechler

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Program Name

  • Applied and Computational Mathematics and Statistics
