Statistical Machine Learning for Single-Cell RNA-Sequencing Data: Sparse Clustering for Cell Type Identification and Confounding Factor Removal

Doctoral Dissertation


The advent of single-cell RNA-sequencing (scRNA-seq) has opened up a plethora of new research avenues. We stand to make great leaps forward in understanding how multicellular organisms function by analyzing their component parts. We can now analyze the transcriptome landscape of developing stem cells, study how cancer cells change after treatment, examine how the body reacts to disease at the cellular level, and create detailed taxonomies of the cellular composition of organisms.

However, considerable challenges remain before the full potential of scRNA-seq can be realized, including the high-dimensional nature of the data, noise, confounding factors and a lack of suitable methods for comparing populations of cells. While sequencing procedures continue to be developed and refined to reduce sources of technical noise and bias, this thesis focuses on the issues facing the computational analysis of scRNA-seq data. Data from scRNA-seq experiments typically contain 100-10,000 cells and measurements for about 20,000-50,000 genes. This leads to a large \textit{p}, small \textit{n} problem, in which conventional clustering algorithms become ineffective due to the curse of dimensionality. A related issue is the proportion of noise in the data: the expression of many genes may be unrelated to the information we are trying to extract. Confounding factors such as the cell-cycle can affect the expression of many genes, masking subpopulations in the data. Existing clustering algorithms are unsuitable for analyzing more than one population at a time; clusters found in separate populations must be linked after the clustering has been completed, which makes tracing subpopulations of cells through a condition change very challenging. In particular, if the transcriptome profile of a subpopulation of cells has changed due to the condition change, it is difficult to determine whether that subpopulation has died out or survived.

This work presents two methods developed to improve the analysis of scRNA-seq data. The first of these, a sparse differential clustering algorithm called SparseDC, is the first clustering algorithm suitable for analyzing two populations in a high-dimensional setting. This algorithm clusters cells from two populations simultaneously, identifies a unique set of characteristic markers for each of the clusters, links clusters which are present in both conditions and detects features which describe how the transcriptome profile has changed with the condition change. Furthermore, SparseDC uses only a subset of the total features in its solution, allowing it to avoid the curse of dimensionality and work effectively on data with many features, such as scRNA-seq data. SparseDC uses a modified K-means clustering algorithm with added $\ell_1$ penalties to enforce sparsity in the marker genes and drive similar clusters from different conditions together.
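To illustrate how an $\ell_1$ penalty produces sparse cluster centers, the following is a minimal sketch and not the SparseDC implementation. It assumes the data have been centered gene by gene and considers a single cluster with the hypothetical objective $\sum_i \|x_i - \mu\|_2^2 + \lambda \|\mu\|_1$, whose minimizer is the soft-thresholded cluster mean. The function names and the parameter `lam` are illustrative, and the second penalty that links clusters across conditions is omitted.

```python
import numpy as np


def soft_threshold(x, t):
    """Shrink each entry of x toward zero by t; entries with |x| <= t become 0."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def sparse_center(X, lam):
    """Center of one cluster under an l1 penalty (illustrative assumption).

    X   : array of shape (n_cells, n_genes), cells assigned to the cluster
    lam : l1 penalty weight (hypothetical name)

    Minimizes sum_i ||x_i - mu||^2 + lam * ||mu||_1 over mu. Setting the
    subgradient to zero gene by gene gives mu = S(mean, lam / (2 n)).
    """
    n = X.shape[0]
    return soft_threshold(X.mean(axis=0), lam / (2.0 * n))


# Two cells, two genes: gene 0 has a mean of zero, gene 1 a mean of 3.
X_demo = np.array([[0.1, 2.0],
                   [-0.1, 4.0]])
center_demo = sparse_center(X_demo, lam=4.0)  # threshold = 4 / (2 * 2) = 1
```

Genes whose within-cluster mean is small relative to the threshold receive a center of exactly zero, so they drop out of the solution; the remaining genes act as candidate markers for that cluster.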

The second method presented is a tool for identifying the cell-cycle effect in scRNA-seq data and removing it to improve downstream analysis. This method, ccRemover, makes use of prior knowledge about which genes are related to the cell-cycle. It first captures the sources of variation in the data from genes which are not annotated to the cell-cycle, then uses the cell-cycle genes to test which of these sources of variation are related to the cell-cycle. These cell-cycle related signals are then removed from the data, while other biological signals of interest are preserved.
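The steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the ccRemover implementation: the sources of variation are taken to be principal components of the non-cell-cycle ("control") genes, and the statistical test is replaced by a crude comparison of mean absolute loadings governed by a hypothetical `ratio_cutoff` parameter. The function name `remove_cell_cycle` and the synthetic data are likewise illustrative.

```python
import numpy as np


def remove_cell_cycle(X, is_cc, ratio_cutoff=2.0):
    """Remove components that load much more heavily on cell-cycle genes.

    X            : array of shape (n_genes, n_cells), expression values
    is_cc        : boolean mask of length n_genes, True for cell-cycle genes
    ratio_cutoff : hypothetical stand-in for a proper statistical test
    """
    X = X - X.mean(axis=1, keepdims=True)        # center each gene
    # 1. Capture sources of variation using control genes only.
    _, s, Vt = np.linalg.svd(X[~is_cc], full_matrices=False)
    Vt = Vt[s > 1e-10 * s[0]]                    # drop numerically null components
    # 2. Load every gene onto those components.
    loadings = X @ Vt.T                          # shape (n_genes, n_components)
    cc_mag = np.abs(loadings[is_cc]).mean(axis=0)
    ctl_mag = np.abs(loadings[~is_cc]).mean(axis=0)
    # 3. Flag components that are stronger in cell-cycle genes.
    flagged = cc_mag > ratio_cutoff * ctl_mag
    # 4. Project the flagged components out of all genes.
    V = Vt[flagged].T                            # shape (n_cells, n_flagged)
    return X - X @ V @ V.T, flagged


# Noise-free synthetic data: a cell-cycle signal and an unrelated signal.
t = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
cc_signal, other_signal = np.sin(t), np.cos(2.0 * t)
weights = np.array([1.0, -1.0, 2.0, -2.0, 1.5, -1.5])
control_rows = 0.2 * cc_signal + np.outer(weights, other_signal)
cc_rows = np.outer([3.0, 4.0, 3.0, 4.0], cc_signal)
X_demo = np.vstack([control_rows, cc_rows])
is_cc_demo = np.array([False] * 6 + [True] * 4)

cleaned, flagged = remove_cell_cycle(X_demo, is_cc_demo)
```

In this toy example the cell-cycle signal is weak but present in the control genes, so it appears as one of their components; it is flagged because the cell-cycle genes load on it far more strongly, and removing it leaves the unrelated signal intact.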

The effectiveness of both methods is demonstrated through simulation and real data applications. These show how both methods can enhance the analysis of scRNA-seq data and help unlock its full potential. Both methods are also applicable to a range of other problems. ccRemover can be applied in any situation where an unwanted source of variation is known to have a stronger effect on a subset of features. SparseDC is applicable to any situation where multiple populations consisting of independent observations are analyzed simultaneously, such as examining data on individuals from different cities or countries, or stock market data taken from an index at different periods of time.


Attribute Name Values
Author Martin P. Barron
Contributor Fang Liu, Committee Member
Contributor Steven Buechler, Committee Member
Contributor Jun Li, Research Director
Degree Level Doctoral Dissertation
Degree Discipline Applied and Computational Mathematics and Statistics
Degree Name PhD
Defense Date 2018-06-14
Submission Date 2018-06-29
Subject
  • Confounding factor removal
  • Sparse Clustering
  • Single-cell RNA-sequencing

Record Visibility Public
Content License All rights reserved
