Publications
Single Cell Transcriptomic Analytics: Methods Development, Benchmarking, and Applications in Biological Research
Abstract
Single-cell RNA sequencing (scRNA-seq) technology has recently emerged as a powerful tool for surveying cell types and state transitions over thousands of individual cells to answer long-standing questions in organ development, maintenance, and disorders such as infertility and cancer. The more recent development of spatial transcriptomics technologies has further advanced our understanding of cells' spatial relations by profiling gene expression and spatial locations simultaneously for over 10^2-10^3 molecular entities. Importantly, these combined multidimensional single-cell technologies have provided unprecedented opportunities to resolve the cellular and structural heterogeneity of tissues. Meanwhile, the high-dimensional and sparse count data generated by single-cell methods bring new computational and statistical challenges for data analytics. Therefore, my research centers on applying scRNA-seq and spatial transcriptomic technologies to understand the functional heterogeneity of complex tissues, while addressing many of the data science challenges surrounding analysis and data integration. My dissertation describes four major efforts (Chapters 2-5). First, I performed a comparative single-cell RNA-seq analysis of the spermatogenesis program in mammals (mouse, monkey, human) (Chapter 2). By computationally aligning germ cell states across species, I was able to uncover conserved and divergent genes in the gametogenesis program, which gave new insights into the core mammalian spermatogenesis program and species-specific functions. To extend our scRNA-seq discoveries and test new hypotheses, we began to apply spatial transcriptomics methods to testis samples (Chapter 5). Because there is no consensus workflow for analyzing imaging-based data with combinatorial barcoding, I built a data processing pipeline to analyze highly multiplexed single-molecule in situ hybridization data.
The pipeline includes a customized experimental workflow able to characterize hundreds of genes, diverse cell types, and their spatial positions. While analyzing scRNA-seq data, I encountered several new computational challenges and developed several new methods to address them. In Chapter 3 I describe a method for computationally estimating a cell's "stemness" and for correcting the bias due to "library size". Stem cells, progenitor cells, and differentiated cells implement different global gene regulation strategies, as reflected by the distribution of transcript levels: either spread broadly over many genes or narrowly focused on fewer genes. I adopted the Gini index, a measure of inequality in transcript counts, as a new measure of each cell's stemness. Not surprisingly, the Gini index is affected by the variation in total transcript counts among cells, as is the case for many other statistics. Through statistical modeling and simulation, I corrected the technical bias and recovered the true Gini index. As single-cell technologies are still new, there has been an exponential increase in computational tools. However, the community still struggles to identify the best tools for a given problem. Chapter 4 describes my effort to systematically benchmark scRNA-seq clustering tools. Using in silico simulation, I generated an ensemble of datasets with varying statistical properties and known cluster structures. By testing them on a series of modifiable workflows, I identified the strengths and weaknesses of current clustering tools for data with distinct properties and for specific data processing methods. Together, this dissertation represents efforts in developing and advancing computational scRNA-seq analytical tools to comprehensively explore the mechanism and evolution of the spermatogenesis program in mammals.
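As a rough illustration of the Gini index idea described in the abstract, the sketch below computes the standard rank-based Gini index for one cell's transcript counts and shows how the value shifts when the same cell is sequenced more shallowly. The abstract does not give the dissertation's estimator or its bias-correction model, so everything here is an assumption: the function names `gini_index` and `downsample`, the simulated gamma/Poisson cell, and the use of binomial thinning to mimic a smaller library size are all hypothetical and are not the author's method.

```python
import numpy as np


def gini_index(counts):
    """Gini index of one cell's transcript counts (0 = even, near 1 = concentrated)."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    # Standard rank-based formula for sorted, non-negative values.
    return float(2.0 * np.sum(ranks * x) / (n * total) - (n + 1) / n)


def downsample(counts, fraction, rng):
    """Binomial thinning: keep each transcript independently with probability `fraction`."""
    return rng.binomial(np.asarray(counts, dtype=np.int64), fraction)


rng = np.random.default_rng(0)

# Hypothetical cell: 2000 genes with gamma-distributed expression rates (assumed).
rates = rng.gamma(shape=0.5, scale=100.0, size=2000)
deep_cell = rng.poisson(rates)

# The same cell observed at roughly 2% of the original library size.
shallow_cell = downsample(deep_cell, 0.02, rng)

print("library size:", deep_cell.sum(), "->", shallow_cell.sum())
print("Gini index:  ", gini_index(deep_cell), "->", gini_index(shallow_cell))
```

In this toy setting the shallow copy of the cell has many more zero counts, so its Gini index comes out higher even though the underlying expression profile is unchanged. That dependence on total counts is the kind of technical bias the abstract says was modeled and corrected.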
Product Used
Oligo Pools
Related Publications