Mapping the impact of genetic variation at a single-cell resolution using high dimensional Gaussian process models
- Project No: NCKIR4
- Intake: 2022 KIR Non Clinical
Supervisor: Luke Jostins-Dean
Co-Supervisor: Yang Luo
A key goal in the study of human disease genetics is to understand the impact of genetic variation on the phenotype of individual cells. Single-cell sequencing techniques, such as single-cell RNA-seq (scRNA-seq), allow us to characterise gene expression and regulation at very high resolution. When applied to, for instance, inflammatory bowel disease (IBD), these experimental techniques have identified new cell types that are dysregulated in disease and have the potential to shed light on the cell types where genetic risk variants are active. New studies are coming online that integrate data across hundreds of genotyped individuals in order to directly study how genetic variants impact gene expression in fine-grained cell types, by discovering single-cell expression quantitative trait loci (eQTLs).
However, in practice single-cell sequencing data is sparse, with only a handful of observations per gene in any given cell, which introduces noise and censoring. Standard analysis approaches pool data across cells with similar expression profiles, which overcomes these issues but only at the expense of single-cell resolution. In this project, the student will develop new computational approaches that share data across cells without collapsing them into clusters. This will allow us to map the impact of genetic variants at a truly single-cell level and to answer the direct question “what is the impact of genetic variant X on expression of gene Y in specific cell Z?”.
To answer this question, the student will develop a nonparametric statistical framework to allow true single-cell parameter estimation from single-cell sequencing data. High dimensional Gaussian process models have successfully been used to model single-cell data. By setting a prior on the similarity of cell states as a function of their distance on an underlying latent space, the Gaussian process approach allows information to be shared across cells while still modelling each as a unique data point. The student will extend this approach to produce a generative model of single-cell gene expression across multiple individuals, with genotype-driven, individual-driven and cell-intrinsic sources of variation. This model will be used to estimate key parameters of biological interest (such as the predicted impact of a specific genetic variant on expression of a gene in a specific cell). The student will then apply this method to a range of datasets, including circulating immune cells and intestinal biopsy data, from both health and disease, in order to fine-map the impact of disease-associated genetic variants to individual cells. Further bioinformatic analysis of identified cells will be used to characterise the molecular pathways that these variants impact, to develop hypotheses for follow-up experiments with collaborators, and to suggest potential cell types and molecules as drug targets.
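To make the information-sharing idea concrete, the following is a minimal sketch in Python (NumPy only) and not the model the project will develop. It assumes a one-dimensional latent space, a squared-exponential kernel with fixed hyperparameters, and a simulated genotype effect that varies smoothly across cell states; all names (`rbf_kernel`, `gp_posterior_mean`, `true_effect`) and all numerical settings are illustrative assumptions. Each cell keeps its own estimate, but cells that sit close together on the latent space borrow strength from one another.

```python
import numpy as np


def rbf_kernel(z1, z2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of latent cell positions."""
    sq_dist = np.sum((z1[:, None, :] - z2[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)


def gp_posterior_mean(z, y, noise_var, lengthscale=1.0, variance=1.0):
    """Posterior mean of a zero-mean GP at the observed cells.

    z: (n_cells, n_latent_dims) latent positions
    y: (n_cells,) noisy per-cell measurements
    """
    K = rbf_kernel(z, z, lengthscale, variance)
    # Solve (K + noise_var * I) alpha = y via a Cholesky factorisation.
    L = np.linalg.cholesky(K + noise_var * np.eye(len(z)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return K @ alpha


# Simulated data (assumption): 200 cells on a 1-D latent axis, with a
# genotype effect on expression that changes smoothly along that axis.
rng = np.random.default_rng(0)
n_cells = 200
z = rng.uniform(-3.0, 3.0, size=(n_cells, 1))
true_effect = np.sin(z[:, 0])

# Each cell yields only a very noisy measurement of its own effect.
noise_sd = 1.0
noisy_effect = true_effect + rng.normal(0.0, noise_sd, size=n_cells)

# Per-cell estimates that share information with neighbouring cells.
smoothed_effect = gp_posterior_mean(z, noisy_effect, noise_var=noise_sd ** 2)

mse_raw = np.mean((noisy_effect - true_effect) ** 2)
mse_gp = np.mean((smoothed_effect - true_effect) ** 2)
print(f"MSE of raw per-cell estimates: {mse_raw:.3f}")
print(f"MSE of GP-smoothed estimates:  {mse_gp:.3f}")
```

In this toy setting the smoothed estimates should track the simulated effect more closely than the raw per-cell values, without any cell being assigned to a cluster. A realistic model would also need to learn the latent space and kernel hyperparameters, use a count likelihood, and separate genotype-driven, individual-driven and cell-intrinsic variation, none of which is attempted here.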
Genetics, statistics, single-cell, machine learning, eQTLs
This project is well suited to a student with a background in statistical genetics, or a background in statistical modelling or machine learning who is interested in developing applied knowledge in the biological sciences. Training will be provided in R programming, if required, and in the principles and theory of statistical genetics and genome analysis. There will be opportunities to collaborate with both computational and experimental scientists, as well as receive training in cutting-edge analysis techniques for high-throughput genetic and genomic datasets.
- Gutierrez-Arcelus et al. (2020) Allele-specific expression changes dynamically during T cell activation in HLA and other autoimmune loci. Nat Genet 52(3):247–253. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7135372/
- Oelen et al. (2021) Single-cell RNA-sequencing reveals widespread personalized, context-specific gene expression regulation in immune cells. bioRxiv. https://doi.org/10.1101/2021.06.04.447088
- Jostins et al. (2012) Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature 491(7422):119–24.
- Verma and Engelhardt (2019) A robust nonlinear low-dimensional manifold for single cell RNA-seq data. bioRxiv. https://doi.org/10.1101/443044
- Titsias and Lawrence (2010) Bayesian Gaussian Process Latent Variable Model. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:844–851. http://proceedings.mlr.press/v9/titsias10a/titsias10a.pdf
Inflammation biology, computational biology