Week 11: Wrap Up And Summary
May 19, 2023
Welcome to Week 11 of my Senior Project Blog! This week I will summarize my senior project and cover the next steps/future work.
My senior project focuses on genotype imputation and its application to studying gene loci for heritable diseases such as Coronary Artery Disease (CAD). CAD is the single largest cause of mortality and disability, accounting for over 7 million deaths annually and more than a third of deaths in individuals older than 35 years of age. Predicting an individual’s risk for CAD can serve as an early warning signal and help prevent or delay the onset of the disease.
Genotype imputation is the process of using the observed genetic sequence for a small fraction (typically 1–10%) of genetic variants to infer the genotypes at the remaining known sites of genetic variation across the entire genome, like a genetic Sudoku game. Because sequencing the genomes of an entire population is hugely expensive, genotype imputation is an essential tool for implementing genotype-based population health initiatives.
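To make the Sudoku analogy concrete, here is a minimal sketch of the input an imputation method sees. The genotypes are randomly generated for illustration (not real data), and `mask_genotypes` is a hypothetical helper, not part of our lab's pipeline.

```python
import random

def mask_genotypes(genotypes, observed_fraction, rng):
    """Keep a random subset of sites and hide the rest (None = unobserved)."""
    n_sites = len(genotypes)
    n_observed = max(1, round(n_sites * observed_fraction))
    observed_sites = set(rng.sample(range(n_sites), n_observed))
    return [g if i in observed_sites else None for i, g in enumerate(genotypes)]

rng = random.Random(0)
# Made-up genotypes: alternate-allele counts (0, 1, or 2) at 20 variant sites.
true_genotypes = [rng.choice([0, 1, 2]) for _ in range(20)]
# Observe only 10% of the sites, as a genotyping array might.
masked = mask_genotypes(true_genotypes, observed_fraction=0.10, rng=rng)
n_missing = sum(g is None for g in masked)
print(masked)
print(f"{n_missing} of {len(masked)} sites must be imputed")
```

The imputation model's job is to fill in every `None` using patterns it learned from a reference panel.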
The classic approach to genotype imputation uses probabilistic Hidden Markov Models (HMMs) based on large-scale whole genome sequence reference panels. This process is computationally intensive, requires specialized high-performance computing clusters, and requires the raw genomic data of individuals to be shared, which raises privacy and scalability concerns. Recently, a special type of artificial neural network called an autoencoder has been identified as a promising alternative to HMM-based algorithms, based on its ability to fill in missing data in digital image restoration. Autoencoders are neural networks trained to reconstruct their input, which makes them useful for generating compressed representations, de-noising, and de-masking. Our lab has created a model for highly accurate genotype imputation using denoising autoencoders (DAEs), which have much higher accuracy and are 4 times faster than prior methods, without requiring the distribution or sharing of sensitive genomic data.
In my senior project, I analyzed two classes of pre-trained models that were developed in our lab at Scripps Research: one for the imputation of human Chromosome-22 (one of the smallest of our chromosomes, which holds several genes relevant to the immune system, congenital heart disease, schizophrenia, birth defects, and cancers) and one for genetic loci for CAD across the entire genome. The imputation DAE models for both the CAD loci and the Chromosome-22 regions were constructed on tiles, or fragments, of the relevant portion of the genome, resulting in 256 tiles for Chromosome-22 and 184 for CAD.
The models for Chromosome-22 and CAD were trained on the Haplotype Reference Consortium (HRC) panel, with genotype data from 32K individuals. The validation datasets were from the Atherosclerosis Risk in Communities Study (ARIC), and the test datasets were from the Multi-Ethnic Study of Atherosclerosis (MESA), the Human Genome Diversity Project (HGDP), and the Wellderly study from Scripps.
I analyzed various structural properties of the models, such as the number of layers in each autoencoder, the activation functions used for the layers, and the size ratio (the ratio of the number of neurons in the bottleneck layer to the number in the input layer). For the Chromosome-22 models, ~240 of the 256 models do not reduce the dimensions of their inputs (there is no hourglass shape in the autoencoder structure), and ~88% of the models have 4 or 6 layers. In the case of CAD, up to 10 of the most accurate models were kept for each tile, for a total of 1793 models across the entire genome. ~65% of these models do not reduce the dimensions of their inputs, meaning the network has no hourglass shape, and ~40% of the models have 8 layers. Examining the structure of the models gave me insight into whether each model created a sparse latent representation (the compressed output of the last encoder layer) or did not produce a sparser representation at all.
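The structural checks above can be sketched in a few lines. This is only an illustration: the layer widths are made up, `summarize_autoencoder` is a hypothetical helper, and it assumes the bottleneck is the middle hidden layer of a symmetric network and that "layers" counts weight layers.

```python
def summarize_autoencoder(layer_widths):
    """Summarize one autoencoder from its layer widths, listed input to output."""
    hidden_widths = layer_widths[1:-1]
    if not hidden_widths:
        raise ValueError("expected at least one hidden layer")
    # Assumption: the bottleneck is the middle hidden layer (symmetric network).
    bottleneck_width = hidden_widths[len(hidden_widths) // 2]
    size_ratio = bottleneck_width / layer_widths[0]
    return {
        "n_layers": len(layer_widths) - 1,  # assumption: count weight layers
        "size_ratio": size_ratio,
        "reduces_dimensions": size_ratio < 1.0,
    }

# Made-up widths for two tiles; real values would be read from the saved models.
models = {
    "tile_a": [100, 50, 25, 50, 100],     # hourglass: bottleneck narrower than input
    "tile_b": [100, 100, 100, 100, 100],  # no compression
}
summaries = {name: summarize_autoencoder(widths) for name, widths in models.items()}
n_flat = sum(not s["reduces_dimensions"] for s in summaries.values())
print(summaries)
print(f"{n_flat} of {len(summaries)} models do not reduce the input dimensions")
```

A size ratio below 1 indicates the hourglass shape; a ratio of 1 or more means the latent layer is at least as wide as the input.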
To examine the latent representations of each model, I then ran the processing pipeline on the validation datasets and used the forward hooks available in PyTorch to capture the embeddings (latent representations) while the models ran. I successfully extracted the latent representations and stored them for further processing, such as use in a generalized model that predicts an individual's risk score for developing CAD from the imputed genomic data.
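Here is a minimal sketch of how a forward hook captures a latent representation. `TinyAutoencoder` is a toy stand-in with arbitrary layer sizes and random inputs, not one of the lab's pre-trained models; only `register_forward_hook` and the hook signature come from PyTorch itself.

```python
import torch
from torch import nn

class TinyAutoencoder(nn.Module):
    """Toy autoencoder; the layer sizes are arbitrary choices for illustration."""
    def __init__(self, n_inputs=16, n_latent=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_inputs, 8), nn.ReLU(),
            nn.Linear(8, n_latent), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 8), nn.ReLU(),
            nn.Linear(8, n_inputs), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

torch.manual_seed(0)
model = TinyAutoencoder()
model.eval()

captured = {}

def save_latent(module, inputs, output):
    # Called after every forward pass of the hooked module.
    captured["latent"] = output.detach().clone()

# Hook the encoder so its output (the latent representation) is recorded.
handle = model.encoder.register_forward_hook(save_latent)

batch = torch.rand(5, 16)  # 5 made-up samples with 16 input features each
with torch.no_grad():
    reconstruction = model(batch)
handle.remove()  # detach the hook once the embeddings are collected

latent = captured["latent"]
print(latent.shape)  # torch.Size([5, 4])
```

The hook leaves the model's own code untouched, which is why the same approach works on pre-trained models: the pipeline runs as usual and the intermediate output is saved on the side.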
My work on model analysis and embeddings extraction has provided a scalable method for running the processing pipeline on pre-trained models with new inputs and extracting intermediate layer outputs. It has also identified a few avenues for optimization that our lab will pursue in the future. Overall, my senior project has demonstrated a robust approach to extracting embeddings with forward hooks while rerunning the modeling pipeline on new data. In the future, these extracted embeddings can be used to develop a risk score from an individual’s genomic profile.
Thank you for reading, and following along for the past few weeks!
Sources
- Bhaskhar, Nandita. “Intermediate Activations – the Forward Hook.” Nandita Bhaskhar / Stanford University, 17 Aug. 2020, https://web.stanford.edu/~nanbhas/blog/forward-hooks-pytorch/.
- Browning, B. L., Zhou, Y., & Browning, S. R. (2018). A one-penny imputed genome from next-generation reference panels. The American Journal of Human Genetics, 103(3), 338–348. https://doi.org/10.1016/j.ajhg.2018.07.015
- Dias, R., Evans, D., Chen, S.-F., Chen, K.-Y., Loguercio, S., Chan, L., & Torkamani, A. (2022). Rapid, reference-free human genotype imputation with denoising autoencoders. ELife, 11. https://doi.org/10.7554/elife.75600
- Dias, R., & Torkamani, A. (2019). Artificial Intelligence in clinical and Genomic Diagnostics. Genome Medicine, 11(1). https://doi.org/10.1186/s13073-019-0689-8
- “Forward and Backward Function Hooks – NN Package.” Nn Package – PyTorch Tutorials 2.0.0+cu117 Documentation, PyTorch, https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html#forward-and-backward-function-hooks.
- Ghosh, S. K., Biswas, B., & Ghosh, A. (2019). Restoration of mammograms by using deep convolutional denoising auto-encoders. Advances in Intelligent Systems and Computing, 435–447. https://doi.org/10.1007/978-981-13-8676-3_38
- Li, Y., Willer, C., Sanna, S., & Abecasis, G. (2009). Genotype imputation. Annual Review of Genomics and Human Genetics, 10(1), 387–406. https://doi.org/10.1146/annurev.genom.9.081307.164242
- Ralapanawa, Udaya, and Ramiah Sivakanesan. Epidemiology and the Magnitude of Coronary Artery Disease and Acute Coronary Syndrome: A Narrative Review. Journal of Epidemiology and Global Health, vol. 11, no. 2, 2021, p. 169., https://doi.org/10.2991/jegh.k.201217.001
- Sarkar, E., Chielle, E., Gursoy, G., Mazonka, O., Gerstein, M., & Maniatakos, M. (2021). Fast and scalable private genotype imputation using machine learning and partially homomorphic encryption. IEEE Access, 9, 93097–93110. https://doi.org/10.1109/access.2021.3093005