Week 1: Introducing My Senior Project

March 10, 2023

Hi, my name is Vishak, and welcome to Week 1 of my Senior Project Blog! This week, I will cover some background information and prior research, and introduce my project and advisors.

My senior project is focused on developing an approach based on computational genomics to predict an individual’s risk for various common heritable diseases such as Coronary Artery Disease (CAD).

Motivation & Significance

CAD is the single largest cause of mortality and disability globally, with extremely high incidence in low- and middle-income countries. CAD accounts for over 7 million deaths and the loss of over 129 million life-years annually, and is implicated in more than a third of deaths in individuals older than 35 years of age. Individual risk prediction estimates the probability that an individual is susceptible to a disease, and is a key activity in clinical decision-making for diseases like CAD. Such predictions can serve as early warning signals and help prevent or delay the onset of disease. Moreover, when communicated and understood effectively, they can be a powerful tool for personal health management.

Background

We inherit our genes in large blocks from our parents’ genomes through a DNA sequence shuffling process called recombination. Recombination breaks and rejoins these blocks, so genetic variants that sit close together tend to be inherited together and are correlated with one another. Using the observed genotypes for a small fraction (typically 1–10%) of genetic variants, we can exploit these correlations to infer the genotypes at the known sites of genetic variation across the rest of the genome. This process of inferring the missing genotype information is called genotype imputation. It lets us reconstruct an individual’s common genetic variation without the prohibitive cost of directly sequencing their whole genome. It’s like a genetic Sudoku game! Because sequencing the genomes of an entire population is so expensive, genotype imputation is an essential tool for performing association studies across the whole genome and for implementing population health initiatives based on genotypes.
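To make the Sudoku analogy concrete, here is a minimal sketch of the idea in Python. It is a deliberately simplified toy, not the HMM or autoencoder methods discussed in this post: it fills in the missing sites by copying from the single reference sequence that best matches the observed sites. The function name, the tiny reference panel, and the 0/1 allele coding are all made up for illustration.

```python
# Toy genotype imputation by best match against a reference panel.
# All data and names are hypothetical; real methods are far more sophisticated.

MISSING = None  # marks a site that was not measured


def impute(observed, reference_panel):
    """Fill missing sites using the reference sequence that agrees best
    with the sites we did observe."""

    def mismatches(reference):
        # Count disagreements at observed (non-missing) sites only.
        return sum(
            1
            for obs, ref in zip(observed, reference)
            if obs is not MISSING and obs != ref
        )

    best = min(reference_panel, key=mismatches)
    # Keep every observed value; borrow from the best match where missing.
    return [ref if obs is MISSING else obs for obs, ref in zip(observed, best)]


# Each row is one reference sequence; each column is a variant site (0 or 1).
reference_panel = [
    [0, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 1, 1],
    [0, 1, 1, 0, 0, 1, 1, 0],
]

# Only three of the eight sites were measured for this individual.
observed = [1, MISSING, MISSING, 0, MISSING, MISSING, 1, MISSING]

imputed = impute(observed, reference_panel)
print(imputed)
```

The three observed sites agree perfectly with the second reference sequence, so the missing sites are copied from it. This only works because neighboring variants are correlated, which is exactly what recombination in large blocks gives us.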

The classic approach to genotype imputation uses Hidden Markov Models (HMMs) based on large-scale whole genome sequence reference panels. HMM-based algorithms for imputation are computationally intensive and require specialized high-performance computing clusters. They also require that the raw genomic data of individuals be provided through the reference panels, which raises privacy and scalability concerns.

Artificial neural network-based models are revolutionizing biomedical informatics. Recently, a special type of artificial neural network, called an autoencoder, has been identified as a promising alternative to expensive HMM-based algorithms because of its ability to fill in missing data in digital image restoration. Autoencoders are neural networks trained to reconstruct their original input data, so they are used in applications where we need to reduce the number of inputs to a model (dimensionality reduction of a large number of input variables), compress data, or perform de-noising or de-masking. Denoising autoencoders take corrupted or masked data (for example, data masked to preserve patient privacy) as input and are trained to predict the original uncorrupted data as the output.
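As a rough illustration of the "corrupted input, original output" idea, the sketch below builds one training pair for a denoising autoencoder by masking a fraction of the genotypes. It shows only the data preparation step, not the neural network itself. The `MASK` sentinel, the function name, the 0/1/2 genotype coding, and the 80% mask fraction are assumptions chosen for this example, not details taken from the lab's models.

```python
import random

MASK = -1  # hypothetical sentinel value meaning "this genotype is hidden"


def mask_genotypes(genotypes, mask_fraction, rng):
    """Return (corrupted_input, target) for one training example.

    A fraction of positions is replaced by MASK in the input, while the
    target keeps the original, uncorrupted genotypes.
    """
    n_masked = round(len(genotypes) * mask_fraction)
    masked_positions = set(rng.sample(range(len(genotypes)), n_masked))
    corrupted = [
        MASK if i in masked_positions else g for i, g in enumerate(genotypes)
    ]
    return corrupted, list(genotypes)


rng = random.Random(0)  # fixed seed so the example is repeatable
genotypes = [0, 1, 2, 0, 1, 1, 0, 2, 0, 1]  # copies of the alternate allele

corrupted, target = mask_genotypes(genotypes, 0.8, rng)
print(corrupted)
print(target)
```

During training, the network would see `corrupted` and be scored on how well its output matches `target`, so it has to learn the correlations between variants in order to fill in the hidden positions.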

Dr. Torkamani’s lab at Scripps Research Translational Institute, where I am performing my research project, has created a generalized approach for highly accurate genotype imputation using denoising autoencoders. Our lab has achieved better accuracy relative to prior algorithms in a computationally efficient manner without requiring the distribution or sharing of personal genomic data via reference panels.

My Project

In my senior project, I plan to analyze existing denoising autoencoders for imputation that were developed in Dr. Torkamani’s lab. I plan to apply the learnings from analyzing the prior autoencoders developed for human Chromosome-22 to the recently created autoencoder models for CAD and build a generalized model to predict the risk score for an individual developing CAD based on the imputed genomic data.
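For readers wondering what a genetic risk score can look like, one common starting point is a weighted sum: each variant contributes its genotype (0, 1, or 2 copies of the risk allele) multiplied by an effect weight. The sketch below shows that calculation in Python. This is an assumption for illustration only and not necessarily the model this project will end up using; the function name, genotypes, and weights are all made up.

```python
def risk_score(dosages, weights):
    """Weighted sum of allele dosages: sum of dosage_i * weight_i.

    `dosages` holds 0, 1, or 2 copies of the risk allele at each variant
    (possibly imputed); `weights` holds a per-variant effect size.
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical values for four variants.
dosages = [0, 1, 2, 1]
weights = [0.5, -0.25, 0.125, 1.0]

score = risk_score(dosages, weights)
print(score)
```

A raw score like this is usually only meaningful relative to the scores of other individuals, and accurate imputation matters because every missing or wrong genotype changes the sum.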

My Advisors

For this project, my internal advisor is Mrs. Bhattacharya, and my external advisor is Dr. Salvatore Loguercio, a staff researcher in Dr. Torkamani’s lab at Scripps Research Translational Institute (SRTI).

Thank you for reading, and see you next week!

Sources

Browning, B. L., Zhou, Y., & Browning, S. R. (2018). A one-penny imputed genome from next-generation reference panels. The American Journal of Human Genetics, 103(3), 338–348. https://doi.org/10.1016/j.ajhg.2018.07.015
Dias, R., Evans, D., Chen, S.-F., Chen, K.-Y., Loguercio, S., Chan, L., & Torkamani, A. (2022). Rapid, reference-free human genotype imputation with denoising autoencoders. ELife, 11. https://doi.org/10.7554/elife.75600
Dias, R., & Torkamani, A. (2019). Artificial Intelligence in clinical and Genomic Diagnostics. Genome Medicine, 11(1). https://doi.org/10.1186/s13073-019-0689-8
Ghosh, S. K., Biswas, B., & Ghosh, A. (2019). Restoration of mammograms by using deep convolutional denoising auto-encoders. Advances in Intelligent Systems and Computing, 435–447. https://doi.org/10.1007/978-981-13-8676-3_38
Li, Y., Willer, C., Sanna, S., & Abecasis, G. (2009). Genotype imputation. Annual Review of Genomics and Human Genetics, 10(1), 387–406. https://doi.org/10.1146/annurev.genom.9.081307.164242
Ralapanawa, U., & Sivakanesan, R. (2021). Epidemiology and the magnitude of coronary artery disease and acute coronary syndrome: A narrative review. Journal of Epidemiology and Global Health, 11(2), 169. https://doi.org/10.2991/jegh.k.201217.001
Sarkar, E., Chielle, E., Gursoy, G., Mazonka, O., Gerstein, M., & Maniatakos, M. (2021). Fast and scalable private genotype imputation using machine learning and partially homomorphic encryption. IEEE Access, 9, 93097–93110. https://doi.org/10.1109/access.2021.3093005
