Week 3: Autoencoders For Genomic Imputation
March 17, 2023
Welcome to Week 3 of my Senior Project Blog! This week, I will cover in detail the deep learning models called autoencoders that I will use for genotype imputation, along with the mechanics of the processing pipeline and model building.
As we saw in Week 2, genotype imputation has become an essential step in enhancing the power of genome-wide association studies, identifying functional SNPs or causal genetic variants, and discovering disease trait-associated loci in our genome. As missing nucleotides in genotype data can considerably limit the discovery or inference process, it becomes necessary to impute untyped or missing variants.
Current genotype imputation methods can be divided into two types: reference-based and reference-free approaches. Reference-based methods need a large-scale reference panel of individual genomes, such as Trans-Omics for Precision Medicine (TOPMed, which includes over 97,000 high-coverage genomes). They use the genotypes observed on an SNP array to match DNA segments shared between a target sample with missing values and the samples in the reference panel (Song et al., 2020). Most reference-based methods rely on Hidden Markov Models (HMMs). Examples include IMPUTE, BEAGLE, Minimac, and MACH. There are also web-based imputation services such as the TOPMed Imputation Server, the Michigan Imputation Server, and the Sanger Imputation Server (https://www.sanger.ac.uk/tool/sanger-imputation-service/) that rely on the same methods. However, these methods have two major drawbacks: the large computational cost of imputing many samples against a reference panel, and the difficulty of obtaining consent to share individual genome data for imputation (Das et al., 2018).
In contrast, reference-free imputation methods do not need a large reference panel dataset. AI-based models, which represent the state of the art for this class of methods, have revolutionized reference-free imputation because of their ability to accommodate large datasets and model highly non-linear relationships. Such methods range from classical machine learning algorithms such as singular value decomposition (SVD), k-nearest neighbors (KNN), and random forest (RF) to more sophisticated deep learning (DL) methods derived from image restoration and inpainting tasks.
Among DL methods, autoencoders (AEs) are especially suited for genotype imputation. An AE is a type of neural network trained to reconstruct its original input, which serves goals such as dimensionality reduction or compression and de-noising or de-masking. For example, an AE can take corrupted or masked data as input and be trained to predict the original uncorrupted data as output. Such AEs are called Denoising Autoencoders (DAEs). These characteristics are well suited to genotype imputation: a trained DAE can be applied to new samples without distributing a reference panel, and it can capture complex non-linear relationships across genomic regions.
Denoising Autoencoders
Denoising Autoencoders (DAEs) are unsupervised artificial neural networks that learn a low-dimensional latent (or hidden) space representation from high-dimensional input data. DAEs reconstruct output data from the learned representation using two components: an encoder and a decoder. The structure of an AE is shown below:
The decoder usually has an inverted, symmetric structure relative to the encoder: the number of nodes per layer usually decreases through the encoder and then increases through the decoder back to the size of the input. Among the different types of AE, a denoising AE receives data that has been corrupted by injecting noise into the original input, and it predicts the uncorrupted output. If we corrupt the input genotype data with missing values, the denoising AE learns to recover those missing values, which is exactly the genotype imputation task.
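To make the corruption step concrete, here is a minimal sketch of how a masked training pair could be built. It is only an illustration: the `mask_genotypes` helper, the `MISSING` marker, and the 30% masking rate are my own assumptions, not the lab's actual pipeline.

```python
import random

MISSING = None  # hypothetical marker for a masked (hidden) genotype


def mask_genotypes(genotypes, mask_rate, rng):
    """Return a copy of `genotypes` with roughly `mask_rate` of the entries hidden."""
    return [MISSING if rng.random() < mask_rate else g for g in genotypes]


rng = random.Random(0)  # fixed seed so the example is repeatable
original = [0, 1, 2, 0, 0, 1, 2, 2, 1, 0]  # alternate-allele counts, one per SNP
corrupted = mask_genotypes(original, mask_rate=0.3, rng=rng)

# One training pair: the model sees `corrupted` and is scored against `original`.
for clean, noisy in zip(original, corrupted):
    print(clean, "<-", "?" if noisy is MISSING else noisy)
```

During training, the network would be penalized for every position where its prediction differs from `original`, including the positions it never got to see.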
DAE Models for Imputation: Approach and processing pipelines
Our lab at Scripps Research has developed a generalized approach to human genotype imputation using sparse DAEs capable of highly accurate genotype imputation (98+%). Prior work in our lab built DAE models starting with all bi-allelic SNPs from the Haplotype Reference Consortium (HRC) dataset, which has 27,165 samples and 39,235,157 bi-allelic SNPs generated using whole-genome sequence data from 20 studies of predominantly European ancestry (83.92% European, 2.33% East Asian, 1.63% Native American, 2.17% South Asian, 2.96% African, and 6.99% mixed ancestry individuals). Each individual DAE was designed to span an independent genomic segment known as a “tile”, with boundaries defined by recombination hotspots in the HRC dataset, so that each segment fits into the video memory of the GPUs used in the compute cluster. Each DAE receives masked data as input and is trained to predict the original uncorrupted data as the output. Each bi-allelic SNP was encoded as two binary input nodes, representing the presence or absence of each allele. The basic architecture of the DAE model is shown in the figure below.
For example, one class of DAEs was trained on all 510,442 unique SNPs observed in HRC on human Chromosome-22. The imputation results are output in Variant Call Format (VCF), which we discussed in Week 2, with both the imputed genotypes and quality scores in the form of class probabilities for each of the three possible genotypes (homozygous reference AA, heterozygous AB, and homozygous alternate allele BB).
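The two-node encoding described above can be sketched in a few lines of Python. The function names are hypothetical, and representing a masked genotype as "neither allele present" is my assumption about how missing data could be fed to the network.

```python
def encode_genotype(genotype):
    """Map one genotype to [reference allele present, alternate allele present]."""
    table = {
        "AA": [1, 0],  # homozygous reference: only the reference allele
        "AB": [1, 1],  # heterozygous: both alleles
        "BB": [0, 1],  # homozygous alternate: only the alternate allele
        None: [0, 0],  # assumption: a masked genotype shows neither allele
    }
    return table[genotype]


def encode_sample(genotypes):
    """Flatten a list of genotypes into one input vector (2 nodes per SNP)."""
    flat = []
    for genotype in genotypes:
        flat.extend(encode_genotype(genotype))
    return flat


sample = ["AA", "AB", None, "BB"]
encoded = encode_sample(sample)
print(encoded)  # [1, 0, 1, 1, 0, 0, 0, 1]
```

With this layout, a tile containing N SNPs produces an input vector of 2 × N values.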
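As a rough illustration of how the three class probabilities could be turned into a genotype call, here is a small sketch. The helper name and the example numbers are made up, and the dosage formula (expected number of alternate alleles) is the standard definition rather than something taken from the lab's code.

```python
GENOTYPES = ("AA", "AB", "BB")


def call_genotype(probs):
    """Pick the most likely genotype from [P(AA), P(AB), P(BB)]."""
    if len(probs) != 3 or abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("expected three probabilities that sum to 1")
    best = max(range(3), key=lambda i: probs[i])
    # Dosage: expected count of the alternate allele B (between 0 and 2).
    dosage = probs[1] + 2.0 * probs[2]
    return GENOTYPES[best], probs[best], dosage


call, confidence, dosage = call_genotype([0.05, 0.90, 0.05])
print(call, confidence, round(dosage, 3))
```

The probability of the winning class serves as a per-genotype quality score: a call made at 0.90 is more trustworthy than one made at 0.40.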
Now that we understand how autoencoders can be used for genotype imputation, I will analyze the structure and explore other characteristics of DAE models of Chromosome-22 in the upcoming week.
Thank you for reading, and see you next week!
Sources:
Browning, B. L., Zhou, Y., & Browning, S. R. (2018). “A one-penny imputed genome from next-generation reference panels”. The American Journal of Human Genetics, 103(3), 338–348.
Dias, R., Evans, D., Chen, S.-F., Chen, K.-Y., Loguercio, S., Chan, L., & Torkamani, A. (2022). “Rapid, reference-free human genotype imputation with denoising autoencoders”. ELife, 11. https://doi.org/10.7554/elife.75600.
Dias, R., & Torkamani, A. (2019). “Artificial Intelligence in clinical and Genomic Diagnostics”. Genome Medicine, 11(1). https://doi.org/10.1186/s13073-019-0689-8.
Das, Sayantan, et al. “Genotype Imputation from Large Reference Panels.” Annual Review of Genomics and Human Genetics, vol. 19, no. 1, 2018, pp. 73–96., https://doi.org/10.1146/annurev-genom-083117-021602.
Song, Meng, et al. “A Review of Integrative Imputation for Multi-Omics Datasets.” Frontiers in Genetics, vol. 11, 2020, https://doi.org/10.3389/fgene.2020.570255.
TorkamaniLab. “Imputation_autoencoder/Example.vcf TorkamaniLab/imputation_autoencoder.” GitHub, 25 May 2022, https://github.com/TorkamaniLab/Imputation_Autoencoder/blob/master/test/example.vcf.
Week 4: Analyzing Denoising Autoencoders (DAEs) For Chromosome-22 Imputation
Mar 31, 2023
Welcome to Week 4 of my Senior Project Blog! This week, I will cover how I analyzed the structure of Denoising Autoencoder (DAE) models that were built in our lab for Chromosome-22.
As we saw previously in Week 3, DAEs are unsupervised artificial neural networks that learn a low-dimensional latent (or hidden) space representation from high-dimensional input data.
DAE Models for Chromosome-22 Imputation
Our lab at Scripps Research has developed a generalized approach to human genotype imputation using denoising autoencoders (DAEs) trained on 256 fragments or “tiles” of chromosome 22. The models have superior imputation accuracy (90+%) compared to other methods like the Hidden Markov Models that we saw in the Week 1 blog, and they can perform imputation tasks four times faster. These DAE models were built with HRC panel data using PyTorch, an open-source machine learning framework developed by Facebook that allows us to create our own neural networks and optimize them efficiently. While alternatives to PyTorch such as TensorFlow (developed by Google), JAX, and Caffe are also popular, PyTorch is well established, has a huge developer community, is very flexible, and is especially popular among ML researchers. Once we understand one of these frameworks, it becomes easy to learn the others and port code between them because they use similar concepts and ideas.
The most fundamental concept in neural network frameworks is the tensor, the equivalent of a NumPy array (from Python’s NumPy library): a homogeneous multidimensional table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. A tensor generalizes vectors and matrices to any number of dimensions: a vector is a 1-D tensor, a matrix is a 2-D tensor, and so on. When working with neural networks, we commonly encounter tensors of various shapes and dimensions. Tensors support GPU acceleration, which is a critical feature of PyTorch. A GPU can perform many thousands of small operations in parallel, making it efficient for the large matrix operations required to build and train neural networks. GPUs can accelerate the training of neural networks by up to a factor of 10. For interested readers, a tutorial on using PyTorch autoencoders to analyze and reconstruct images from the CIFAR dataset is available in the University of Amsterdam Deep Learning Tutorials listed in the sources below.
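Here is a short sketch of tensors in PyTorch, using only standard PyTorch calls. The shapes are arbitrary examples I picked (for instance, 4 samples by 6 input nodes), and the code falls back to the CPU when no GPU is available.

```python
import torch

vector = torch.tensor([0.0, 1.0, 2.0])  # 1-D tensor (a vector)
matrix = torch.zeros(4, 6)              # 2-D tensor, e.g. 4 samples x 6 input nodes
cube = torch.zeros(2, 4, 6)             # 3-D tensor, e.g. 2 batches of such matrices

print(vector.ndim, matrix.ndim, cube.ndim)  # 1 2 3
print(matrix.shape, matrix.dtype)

# Use the GPU if one is present, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
weights = torch.randn(6, 6, device=device)

# A matrix multiplication like this is the core operation inside a linear layer.
result = matrix.to(device) @ weights
print(result.shape, result.device)
```

The same code runs unchanged on either device, which is what makes moving a model onto a GPU so convenient.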
Each tile or fragment model has a specific DAE network architecture. An example is shown below:
In the example, fragment #1, which spans the genomic loci 19911923-20228616 on Chromosome 22, is represented by an encoder with 2 linear layers (1 input layer followed by 1 hidden layer), each of which has 9922 inputs and 9922 outputs and uses a hyperbolic tangent (tanh) activation function to capture non-linearities. The output of the encoder is the compressed or latent representation, which is then fed to a symmetric 2-layer decoder block that expands it back to the output.
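A network of this shape could be written in PyTorch roughly as follows. This is a sketch, not the lab's model definition: `build_flat_dae` is a hypothetical helper, the final sigmoid is my assumption for squashing outputs into the 0 to 1 range, and the demo uses 16 inputs instead of 9922 to keep it light.

```python
import torch
from torch import nn


def build_flat_dae(n_inputs):
    """Build a 2-layer encoder and a mirrored 2-layer decoder of constant width."""
    encoder = nn.Sequential(
        nn.Linear(n_inputs, n_inputs), nn.Tanh(),
        nn.Linear(n_inputs, n_inputs), nn.Tanh(),
    )
    decoder = nn.Sequential(
        nn.Linear(n_inputs, n_inputs), nn.Tanh(),
        # Assumption: a sigmoid on the last layer so each output lies in [0, 1].
        nn.Linear(n_inputs, n_inputs), nn.Sigmoid(),
    )
    return nn.Sequential(encoder, decoder)


model = build_flat_dae(n_inputs=16)  # the real tile would use n_inputs=9922
batch = torch.zeros(3, 16)           # 3 samples with every input masked to 0
with torch.no_grad():
    reconstruction = model(batch)

print(model)
print(reconstruction.shape)  # torch.Size([3, 16])
```

At the real size, each 9922 x 9922 linear layer holds about 98 million weights, which gives a sense of why each tile has to fit into GPU memory.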
The best model weights after training for each tile are located in subfolders which are named after each fragment. Inside each folder, underneath another sub-folder, there are 3 files: a params.py file (which holds model parameters), a position file with a .pos extension (a position file indicating the imputations in that range of loci), and a file with a .pth extension which holds the best PyTorch model with weights.
Each model parameter file contains a number of informative lines specifying various parameters such as the number of layers, the activation functions used, and the number of training steps. The top portion of the parameter file for a fragment spanning loci 28166665-28198163 on Chromosome-22 is shown below:
The above model has 8 layers with a size ratio (# of outputs/# of inputs) of 0.7 and uses the LeakyReLU activation function.
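To see what a size ratio of 0.7 implies for layer widths, here is a small sketch. It assumes the ratio is applied at every encoder layer and that the decoder mirrors the encoder; I have not confirmed that this is how the lab's code interprets the parameter, and the 1000-input tile is a made-up example.

```python
def layer_sizes(n_inputs, n_layers, size_ratio):
    """Return the width at each layer boundary for a symmetric autoencoder."""
    if n_layers % 2 != 0:
        raise ValueError("a symmetric autoencoder needs an even number of layers")
    sizes = [n_inputs]
    # Encoder half: each layer outputs size_ratio times its input width.
    for _ in range(n_layers // 2):
        sizes.append(max(1, round(sizes[-1] * size_ratio)))
    # Decoder half: mirror the encoder widths back up to the input size.
    return sizes + sizes[-2::-1]


funnel = layer_sizes(n_inputs=1000, n_layers=8, size_ratio=0.7)
print(funnel)  # [1000, 700, 490, 343, 240, 343, 490, 700, 1000]

flat = layer_sizes(n_inputs=1000, n_layers=4, size_ratio=1.0)
print(flat)    # [1000, 1000, 1000, 1000, 1000]
```

A ratio below 1.0 gives the classic funnel shape, while a ratio of exactly 1.0 keeps every layer the same width.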
To load each tile model and determine its structure, I wrote a Python function that takes as input the fragment name and the 3 files (position, PyTorch model, and parameter files) present in each subfolder, and uses PyTorch’s built-in functions to read them. It first reads the parameter file and the position file to identify how many SNPs are present in the tile. Then it loads the state dictionary (state_dict), which stores the model’s learnable parameters, such as the weights and biases of each layer, in dictionary (key, value) format so that the final trained neural network can be loaded into memory. The number of layers and the size ratios can be read off the shapes of the weight tensors in the state_dict, while the activation functions are listed in the parameter file.
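The idea of reading a network's structure from its state_dict can be sketched as below. This is not my actual analysis function: `describe_layers` is a simplified stand-in, the tiny model replaces a real tile, and an in-memory buffer replaces the .pth file. It also assumes the file stores a state_dict rather than a whole pickled model.

```python
import io

import torch
from torch import nn


def describe_layers(state_dict):
    """List (name, n_inputs, n_outputs) for every 2-D weight tensor."""
    layers = []
    for name, tensor in state_dict.items():
        if name.endswith("weight") and tensor.ndim == 2:
            # nn.Linear stores its weight with shape (n_outputs, n_inputs).
            n_outputs, n_inputs = tensor.shape
            layers.append((name, n_inputs, n_outputs))
    return layers


# Stand-in for a tile's .pth file: save a tiny model's state_dict to memory.
toy_model = nn.Sequential(nn.Linear(8, 6), nn.Tanh(), nn.Linear(6, 8))
buffer = io.BytesIO()
torch.save(toy_model.state_dict(), buffer)
buffer.seek(0)

# map_location="cpu" lets weights trained on a GPU load on a machine without one.
state_dict = torch.load(buffer, map_location="cpu")
layers = describe_layers(state_dict)
for name, n_in, n_out in layers:
    print(f"{name}: {n_in} -> {n_out} (size ratio {n_out / n_in:.2f})")
```

Notice that the Tanh layer does not appear in the output: activation functions have no learnable parameters, so they leave no trace in the state_dict.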
Using the above approach, I analyzed all the 256 tile fragment models to determine their network architectures.
What I found interesting was that the best models did not reduce the dimensions of the inputs (i.e. the size ratio was mostly 1.0). This means that all the inputs were needed to represent the knowledge embodied in the SNP data for that fragment. The encoders and decoders did not have a funnel shape but were more like a flat tube, where each layer had the same number of inputs as outputs. Among the best models, tanh is the most common activation function, and most have only 4 or 6 layers (2 or 3 layers each for the encoder and decoder) and are symmetric (the encoder and decoder are mirror images in terms of layers).
Now that I have learned how to load and analyze the DAE models for Chromosome-22, I will analyze the structure more deeply in the upcoming weeks to identify whether more compressed representations are possible using techniques of dimensionality reduction.
Thank you for reading, and see you next week!
Sources
- Dias, R., Evans, D., Chen, S.-F., Chen, K.-Y., Loguercio, S., Chan, L., & Torkamani, A. (2022). “Rapid, reference-free human genotype imputation with denoising autoencoders”. ELife, 11. https://doi.org/10.7554/elife.75600.
- Lippe, Phillip. “University of Amsterdam Deep Learning Tutorials.” Welcome to the UvA Deep Learning Tutorials! – UvA DL Notebooks v1.2 Documentation, 2022, https://uvadlc-notebooks.readthedocs.io/en/latest/.
- “Numpy Quickstart.” NumPy Quickstart – NumPy v1.25.dev0 Manual, 2023, https://numpy.org/devdocs/user/quickstart.html.
- “PyTorch Tutorials.” Welcome to PyTorch Tutorials – PyTorch Tutorials 2.0.0+cu117 Documentation, 2023, https://pytorch.org/tutorials/.
- TorkamaniLab. “Imputation_autoencoder/Example.vcf TorkamaniLab/imputation_autoencoder.” GitHub, 25 May 2022, https://github.com/TorkamaniLab/Imputation_Autoencoder/blob/master/test/example.vcf.