Week 8: Running Imputation Pipelines To Extract Embeddings
Welcome to Week 8 of my Senior Project Blog! Progress was a bit slower this week because, before I could extract the embeddings for any layer, I had to work out which input files I needed to run the CAD models for specific tiles or fragments. In this post I will outline the structure of the CAD input VCF files and explain how I found the matching input VCF file for a fragment.
AE Imputation Models for CAD
To recap the main ideas from the prior weeks, statistical approaches to genomic imputation pose computational and privacy barriers, especially in regions with large and complex linkage disequilibrium. Our lab has developed an autoencoder-based approach to genomic imputation that offers superior imputation accuracy and 4x faster imputation. Using this approach, our lab has created autoencoder (AE) models for 184 fragments, or tiles, containing CAD loci across the human chromosomes.
Each VCF input file used to train the CAD loci model for a fragment has the structure shown below. The first column (#CHROM) gives the chromosome, and the second column (POS) gives the position within that chromosome where the CAD locus is present. The reference and alternate nucleotides and the individual samples appear in the subsequent columns.
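To make the column layout concrete, here is a minimal sketch of reading those columns in Python. The VCF snippet, the sample names, and the `parse_vcf` helper are all made up for illustration; the only thing it relies on is the standard VCF layout, in which nine fixed columns (#CHROM through FORMAT) come before the per-sample columns.

```python
import io

# Hypothetical two-sample VCF snippet; every value here is made up.
SAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample_1\tsample_2\n"
    "9\t1000123\trs_example\tG\tC\t.\tPASS\t.\tGT\t0|1\t1|1\n"
)

def parse_vcf(handle):
    """Return (sample_names, records) from an uncompressed VCF text stream."""
    samples, records = [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the nine fixed columns
            continue
        records.append({
            "chrom": fields[0],
            "pos": int(fields[1]),
            "ref": fields[3],
            "alt": fields[4],
            "genotypes": dict(zip(samples, fields[9:])),
        })
    return samples, records

samples, records = parse_vcf(io.StringIO(SAMPLE_VCF))
first = records[0]
print(samples)
print(first["chrom"], first["pos"], first["ref"], first["alt"], first["genotypes"])
```

Each record keeps the chromosome, position, reference and alternate alleles, and a per-sample genotype lookup, which is all the structure described above.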
To run the imputation pipeline, we need to load the model for each tile or fragment from the CAD model PyTorch files and locate the corresponding input VCF file used to train that model. Because there were multiple models, and the training and pipeline runs took place on multiple occasions, I had to write a Python script that first checks whether the corresponding input VCF file exists for each AE model and then extracts it from a zip archive and copies it locally.
First, we extract the chromosome name and tile location, then navigate to the root path where the ARIC input VCF files are stored and look for a VCF with the same name pattern (<Chromosome number>_<fragment>). We then uncompress the VCF and copy it over to my working area. The code and the console output for the Python script are shown below. (Note: to protect sensitive information, I have blanked out parts of the folder names in the code below.)
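For readers who want to try something similar, the lookup steps can be sketched roughly as follows. This is an illustration under assumptions, not the exact script I ran: the model file naming (`<chromosome>_<start>-<end>...pt`), the `.zip` archive naming, and the helper names `tile_key` and `fetch_input_vcf` are all hypothetical, and the demo builds a throwaway folder instead of touching the real ARIC paths.

```python
import re
import shutil
import tempfile
import zipfile
from pathlib import Path

# Hypothetical model file naming: "<chromosome>_<start>-<end>...pt".
MODEL_NAME = re.compile(r"^(?P<chrom>\d+)_(?P<tile>\d+-\d+)")

def tile_key(model_file):
    """Build the '<chromosome>_<fragment>' key from a model file name."""
    match = MODEL_NAME.match(Path(model_file).name)
    if match is None:
        raise ValueError(f"unexpected model file name: {model_file}")
    return f"{match['chrom']}_{match['tile']}"

def fetch_input_vcf(model_file, vcf_root, work_dir):
    """Extract the VCF matching the model's tile into work_dir.

    Searches vcf_root recursively for a zip archive whose name starts with
    the tile key. Returns the extracted path, or None if nothing matches.
    """
    key = tile_key(model_file)
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    for archive in sorted(Path(vcf_root).rglob(f"{key}*.zip")):
        with zipfile.ZipFile(archive) as zf:
            for member in zf.namelist():
                if member.endswith(".vcf"):
                    target = work_dir / Path(member).name
                    # Stream the archive member straight into the working area.
                    with zf.open(member) as src, open(target, "wb") as dst:
                        shutil.copyfileobj(src, dst)
                    return target
    return None

# Demo on a temporary folder that stands in for the real input root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "vcf_root" / "chr9"
    root.mkdir(parents=True)
    with zipfile.ZipFile(root / "9_1000-2000.vcf.zip", "w") as zf:
        zf.writestr("9_1000-2000.vcf", "##fileformat=VCFv4.2\n")
    found = fetch_input_vcf("9_1000-2000_model.pt",
                            Path(tmp) / "vcf_root", Path(tmp) / "work")
    print(found.name if found else "no matching VCF")
```

Returning `None` when no archive matches keeps the "does the input file exist?" check separate from the copy step, so a batch run over many models can simply log the missing tiles and continue.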
Now that I have learned how to extract the VCF input files used to train the AE models, next week I will explore how to extract the latent representations and embeddings for each model using the input VCF files that I identified.
Thank you for reading, and see you next week!
- Dias, R., Evans, D., Chen, S.-F., Chen, K.-Y., Loguercio, S., Chan, L., & Torkamani, A. (2022). Rapid, reference-free human genotype imputation with denoising autoencoders. ELife, 11. https://doi.org/10.7554/elife.75600.
- Lippe, Phillip. “University of Amsterdam Deep Learning Tutorials.” Welcome to the UvA Deep Learning Tutorials! – UvA DL Notebooks v1.2 Documentation, 2022, https://uvadlc-notebooks.readthedocs.io/en/latest/.
- TorkamaniLab. “Imputation_autoencoder/Example.vcf TorkamaniLab/imputation_autoencoder.” GitHub, 25 May 2022, https://github.com/TorkamaniLab/Imputation_Autoencoder/blob/master/test/example.vcf.