Week 2: Genotype Imputation: Process & Data Formats

March 17, 2023

Welcome to Week 2 of my Senior Project Blog! This week, I will cover genotype imputation in detail: the mechanics of how it is done and the file formats in which the data is presented. This will be really important for the analysis and model building later on.

Advances in genomics have substantially lowered the cost of genome sequencing, generating a cornucopia of data that has made association studies and whole-genome studies feasible and efficient. For example, the cost of sequencing a single human genome fell from roughly $100 million in 2001 to well below $1,000 in 2021. That drop is even more dramatic than Moore’s law, the observation that the number of transistors on a chip (and with it computing capability) doubles roughly every two years!
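To put that comparison in numbers, here is a rough back-of-the-envelope calculation. It only uses the approximate figures quoted above ($100 million in 2001, $1,000 in 2021) and treats Moore’s law as a simple “factor of 2 every 2 years” pace, so it is a sketch rather than a precise analysis.

```python
import math

# Approximate figures quoted in the post (not exact measurements).
cost_2001 = 100_000_000  # USD per genome in 2001
cost_2021 = 1_000        # USD per genome in 2021
years = 2021 - 2001

fold_drop = cost_2001 / cost_2021   # how many times cheaper sequencing became
halvings = math.log2(fold_drop)     # how many times the cost halved
halving_time = years / halvings     # average years per halving

# For comparison: a Moore's-law-style pace of one factor of 2 every 2 years.
moore_fold = 2 ** (years / 2)

print(f"Sequencing cost fell {fold_drop:,.0f}-fold ({halvings:.1f} halvings)")
print(f"Average halving time: {halving_time:.2f} years")
print(f"A 2-year pace over {years} years gives only {moore_fold:,.0f}-fold")
```

Under these assumptions the cost of sequencing halved about every 1.2 years, compared with every 2 years for a Moore’s-law pace.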

As we discussed in the Week 1 post, genetic recombination shuffles paternal and maternal DNA sequences, producing new combinations of genetic variants. A set of DNA variants along a single chromosome that tends to be inherited together is called a haplotype. Using markers that cover only a small fraction (typically 1–10%) of the entire DNA sequence, we can infer useful information about traits and variations across the entire genome.

The most commonly used markers are Single Nucleotide Polymorphisms (SNPs, pronounced “snips”): single base-pair variations at specific locations where sequence alternatives (called alleles) exist in the population, so that an individual’s genome may differ from the reference nucleotide (A, T, G, or C). Our genomes have millions of such SNP markers, which can be used as mileposts to study genes and regions of interest, for example those associated with specific diseases.

Gene chip technologies called SNP chips or bead arrays enable us to use SNPs that appear consistently within the human population to study the underlying genes. The SNPs on these chips are typically bi-allelic, meaning there are two alleles at each SNP location. For example, if the two alleles are A and B, the possible genotypes are AA, AB, and BB, because we inherit one copy from each parent; AA and BB are homozygous, while AB is heterozygous.

As an example, commercial SNP chips provide multi-ethnic genome-wide content with clinical research variants and quality control (QC) markers. These are built around programs such as the NIH’s All of Us Research Program, which aims to enroll one million or more US volunteers and use their health data to improve health outcomes and devise new treatments. Such chips are built on a high-density global SNP backbone and can be optimized for cross-population coverage. They are increasingly used for large-scale genomics and screening applications: for example, to develop polygenic risk scores and to characterize the genetic architecture of diverse populations.

SNP chips use many thousands of beads, each seated in a tiny well and coated with probes that target a SNP at a particular location in the genome. These probes bind to DNA sequences in the sample and, when excited by a laser, emit a signal that conveys the allelic ratio at that genetic locus. Clustering algorithms then sort the signals into homozygous or heterozygous groups and produce a text file as output. Because SNP analysis produces such large datasets, software for processing and analyzing the output becomes critical in genomic studies.
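To make the idea of turning an allelic ratio into a genotype more concrete, here is a minimal sketch. Real genotype-calling software clusters signals across many samples; this toy version swaps that for two fixed thresholds, and the function names, threshold values, and signal intensities are all made up for illustration.

```python
def allelic_ratio(signal_a: float, signal_b: float) -> float:
    """Fraction of the total signal coming from allele B (0 = all A, 1 = all B)."""
    total = signal_a + signal_b
    if total <= 0:
        raise ValueError("no signal detected for this probe")
    return signal_b / total


def call_genotype(signal_a: float, signal_b: float,
                  low: float = 0.25, high: float = 0.75) -> str:
    """Assign AA, AB, or BB using fixed thresholds (a stand-in for clustering)."""
    ratio = allelic_ratio(signal_a, signal_b)
    if ratio < low:
        return "AA"   # almost all signal from allele A: homozygous
    if ratio > high:
        return "BB"   # almost all signal from allele B: homozygous
    return "AB"       # roughly equal signal: heterozygous


# Number of copies of allele B, a common numeric encoding for later analysis.
DOSAGE = {"AA": 0, "AB": 1, "BB": 2}

# Hypothetical (allele A, allele B) signal intensities for three samples.
samples = {
    "sample_1": (950.0, 40.0),
    "sample_2": (510.0, 480.0),
    "sample_3": (30.0, 900.0),
}

calls = {name: call_genotype(a, b) for name, (a, b) in samples.items()}
dosages = {name: DOSAGE[genotype] for name, genotype in calls.items()}
print(calls)
print(dosages)
```

The 0/1/2 dosage encoding at the end is a common way to hand genotypes to statistical or machine learning models.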

Because the human genome contains millions of SNP loci, each with its own combination of alleles, we need software to reconstruct the haplotypes from the observed genotypes so that each allele is assigned to the correct (paternal or maternal) chromosome. This process is called phasing.
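A small sketch shows why phasing is a real problem. Unphased genotypes tell us which two alleles are present at each SNP but not which chromosome each one sits on, so every heterozygous site adds ambiguity. The function below (a hypothetical helper written for this post, not part of any phasing tool) simply lists every pair of haplotypes consistent with a set of genotypes.

```python
from itertools import product


def possible_phasings(genotypes):
    """List all haplotype pairs consistent with unphased bi-allelic genotypes.

    genotypes: list of strings such as "AA", "AB", "BB", one per SNP.
    Returns a sorted list of (haplotype_1, haplotype_2) string pairs.
    """
    het_sites = [i for i, g in enumerate(genotypes) if g[0] != g[1]]
    phasings = set()
    # Try both orientations at every heterozygous site.
    for flips in product([False, True], repeat=len(het_sites)):
        flip_at = dict(zip(het_sites, flips))
        hap1, hap2 = [], []
        for i, g in enumerate(genotypes):
            a, b = g[0], g[1]
            if flip_at.get(i, False):
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        # The two chromosomes are unordered, so store each pair in sorted order.
        phasings.add(tuple(sorted(("".join(hap1), "".join(hap2)))))
    return sorted(phasings)


genotypes = ["AB", "AA", "AB", "BB"]  # made-up genotypes at four SNPs
for hap_a, hap_b in possible_phasings(genotypes):
    print(hap_a, hap_b)
```

With k heterozygous sites there are 2^(k-1) possible haplotype pairs, so the number of candidates grows very quickly. Real phasing software uses population haplotype patterns to pick the most likely pair instead of enumerating them all.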

As outlined in Week 1, genotype imputation is the process of predicting the genotypes at SNPs that are not directly genotyped in the study sample. In other words, it means “filling in”, with an educated guess, the genotypes that were not directly measured in a sample of individuals. Typically, a reference panel of haplotypes at a dense set of SNP loci is used in genotype imputation studies. Genotype imputation is a key technique in genome-wide association studies (GWAS) and in studies of a focused region (called fine-mapping studies). Such in silico imputation can increase the number of SNPs available for association studies, improve fine-grained analysis, and help identify causal variants associated with diseases. In addition, imputation can be used to infer untyped variation such as copy number variants and insertions/deletions, to fill in missing data, and to correct genotyping errors. The process of genotype imputation is illustrated below.
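The core idea can be sketched in a few lines. This is a deliberately simplified “copy from the closest reference haplotype” approach, not the method used by real imputation software (which typically relies on statistical models over the whole reference panel). Alleles are coded as 0 (reference) and 1 (alternate), None marks an untyped SNP, and the panel and observed values are invented for illustration.

```python
MISSING = None  # marker for SNPs that were not genotyped


def impute_haplotype(observed, reference_panel):
    """Fill missing alleles by copying from the best-matching reference haplotype.

    observed: list of 0/1 alleles, with MISSING at untyped SNPs.
    reference_panel: list of fully typed haplotypes of the same length.
    """
    def mismatches(ref):
        # Count disagreements only at SNPs that were actually genotyped.
        return sum(1 for o, r in zip(observed, ref)
                   if o is not MISSING and o != r)

    best = min(reference_panel, key=mismatches)
    # Keep the measured alleles; borrow the rest from the best match.
    return [r if o is MISSING else o for o, r in zip(observed, best)]


# Toy reference panel: three haplotypes typed at six SNPs.
reference_panel = [
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [0, 1, 1, 1, 0, 0],
]

# Study sample typed at only three of the six SNPs.
observed = [1, MISSING, 0, MISSING, 0, MISSING]

imputed = impute_haplotype(observed, reference_panel)
print(imputed)
```

Here the second reference haplotype agrees with the sample at every typed SNP, so its alleles are used to fill in the three untyped positions.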

One of the most common text formats used in imputation studies is VCF, or Variant Call Format, which we will use extensively for both input and output. A VCF file is a text file with 3 sections: Meta-information, Header, and Variants. The meta-information lines always begin with “##” and hold parameters such as the format version, the reference panels used, and other annotations, primarily for documentation purposes. The header is a single line beginning with “#” that gives the column names: the chromosome, the position (in the DNA sequence, counted from the beginning of the chromosome), the unique ID of the locus, the reference and alternate nucleotides for the SNP, and various quality and filter fields, followed by the sample IDs that identify the individual human samples. In the example file below, the columns NA00001 to NA00003 hold the variant information for 3 different individuals. Below this column header is the actual variant data, one line per variant.
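To see how the three sections fit together, here is a minimal sketch of reading a VCF file by hand in Python. The embedded file contents are invented for illustration (only the sample IDs NA00001 to NA00003 come from the example discussed above), and the parser handles just the simple case where each sample column holds only a genotype (GT). For real projects, a dedicated VCF library would be a safer choice.

```python
# A tiny, made-up VCF file; columns are separated by tabs.
VCF_TEXT = """\
##fileformat=VCFv4.2
##source=illustrativeExample
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003
20\t14370\tsnp_1\tG\tA\t29\tPASS\t.\tGT\t0|0\t1|0\t1|1
20\t17330\tsnp_2\tT\tA\t3\tq10\t.\tGT\t0|0\t0|1\t./.
"""


def parse_vcf(text):
    """Split VCF text into meta-information, sample IDs, and variant records."""
    meta, samples, variants = [], [], []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("##"):        # meta-information section
            meta.append(line[2:])
        elif line.startswith("#"):       # header line with column names
            columns = line[1:].split("\t")
            samples = columns[9:]        # sample IDs follow the 9 fixed columns
        else:                            # variant data, one variant per line
            fields = line.split("\t")
            variants.append({
                "chrom": fields[0],
                "pos": int(fields[1]),
                "id": fields[2],
                "ref": fields[3],
                "alt": fields[4],
                "genotypes": dict(zip(samples, fields[9:])),
            })
    return meta, samples, variants


def alt_dosage(gt):
    """Count alternate alleles in a genotype like '0|1'; None if missing."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")


meta, samples, variants = parse_vcf(VCF_TEXT)
print(samples)
for v in variants:
    dosages = {s: alt_dosage(gt) for s, gt in v["genotypes"].items()}
    print(v["chrom"], v["pos"], v["ref"], v["alt"], dosages)
```

In the genotype column, 0 stands for the reference allele and 1 for the alternate allele, “|” means the genotype is phased, “/” means it is unphased, and “./.” marks a missing genotype, which is exactly the kind of gap imputation tries to fill.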

Now that we understand what genotype imputation is, how it works, the data that is generated, and the format of the text files, we will focus next on model building and on how deep-learning-based models can be used to perform accurate genomic imputation at scale.

Thank you for reading, and see you next week!

Sources

“All of Us Research Program Overview.” NIH, https://allofus.nih.gov/about/program-overview.
Dias, R., & Torkamani, A. (2019). Artificial Intelligence in clinical and Genomic Diagnostics. Genome Medicine, 11(1). https://doi.org/10.1186/s13073-019-0689-8.
“The Cost of Sequencing a Human Genome.” Genome.gov, https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost.
“Illumina Microarray Technology.” Illumina, https://www.illumina.com/science/technology/microarray.html.
Marchini, Jonathan, and Bryan Howie. “Genotype Imputation for Genome-Wide Association Studies.” Nature Reviews Genetics, vol. 11, no. 7, 2010, pp. 499–511., https://doi.org/10.1038/nrg2796.
TorkamaniLab. “Imputation_autoencoder/Example.vcf TorkamaniLab/imputation_autoencoder.” GitHub, 25 May 2022, https://github.com/TorkamaniLab/Imputation_Autoencoder/blob/master/test/example.vcf.

View more of Vishak S.'s posts.