Week 7: Investigating Clinical Significance
April 10, 2025
Hello! Welcome back to my Senior Project. This week, I made significant progress in an exciting, clinically relevant area: identifying and characterizing SNPs that are located within phosphorylation sequences!
First, let’s recap what my protein translocation project has addressed so far so that you can see where it leads. I began with a Nature phosphoproteome dataset, which aggregates thousands of mass spectrometry experiments to comprehensively characterize phosphosites across the proteome. Using the tool cNLS Mapper, I predicted the monopartite and bipartite nuclear localization sequences (NLSs) of these proteins, which are sequences that help regulate a protein’s transport from the cytoplasm to the nucleus. I also filtered this dataset to include only proteins that the literature reports as having dual cellular localization. From there, I identified phosphosites (phosphorylated protein residues) located within or near these NLSs, constructed their sequence motifs, and predicted the kinases that act on these sequences. Finally, I built kinase tree plots to visualize these results.
With these filtered datasets and results, the next question is: how do we connect this phosphosite information to actual disease phenotypes? As I explained last week, one key method is to identify single nucleotide polymorphisms (SNPs) near these phosphosites. These single-base variations can impair the kinase’s ability to recognize and phosphorylate that NLS-implicated sequence, which in turn can disrupt the protein’s movement between the cytoplasm and nucleus and keep it from carrying out its normal biological function.
This week, I spent a significant amount of time writing code to retrieve SNP data from the gnomAD database and map it to my phosphorylation sequence dataset. Before I could do this, I had to resolve a mismatch: my phosphorylation sequences are defined by protein residue coordinates, whereas the gnomAD SNP data is indexed by genomic coordinates! To convert between the two, I first retrieved the Ensembl ID for each protein and then used the Ensembl API to retrieve the genomic coordinates corresponding to each phosphorylation sequence. This process involved a lot of trial and error: debugging the API calls, handling a few proteins with unavailable data, and validating the results on a few test cases.
After that, I used the Selenium Python package to automate the retrieval of SNP data for the 800+ genes that contain my phosphosites, leaving me with 800+ CSV files of characterized SNPs and their protein consequences. Then, I scanned these CSVs for SNPs that fall within the phosphorylation sequences in my phosphosite dataset and wrote the results to a new CSV file with extensive information on each SNP, including its protein consequence, rsID, clinical classification (benign, pathogenic, unknown, etc.), and CADD score (a prediction of how deleterious the variant is). My resulting file had around 60,000 SNPs!
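For readers curious what the coordinate conversion and the overlap scan might look like in code, here is a simplified Python sketch, not the actual pipeline code. It assumes the Ensembl REST endpoint `map/translation/:id/:region` for converting protein residues to genomic coordinates. The function names, the variant dictionaries, and the sample response are all made up for illustration. One thing the sketch makes visible is that a single stretch of protein can map to several genomic segments when it spans an exon boundary.

```python
import json
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"


def translation_map_url(protein_id, start_aa, end_aa):
    """Build the Ensembl REST URL that maps protein residues to the genome."""
    return f"{ENSEMBL_REST}/map/translation/{protein_id}/{start_aa}..{end_aa}"


def parse_segments(payload):
    """Turn an Ensembl mapping response into (chromosome, start, end) tuples."""
    segments = []
    for mapping in payload.get("mappings", []):
        a, b = int(mapping["start"]), int(mapping["end"])
        # Store start <= end regardless of strand so range checks stay simple.
        segments.append((str(mapping["seq_region_name"]), min(a, b), max(a, b)))
    return segments


def fetch_genomic_segments(protein_id, start_aa, end_aa):
    """Network call: ask Ensembl where a protein window sits on the genome."""
    request = urllib.request.Request(
        translation_map_url(protein_id, start_aa, end_aa),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_segments(json.load(response))


def variants_in_segments(variants, segments):
    """Keep variants whose genomic position falls inside any segment."""
    hits = []
    for variant in variants:
        for chrom, start, end in segments:
            if variant["chrom"] == chrom and start <= variant["pos"] <= end:
                hits.append(variant)
                break  # one match is enough; avoid duplicate hits
    return hits


# Hypothetical response for a window that is split across two exons.
sample_payload = {
    "mappings": [
        {"seq_region_name": "7", "start": 100, "end": 120, "strand": 1},
        {"seq_region_name": "7", "start": 500, "end": 523, "strand": 1},
    ]
}
segments = parse_segments(sample_payload)

# Hypothetical variants; real rows would come from the gnomAD CSV files.
variants = [
    {"rsid": "rs_example_1", "chrom": "7", "pos": 110},
    {"rsid": "rs_example_2", "chrom": "7", "pos": 300},
    {"rsid": "rs_example_3", "chrom": "7", "pos": 523},
]
hits = variants_in_segments(variants, segments)
print([v["rsid"] for v in hits])
```

In this toy example, the variant at position 300 lies in the intron between the two segments, so it is not reported even though it sits between the window's outermost coordinates. That is why the check runs per segment rather than over one start-to-end range.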
In a meeting with my mentor, we decided on filters to narrow this list down further, since we are looking for a small number of SNPs that have real clinical value and are worth testing in the lab. Because we are specifically looking for missense mutations that might be clinically significant, I removed SNPs that do not change the resulting amino acid (silent mutations), SNPs that cause early termination of the protein (nonsense mutations), and frameshift mutations. In the end, I produced a list of around 25,000 SNPs. Next week, I will work with my mentor to explore additional filters. We will narrow down this list to focus mostly on pathogenic or likely pathogenic SNPs, cross-check them with known data sources, and maybe develop a plan to test them experimentally.
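The consequence filter can be sketched in a few lines of Python. This is an illustration under assumptions rather than the actual filtering script: the column names (`rsIDs`, `Protein Consequence`, `VEP Annotation`) are assumed to match the CSV export, the rows are invented, and the excluded labels are the standard Sequence Ontology terms for silent, nonsense, and frameshift variants.

```python
import csv
import io

# Sequence Ontology terms for the variant classes being removed.
EXCLUDED_CONSEQUENCES = {
    "synonymous_variant",   # silent: amino acid unchanged
    "stop_gained",          # nonsense: premature stop codon
    "frameshift_variant",   # reading frame shifted
}


def drop_excluded(rows, column="VEP Annotation"):
    """Return only the rows whose consequence is not in the excluded set."""
    return [row for row in rows if row[column] not in EXCLUDED_CONSEQUENCES]


# Invented rows standing in for one of the per-gene CSV files.
sample_csv = """rsIDs,Protein Consequence,VEP Annotation
rs_example_1,p.Ser10Leu,missense_variant
rs_example_2,p.Ser10Ser,synonymous_variant
rs_example_3,p.Arg12Ter,stop_gained
rs_example_4,p.Lys14fs,frameshift_variant
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
kept = drop_excluded(rows)
print([row["rsIDs"] for row in kept])
```

Filtering by exclusion, as described above, keeps any other consequence type that happens to be in the file. Keeping only rows labeled `missense_variant` would be a stricter alternative.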
Thanks for reading!
Comments
This is such an impressive update! It’s amazing how much work goes into making the data usable and meaningful for clinical research. Can’t wait to see what comes next!