Week 9: Coding & Working on the Poster
May 1, 2025
Hi everyone! Welcome back to my Senior Project.
First off this week, I added some key information to the methodological flowchart that I made last week. Namely, I added intermediate statistics (the number of proteins and phosphosites remaining at each step) so that readers can clearly see how each filtering step narrows the dataset. I also added the relevant statistics for the two SNP-finding methods: SNPs that fall directly on phosphosites, and SNPs located in phosphosites’ flanking sequences. Overall, this work made the flowchart more informative and useful.
This week also brought some challenges on the coding side. One of the things I have learned over the course of this project is how much iteration is necessary in programming, and in research in general. After obtaining the SNP results, the next key step was to compare them to positive control data. The positive control we are using is the gene HDAC4, which is known to be involved in nuclear import and to contain SNPs near NLS regions. I spent a lot of time this week researching HDAC4; a key paper I read and annotated was “Missense substitutions at a conserved 14-3-3 binding site in HDAC4 cause a novel intellectual disability syndrome” (Wakeling et al.). I also looked at HDAC4 mass spectrometry data, which gave me a sense of the functional scores and relative importance of different amino acid sites in the protein. My mentor also shared a list of SNPs in HDAC4 from the ClinVar database that have known pathogenic consequences.
When I compared this list to the SNPs produced by my computational analysis, I found that none of them were present in my results! This signaled that something was wrong, either in my code or in the databases I was using. After a lot of troubleshooting and investigation, I found two problems. First, the API I was using to extract and download SNP data was connected to the gnomAD database. This database provides external links to SNP information, but unfortunately it is not comprehensive, meaning that it did not contain a full list of SNPs for every gene. As it turns out, it was missing the pathogenic HDAC4 SNPs that my mentor shared from the ClinVar database. Second, going back to the code I wrote, I realized that the SNP data had not been downloaded for all of the 4,000+ proteins as I intended, because the server was overwhelmed by the number of requests.
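Both problems can be caught early with a couple of small checks. The following is a minimal Python sketch, not my actual pipeline code: the function names and the placeholder IDs in the usage example are hypothetical, and it assumes the downloaded data is held in a dictionary that maps each gene to its list of SNP records.

```python
def check_positive_control(control_ids, result_ids):
    """Report which positive-control SNPs appear in the pipeline results.

    Returns (found, missing, recall), where recall is the fraction of
    control SNPs that the pipeline recovered.
    """
    control = set(control_ids)
    found = control & set(result_ids)
    missing = control - found
    recall = len(found) / len(control) if control else 0.0
    return sorted(found), sorted(missing), recall


def genes_without_data(requested_genes, downloaded):
    """List the requested genes that have no SNP records at all.

    `downloaded` is assumed to map gene name -> list of SNP records;
    a missing key or an empty list both count as "no data".
    """
    return [gene for gene in requested_genes if not downloaded.get(gene)]


if __name__ == "__main__":
    # Placeholder IDs for illustration only, not real variants.
    found, missing, recall = check_positive_control(
        ["snpA", "snpB", "snpC"], ["snpX", "snpY"]
    )
    print(f"recovered {len(found)} of 3 control SNPs (recall {recall:.0%})")
    print("genes to re-download:",
          genes_without_data(["GENE1", "GENE2"], {"GENE1": ["snpX"]}))
```

A recall of zero on the positive control, or a long list of genes with no records, is a sign that the download or the data source needs another look before any biological interpretation.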
These observations made me go back to my workflow to see how I could improve it. First, I decided to re-download the SNP data from the ClinVar database instead of gnomAD. ClinVar has a comprehensive list of SNPs along with their functional and clinical consequences (it is where my mentor found the positive control data for HDAC4). This code took a few days to run, because I increased the pause between requests for one gene and the next to ensure that all of the SNP data was downloaded. Next, I decided to simplify the workflow I was using to find SNPs located in phosphorylation sequences, since my current approach was missing some key known SNPs. Before, I was converting the phosphosite data (which is given in amino acid coordinates) into genomic coordinates, and then finding SNPs whose genomic coordinates fall within these regions. However, I realized that I could use the SNPs’ protein consequence data, which includes the amino acid residue each SNP affects, and compare the phosphosite data and SNP data directly, without going through the laborious process of converting to genomic coordinates! This made the workflow a lot simpler and more intuitive.
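To illustrate the simplified comparison, here is a rough Python sketch of matching SNPs to phosphosites by amino acid position. It is an illustration under assumptions rather than my exact code: it assumes each SNP comes with an HGVS-style protein change string such as "p.Ser100Leu", that phosphosites are given as (gene, residue position) pairs, and that the flanking window is a number of residues on either side of the site. The gene names, SNP IDs, and positions in the usage example are made up.

```python
import re
from collections import defaultdict

# Captures the reference residue and its position from protein changes
# written like "p.Ser100Leu" or "p.(Ser100Leu)" (three-letter codes).
PROTEIN_CHANGE = re.compile(r"p\.\(?([A-Z][a-z]{2})(\d+)")


def residue_position(protein_change):
    """Return the affected residue number, or None if it cannot be parsed."""
    match = PROTEIN_CHANGE.search(protein_change)
    return int(match.group(2)) if match else None


def match_snps_to_phosphosites(snps, phosphosites, flank=0):
    """Find SNPs that land on or near a phosphosite in the same gene.

    snps:         iterable of (gene, snp_id, protein_change)
    phosphosites: iterable of (gene, residue_position)
    flank:        residues on each side of the site to include
                  (0 means the SNP must hit the phosphosite itself)

    Returns a list of (gene, snp_id, site, offset), where offset is the
    SNP position minus the phosphosite position.
    """
    sites_by_gene = defaultdict(set)
    for gene, position in phosphosites:
        sites_by_gene[gene].add(position)

    hits = []
    for gene, snp_id, change in snps:
        position = residue_position(change)
        if position is None:
            continue  # no usable protein-level annotation for this SNP
        for site in sorted(sites_by_gene.get(gene, ())):
            if abs(position - site) <= flank:
                hits.append((gene, snp_id, site, position - site))
    return hits


if __name__ == "__main__":
    # Made-up records for illustration only.
    snps = [
        ("GENE1", "snpA", "p.Ser100Leu"),
        ("GENE1", "snpB", "p.Arg103Gln"),
        ("GENE2", "snpC", "c.50A>G"),  # no protein change, so it is skipped
    ]
    phosphosites = [("GENE1", 100)]
    print(match_snps_to_phosphosites(snps, phosphosites))
    print(match_snps_to_phosphosites(snps, phosphosites, flank=7))
```

Because both datasets are already in amino acid coordinates, the comparison reduces to a per-gene lookup plus a distance check, with no coordinate conversion step where known SNPs could be lost.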
In addition to running this code, I worked on interpreting my results biologically. That is, after getting the SNP results, I looked into the functions of the genes the SNPs are located in, to identify potentially interesting SNPs for further investigation. One interesting result was a SNP (rs672601307) located in the gene USP8, which is involved in membrane trafficking and cargo sorting. Since my entire investigation centers on protein import and export, this gene could play a pivotal role. Additionally, I worked on planning my poster and assembling the necessary materials for it: the literature review I did at the beginning, the flowchart, and some of the results I have mentioned here. Currently, I am making a Google Docs outline of my poster, which I will later convert into the poster format.
Thanks for reading!