Week 3: Automating Data Collection and An Interesting Research Paper!
March 14, 2025
Thanks for joining me in week 3! Let’s get into some interesting science.
Thankfully, the Wynton HPC Cluster was working again on Monday! No more endless “queue, waiting” messages in the terminal. Hooray!
Last week, my mentor and I decided that I would run LoDEI on the white fat data to see how it compares to the brown fat results I had obtained. So, I unpacked and preprocessed the white fat data the same way I did for the brown fat data: unzipping the FASTQ files, aligning them to the reference genome, and generating BAM files. The process went smoothly this time, since I had already worked out all the errors with the brown fat data. Then, I re-ran LoDEI on both the brown and white fat datasets, this time with the revised parameters (the BAM files are unstranded, not stranded). The results are still inconclusive: LoDEI reports differential editing that is significant by q-value for A to T and G to T changes in the brown fat, and no significant differential editing in the white fat. After meeting with my mentor, we decided that I will look for a public dataset covering 2 distinct ADAR1-related conditions in mice (e.g., ADAR1 wildtype and ADAR1 deficient) and check whether LoDEI produces the anticipated A to G differential editing results. If it does, this might indicate an error in how the RNA-seq data I’m working with was collected in the first place. Furthermore, LoDEI was originally designed for human data, and it uses q-values as a ranking measure. It is possible that these design aspects introduce incompatibilities with the lab’s data, which comes from mouse fat. For now, these are the results I have, and we may need to move on and focus more on the next project (below!).
Onto exciting developments! This week, I made significant progress on the protein shuttling project. Last week, I developed a framework for how I plan to go about this project. The challenge I faced at the beginning of this week was that the tools I am using (like cNLS Mapper and the UniProt protein sequence retriever) only have web interfaces. To test them on a few proteins, I can simply look up UniProt IDs on the UniProt website, copy-paste their sequences into cNLS Mapper, obtain HTML results, and visually inspect the Nature phosphoproteome database to identify phosphosites located in or next to the NLSs predicted by cNLS Mapper. But how do I automate the process of finding significant phosphosites in the massive (10,000+ proteins!) Nature databank that I’m working with?
The answer: APIs and a lot of web-scraping! I used the UniProt API to generate FASTA files for the 10,000+ proteins in the Nature dataset. From there, submitting data to cNLS Mapper and retrieving its output was a challenge because it doesn’t have an API. So, I wrote code to mimic a user submitting a form on the website (with the FASTA files I generated) for all of the proteins. Next, I saved the resulting HTML results (I used parallel processing to speed this up). Then, I used web-scraping to extract the necessary data from these HTML files: specifically, the predicted NLS sequences, their locations, their confidence measures, and their class (monopartite or bipartite). Finally, I wrote a Python function to identify phosphorylated serine/threonine sites in the Nature dataset that lie within or near these NLSs. I obtained a total of 9,136 significant phosphosites across 2,746 unique proteins from this analysis. After sharing these results with my mentor, we decided to check the sub-cellular localizations of the proteins that contain these significant phosphosites using the Human Protein Atlas (the proteins must be localized to the nucleus, cytosol, or both to be relevant). The next exciting project will be to conduct sequence motif analysis (finding consensus phosphorylation sites in these proteins!). By doing this, we can better understand whether there are certain conserved sequences related to nuclear import that protein kinases act on to regulate protein shuttling. As I produce more filtered data, new research questions open up about how we can use this data to investigate biological phenomena.
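For readers curious about the last step, here is a minimal sketch of how matching phosphosites to predicted NLSs could look. This is not my actual pipeline code: the `NLS` class, the `phosphosites_near_nls` function, the 1-based inclusive coordinates, and especially the 10-residue `flank` that defines "near" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class NLS:
    """One predicted NLS (hypothetical container, not cNLS Mapper's own format)."""
    start: int    # first residue of the NLS, 1-based inclusive (assumed convention)
    end: int      # last residue of the NLS, 1-based inclusive
    score: float  # confidence measure reported for the prediction
    kind: str     # "monopartite" or "bipartite"


def phosphosites_near_nls(
    sites: List[Tuple[int, str]],
    nls_list: List[NLS],
    flank: int = 10,  # assumed definition of "near"; tune for the real analysis
) -> List[Tuple[int, str, NLS]]:
    """Return (position, residue, NLS) for each S/T site within or near an NLS."""
    hits = []
    for position, residue in sites:
        # Only phosphorylated serine/threonine sites are of interest here.
        if residue not in ("S", "T"):
            continue
        for nls in nls_list:
            # A site counts if it lies inside the NLS or within `flank` residues of it.
            if nls.start - flank <= position <= nls.end + flank:
                hits.append((position, residue, nls))
    return hits


# Made-up example: one NLS spanning residues 20-30 and six phosphosites.
example_nls = [NLS(start=20, end=30, score=8.5, kind="monopartite")]
example_sites = [(5, "S"), (12, "T"), (25, "S"), (38, "Y"), (40, "T"), (41, "S")]

for position, residue, nls in phosphosites_near_nls(example_sites, example_nls):
    print(f"{residue}{position} is within or near the {nls.kind} NLS at {nls.start}-{nls.end}")
```

Setting `flank=0` would keep only the sites that fall strictly inside a predicted NLS, which is one way to see how much the "near" threshold drives the final count.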
And a serendipitous happening: I stumbled across an insightful research article from the Koch Institute (published just 4 days ago!). Researchers developed a new machine-learning tool, ProtGPS (https://github.com/pgmikhael/protgps), to predict a protein’s localization in the cell from its amino acid sequence. It was a very interesting read, and I’m starting to think about ways to apply this tool to the Nature dataset I’m analyzing. In this way, my project could not only identify NLS sequences in proteins but also relate these sequences to the proteins’ predicted intracellular localizations.
Until next week!