Week 2: Detecting RNA Editing Sites and Understanding Protein Shuttling
March 6, 2025
Hello, and welcome back to my Senior Project! This week was full of reading papers, writing Bash and Python scripts, and brainstorming algorithmic workflows (as well as discovering new music genres on BART!).
Let’s dive right in.
For my ADAR1 RNA-editing site detection project, I realized that I generated some files incorrectly last week because the GFF file (a descriptive file with genome annotations) that I was using was for the wrong mouse genome build. So, I first downloaded the correct GFF file from the Ensembl database. Then, I re-generated the reference genome indices and re-created the sequencing BAM files for the knockout and wildtype ADAR1 brown fat cells. These scripts took a few days to run. From there, I was finally ready to use the LoDEI (local differential editing index) tool to identify differential A-to-G RNA-editing sites between knockout and wildtype conditions! This process involved multiple iterations. For example, I learned that LoDEI only handles single-end RNA-seq data, so I had to modify its code to handle the lab’s paired-end data. I also had to process the mouse reference genome’s GFF file (removing the sex chromosome annotations) so that it was compatible with the LoDEI code.
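For anyone curious what the GFF cleanup step could look like, here is a minimal Python sketch. It is an illustration, not my exact script: the function name `drop_sex_chromosomes`, the set of chromosome spellings, and the toy annotation lines (with made-up coordinates) are all assumptions for the example.

```python
from typing import Iterable, Iterator

# Assumed spellings of the sex chromosome IDs; Ensembl files typically use
# "X"/"Y", while other sources prefix them with "chr".
SEX_CHROMS = frozenset({"X", "Y", "chrX", "chrY"})


def drop_sex_chromosomes(
    lines: Iterable[str], excluded: frozenset = SEX_CHROMS
) -> Iterator[str]:
    """Yield GFF lines, skipping features located on excluded chromosomes."""
    for line in lines:
        if line.startswith("##sequence-region"):
            # Pragma format: ##sequence-region <seqid> <start> <end>
            parts = line.split()
            if len(parts) > 1 and parts[1] in excluded:
                continue
            yield line
        elif line.startswith("#") or not line.strip():
            # Keep other header/comment lines and blank lines untouched.
            yield line
        else:
            # Column 1 of a tab-separated feature line is the sequence ID.
            seqid = line.split("\t", 1)[0]
            if seqid not in excluded:
                yield line


# Toy input with invented coordinates, just to show the behavior.
example = [
    "##gff-version 3\n",
    "##sequence-region 1 1 1000\n",
    "##sequence-region X 1 1000\n",
    "1\tensembl\tgene\t100\t200\t.\t+\t.\tID=gene:A\n",
    "X\tensembl\tgene\t300\t400\t.\t-\t.\tID=gene:B\n",
    "Y\tensembl\tgene\t500\t600\t.\t+\t.\tID=gene:C\n",
]
filtered = list(drop_sex_chromosomes(example))
```

On a real file, the same generator can be fed an open file handle and its output written line by line to a new GFF, so the whole annotation never has to sit in memory.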
I obtained preliminary results using LoDEI, showing differential A to T editing between knockout and wildtype ADAR1 brown fat cells. However, these results are not fully in line with expectations of ADAR1 editing: the A-to-G differential editing sites that I identified have high false discovery rates, which is unexpected, because A-to-G is the signature of ADAR1 (it converts adenosine to inosine, which sequencers read as G). After I met with my mentor, Dr. Wang, we decided that I will test LoDEI on the white fat cells to see how they compare to the brown fat results. Dr. Wang also shared details about how the mouse fat samples were prepared for RNA sequencing. From these details, I realized that I need to adjust some LoDEI parameters to run the code correctly (importantly, the RNA-seq data is unstranded, not reverse-stranded as I had assumed). So, while I have a basic proof of concept of LoDEI working, the results will likely be revised over the next week.
Now, switching gears to the protein translocation project I introduced last week! I read many lengthy (50+ page) papers on the topic, which helped me build background knowledge on the pathways of nuclear import and export of proteins and how phosphorylation regulates these shuttling events. My mentor shared a Nature mass spectrometry dataset consisting of phosphosites across the entire proteome. After a lot of brainstorming, I devised a multi-step methodology to analyze this dataset and identify phosphorylation-regulated shuttling proteins, focusing specifically on shuttling events from the cytoplasm to the nucleus. I will first identify nuclear localization signals (NLSs) in protein amino acid sequences and then find phosphorylated serine/threonine residues near or within these sequences. As part of this workflow, I will use a bioinformatics tool called cNLS Mapper (which predicts NLSs in an input protein sequence), developed originally for budding yeast cells! Luckily, the nuclear import pathway we are most interested in (importin α/β) is highly conserved in eukaryotes, so the tool should still make accurate predictions on mouse or human protein data. I tested cNLS Mapper on well-known phosphorylation-regulated shuttling proteins (HDAC4 and FOXO1) to verify that it works. I also wrote code to parse the Nature mass spectrometry dataset and filter it down from 20,000+ to 10,000 functionally significant phosphosites to analyze. I filtered based on p-value and q-value calculations.
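To make the "phosphosites near or within an NLS" step concrete, here is a rough Python sketch of how that matching could work. It is a simplified illustration under several assumptions: the function name `sites_near_nls`, the 1-based inclusive coordinates, and the `flank` window size are my own choices for the example, and the NLS spans are assumed to have already been extracted from a predictor's output. The toy protein is invented; it just embeds the classic SV40 NLS motif (PKKKRKV) so the positions are easy to follow.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), 1-based and inclusive


def sites_near_nls(
    sequence: str,
    nls_spans: List[Span],
    phosphosites: List[int],
    flank: int = 10,
) -> List[Tuple[int, str, Span]]:
    """Return phospho-Ser/Thr sites within `flank` residues of any NLS span."""
    hits = []
    for pos in phosphosites:
        residue = sequence[pos - 1]  # convert 1-based position to 0-based index
        if residue not in "ST":
            continue  # only serine/threonine sites are of interest here
        for start, end in nls_spans:
            if start - flank <= pos <= end + flank:
                hits.append((pos, residue, (start, end)))
                break  # report each site once, with the first matching NLS
    return hits


# Invented toy protein: M(1), A(2-10), PKKKRKV(11-17), AAS(18-20),
# A(21-50), T(51), Y(52).
toy_seq = "M" + "A" * 9 + "PKKKRKV" + "AAS" + "A" * 30 + "T" + "Y"
toy_nls = [(11, 17)]          # hypothetical predicted NLS span
toy_sites = [20, 51, 52]      # hypothetical phosphosite positions

hits = sites_near_nls(toy_seq, toy_nls, toy_sites, flank=5)
print(hits)  # [(20, 'S', (11, 17))]
```

In this toy case, only the serine at position 20 is reported: the threonine at 51 is too far from the NLS, and the tyrosine at 52 is skipped because it is not a serine or threonine. The right window size is an open question for the real analysis.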
I expect my code to continue running over the weekend, as the Wynton cluster is currently down (my terminal is full of “queue, waiting” messages). I’m excited to return next week to share updates on this project!