Week 9: Data Collection

May 3, 2025

Hi everyone, welcome back to another blog! In the last post, I walked through the methods that failed while I was building my algorithm, before I settled on one that worked: the BLAST algorithm.

This week, most of my effort went into making sure this BLAST algorithm was running as accurately and efficiently as possible. To understand what I edited and fixed, we have to start all the way at the beginning: data cleaning. When you analyze gene sequence data as a .txt file and, later, a FASTA file, each sequence record comes with a set of descriptors. There are two we need to be concerned with: the ID and the description. The local ID uniquely labels a sequence within a given dataset or system, but not necessarily globally. The description, however, contains a locus tag, an identifier that is systematically applied to every gene in a genome. Since BLAST uses the local ID by default, the data had to be edited so that the locus tag was used instead, which makes it easier to identify which E. coli proteins are affected.
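The relabeling step described above can be sketched in a few lines of Python. This is only a minimal illustration, not my exact script: it assumes headers in the NCBI style, where the description carries a bracketed `[locus_tag=...]` field, and the function name `relabel_with_locus_tag` is made up for this example.

```python
import re

# Assumed header style: ">lcl|... [gene=thrL] [locus_tag=b0001] [protein=...]"
LOCUS_TAG = re.compile(r"\[locus_tag=([^\]]+)\]")

def relabel_with_locus_tag(fasta_text):
    """Move the locus tag to the front of each header so it becomes the ID."""
    relabeled = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = LOCUS_TAG.search(line)
            if match:
                # Tools that read the first token as the ID will now see the locus tag.
                line = ">" + match.group(1) + " " + line[1:]
        # Sequence lines, and headers without a locus tag, pass through unchanged.
        relabeled.append(line)
    return "\n".join(relabeled) + "\n"
```

Headers that have no locus tag are left alone rather than dropped, so nothing is silently lost during cleaning.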

After the data cleaning was complete, I first calculated the percentage of genes that contained mutations. The screenshot below shows what the algorithm currently outputs.
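For anyone curious how a percentage like this can be computed, here is a rough sketch. It makes several assumptions that may not match my actual pipeline: the BLAST results are in the standard 12-column tabular format (`-outfmt 6`), the query ID in the first column is the locus tag, the first hit listed for each gene is treated as its best hit, and a gene counts as mutated if that hit has any mismatches or gap openings. The function names are hypothetical.

```python
def split_mutated(blast_lines):
    """Return (mutated_tags, all_tags) from BLAST tabular (outfmt 6) lines."""
    mutated, seen = set(), set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        locus_tag = fields[0]              # qseqid (assumed to be the locus tag)
        if locus_tag in seen:
            continue                       # keep only the first hit per gene
        seen.add(locus_tag)
        mismatches = int(fields[4])        # mismatch column
        gap_opens = int(fields[5])         # gapopen column
        if mismatches > 0 or gap_opens > 0:
            mutated.add(locus_tag)
    return mutated, seen

def percent_mutated(blast_lines):
    """Percentage of genes with a hit whose best hit is not a perfect match."""
    mutated, seen = split_mutated(blast_lines)
    return 100.0 * len(mutated) / len(seen) if seen else 0.0
```

Note that genes with no hit at all never appear in the tabular output, so under these assumptions they are not counted in the denominator.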

Afterwards, I iterated through the locus tags with mutations to build a list of the mutated proteins, making sure no protein appeared more than once. The output can be seen in the screenshot below as well.
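The de-duplication step might look something like the sketch below. It assumes the protein name can be read from a bracketed `[protein=...]` field in the same FASTA headers that carry the locus tag; both helper names are invented for illustration.

```python
import re

# Matches bracketed key=value fields such as "[locus_tag=b0001]".
HEADER_FIELD = re.compile(r"\[(\w+)=([^\]]+)\]")

def protein_by_locus_tag(headers):
    """Build a locus tag -> protein name lookup from FASTA header lines."""
    lookup = {}
    for header in headers:
        fields = dict(HEADER_FIELD.findall(header))
        if "locus_tag" in fields and "protein" in fields:
            lookup[fields["locus_tag"]] = fields["protein"]
    return lookup

def unique_mutated_proteins(mutated_tags, lookup):
    """Map mutated locus tags to protein names, dropping repeats but keeping order."""
    names = (lookup[tag] for tag in mutated_tags if tag in lookup)
    # dict.fromkeys keeps the first occurrence of each name in insertion order.
    return list(dict.fromkeys(names))
```

Using `dict.fromkeys` instead of a plain `set` keeps the proteins in the order they were first encountered, which makes the output easier to compare between runs.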

I was then able to sort these mutated proteins into five categories: Genetic Pathways, Signaling/Cell Division/Cell Stability, Metabolism, Other, and Unknown. The Genetic Pathways category was further split into Regulatory/Alteration, Duplication, and Transcription/Translation. This let me highlight the areas of cellular activity that were most heavily affected, and I hope to repeat the analysis with three other strains before the end of this project using this algorithm.
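Once each protein has a category, finding the most affected areas is just a tally. The sketch below assumes the category for each protein has already been assigned and only shows the counting; the `assignments` structure and the function name are hypothetical, while the category names are the ones listed above.

```python
from collections import Counter

# Categories from this post; only Genetic Pathways has subcategories.
CATEGORIES = {
    "Genetic Pathways": [
        "Regulatory/Alteration",
        "Duplication",
        "Transcription/Translation",
    ],
    "Signaling/Cell Division/Cell Stability": [],
    "Metabolism": [],
    "Other": [],
    "Unknown": [],
}

def tally_categories(assignments):
    """Count proteins per category, most affected first.

    assignments maps protein name -> (category, subcategory or None).
    """
    counts = Counter()
    for protein, (category, subcategory) in assignments.items():
        if category not in CATEGORIES:
            raise ValueError(f"unknown category for {protein}: {category}")
        if subcategory is not None and subcategory not in CATEGORIES[category]:
            raise ValueError(f"unknown subcategory for {protein}: {subcategory}")
        counts[category] += 1
    return counts.most_common()
```

Validating against the fixed category list catches typos early, which matters once the same tally is run across several strains.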

Within the next week, I hope to automate this process of categorizing proteins, run it across the different strains, and present the outputs as graphs. Until next time!

View more of Sonya S.'s posts.
