Week 8: BLAST Algorithm
April 23, 2025
Hi everyone, welcome back to my blog! In the last blog post, I went over some of the problems caused by the dataset cleaning process and how they affected my results. The first order of business this time was determining whether the strains I was comparing were compatible.
Because the O104:H4 strain that I had downloaded from NCBI is a pathogenic strain while NCM3722 is a lab strain, there was a higher chance of inversions, insertions, and deletions between the two. Therefore, I had to find another lab strain to compare against NCM3722 so the algorithm would have a fairer chance of being accurate. I decided to start even simpler with K-12 MG1655, which is the most common reference lab strain and is used widely in research. K-12 MG1655 and NCM3722 are evolutionarily close, since both are K-12 laboratory strains, so there should be relatively few insertions and deletions between them.
This week, I took a different approach. Before, I was trying to use simple sorting and indexing by gene ID to look for mutations between the two genomes. However, when these E. coli strains are sequenced and annotated, their genes are given locus tags that differ from strain to strain, even when they refer to orthologous, or equivalent, genes. I did some more research on methods used to compare genomes, and I landed on the BLAST algorithm, which I mentioned in my introductory blogs.
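To make the locus tag problem concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the tags, the short sequences, and the helper names are hypothetical, and the position-by-position comparison is only a toy stand-in for what BLAST does with real local alignments.

```python
# Toy data: the locus tags and sequences below are invented for illustration.
strain_a = {
    "A_0001": "ATGAAACGCATTAGC",
    "A_0002": "ATGGCTAGCTAGGAA",
}
strain_b = {
    "B_1017": "ATGAAACGCATTAGC",  # same sequence as A_0001, different tag
    "B_1018": "ATGGCTAGCTTGGAA",  # A_0002 with a single substitution
}

# Matching on locus tags finds nothing, even though the genes correspond.
shared_tags = set(strain_a) & set(strain_b)


def percent_identity(seq1, seq2):
    """Percentage of identical positions between two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("this toy comparison needs equal-length sequences")
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return 100.0 * matches / len(seq1)


def best_match(query_seq, subjects):
    """Return (tag, percent identity) of the most similar subject sequence."""
    scored = (
        (tag, percent_identity(query_seq, seq))
        for tag, seq in subjects.items()
        if len(seq) == len(query_seq)
    )
    return max(scored, key=lambda pair: pair[1])


# Matching on sequence content recovers the corresponding genes.
pairs = {tag: best_match(seq, strain_b) for tag, seq in strain_a.items()}

print("shared locus tags:", shared_tags)
for tag, (partner, identity) in pairs.items():
    print(f"{tag} -> {partner} ({identity:.1f}% identical)")
```

The tag-based lookup comes back empty, while the sequence-based lookup pairs each gene with its counterpart and shows that one of them differs by a single base. Real genes differ in length and can contain insertions and deletions, which is exactly what this toy comparison cannot handle and why an alignment tool like BLAST is needed.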
The purpose of the BLAST algorithm is to find regions of similarity between biological sequences, and I hope to use it to work through these genomic sequences and find both the regions that match and the mutations that separate them. The following screenshot shows the tabular form of the results that the BLAST comparison produced.
Query_id and subject_id are the tags given to each protein-coding sequence in the two genomes. This makes it clear why the previous algorithm could not identify mutations: the same gene has a different sequence ID in each strain. The %identity column is the percentage of identical positions within the aligned region, while alignment_length is the length of that alignment in nucleotide positions, including any gaps. Rows with 100% identity point to coding sequences that match exactly across the aligned region, which is real progress over the previous algorithm.
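Here is a rough Python sketch of how a table like this could be read and split into exact matches and candidate mutations. It assumes the results were saved in BLAST's standard 12-column tab-separated layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), which may not be exactly how my results were exported. The sample rows, IDs, and function names are invented for illustration and are not my real data.

```python
import csv
import io

# Assumed column order: BLAST's standard 12-column tabular layout.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

# Invented sample rows standing in for a real results file.
SAMPLE = (
    "A_0001\tB_1017\t100.000\t900\t0\t0\t1\t900\t1\t900\t0.0\t1663\n"
    "A_0002\tB_1018\t99.667\t600\t2\t0\t1\t600\t1\t600\t0.0\t1098\n"
    "A_0003\tB_1019\t99.500\t400\t1\t1\t1\t399\t1\t400\t0.0\t725\n"
)


def parse_hits(handle):
    """Read tab-separated BLAST hits into a list of dictionaries."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        for key in ("length", "mismatch", "gapopen"):
            hit[key] = int(hit[key])
        hits.append(hit)
    return hits


def split_hits(hits):
    """Separate hits with no mismatches or gaps from hits that have some.

    This only describes the aligned region; it does not check that the
    alignment covers the whole gene.
    """
    identical = [h for h in hits if h["mismatch"] == 0 and h["gapopen"] == 0]
    variant = [h for h in hits if h["mismatch"] > 0 or h["gapopen"] > 0]
    return identical, variant


hits = parse_hits(io.StringIO(SAMPLE))
identical, variant = split_hits(hits)

for h in variant:
    print(h["qseqid"], "vs", h["sseqid"],
          f"{h['pident']:.2f}% identity,",
          h["mismatch"], "mismatches,", h["gapopen"], "gap openings")
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by an opened file handle. I used the mismatch and gap columns rather than %identity alone because a rounded percentage can hide a single changed base in a long gene.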
For the next blog, I will continue working with the BLAST algorithm and finalize the mutation detection. Until next time!