Filling In Lines – Part II
Note: This is the second and longer of two posts. The previous part serves as a general introduction to the content in this post. For the sake of comprehensibility, I recommend reading Part I first.
I answered the first question using BLAST, an algorithm which, among other things, can quickly search the body of available samples for those meeting some specification on their similarity to an input sequence. I used it to identify the pool of available P. vivax sequences containing a region with a high degree of similarity to the reference. Once BLAST had given me a list of similar sequences, I needed to assess which ones could be mapped to my original sequence easily enough to conclude they came from the same gene, and which had enough geographic information attached to them to be useful to my study. Narrowing this down was pretty quick, as the build of BLAST I primarily worked with made any metadata necessary for further inquiry readily available.
Now that I had my list of nucleotide sequences, I would translate them. This was necessary for two reasons. One, evolutionary selection occurs for the traits which amino acid sequences promote, not the underlying genetic information which promotes them. Two, amino acid sequences are a lot easier to work with initially: the nomenclature for identifying amino acids makes it far easier to pinpoint changes (polymorphisms, henceforth) and the process of translating a nucleotide sequence into amino acids divides the number of entries one needs to work with by three.
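To make the "divide by three" point concrete, here is a minimal sketch of translation in Python. It is not the tool I actually used; it assumes the standard genetic code, an input that is already in the correct reading frame, and a placeholder "X" for any codon it cannot read (gaps or ambiguous bases).

```python
# Standard genetic code, built compactly: bases in TCAG order,
# one amino-acid letter per codon ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(nucleotides):
    """Translate an in-frame DNA string into one-letter amino acid codes.

    Codons that are not in the table (gaps, ambiguous bases) become "X".
    Trailing bases that do not fill a whole codon are ignored.
    """
    seq = nucleotides.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


if __name__ == "__main__":
    dna = "ATGGCCAAATGA"  # made-up example, 12 bases
    protein = translate(dna)
    print(protein, len(dna), "bases ->", len(protein), "residues")
```

Twelve bases become four entries, which is exactly why the amino acid view is so much easier to scan by eye.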
Translation was more complicated than it needed to be. To accomplish the task efficiently, I needed (for pretty convoluted reasons) my whole list of nucleotide sequences to be properly aligned, both internally and with respect to each other. Getting the sequences into this format and then ensuring they would be recognized as such took way too long, but the frustration was worth it: I had my amino acid sequences.
Next, I reformatted them so that entries would only appear where there were variations from my base sequence. Then I transferred this to a very large Excel file, where I systematically eliminated all entries corresponding to positions in my table where there was no variation and thus no substantive selection. Working through the data, checking each position to ensure no relevant data was lost, probably took around three hours.
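For anyone who would rather not do this step in Excel, the filtering can be sketched in a few lines of Python. This is an illustration rather than what I ran: the function name and the toy sequences are made up, and it assumes every haplotype has already been aligned to the same length as the base sequence.

```python
def variable_positions(reference, haplotypes):
    """Return {position: {haplotype_id: residue}} for positions that vary.

    Positions are 1-based. A position appears in the result only if at
    least one haplotype differs from the reference there, and only the
    differing haplotypes are listed under it.
    """
    variants = {}
    for name, seq in haplotypes.items():
        if len(seq) != len(reference):
            raise ValueError(name + " is not aligned to the reference length")
        for pos, (ref_aa, aa) in enumerate(zip(reference, seq), start=1):
            if aa != ref_aa:
                variants.setdefault(pos, {})[name] = aa
    return dict(sorted(variants.items()))


if __name__ == "__main__":
    base = "MKTLV"  # toy base sequence
    haps = {"h1": "MKTLV", "h2": "MRTLV", "h3": "MRTLI"}  # toy haplotypes
    print(variable_positions(base, haps))
```

Everything that matches the base sequence simply drops out, leaving only the polymorphic positions to inspect.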
With all the variations put in one place, I would begin the most time-consuming part of my inquiry: geographic identification. To do this, my original set of DNA sequences was processed using a platform maintained by the University of Barcelona which identified the origin of each recorded variation. Though this accelerated my progress considerably, its output was imperfect. Essentially, it listed within a single text document the country that each amino acid sequence (called a haplotype) had been identified as coming from by the authors of the paper which had introduced it. I don’t say that it simply identified national or geographic origin per se, because these designations were left entirely to the specifications of the authors being drawn on, leaving the data set fully susceptible to inconsistencies in naming systems and even spelling mistakes (Colombia with a “u”, for example). The upshot? Relatively little could be done to automate geographic identification without introducing all kinds of potentially project-compromising error. This meant I was left having to work through the entire data set more or less by hand. I did this with what I ultimately found to be shocking accuracy, about 99%. But even small errors could not be allowed to go uncorrected. I found this out when I made the mistake of advancing to a further step in categorizing the polymorphisms before I had triple-checked some of my results.
A total of four misidentifications, among thousands of potential specifications, forced an entire day of backtracking despite extensive cataloging on my part. When I finally got this sorted out, and all of the geographic data was lined up with the modifications to my base sequence, I could get to answering question 3. This involved going through each amino acid substitution I observed and taking its frequency within a certain region weighed against its overall frequency. Performing this process, I discovered that variations on PV47 are most strongly associated with continental origin, even over national origin. This was derived from the observation of a few strongly predictive polymorphisms, the strongest of which pointed to the Americas with 99.5% certainty.
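I did not automate this step, but a cautious middle ground is to let a script flag suspicious labels for a human to look at, rather than silently correct them. The sketch below uses Python's standard difflib module; the country list, the function name, and the 0.8 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib

# Hypothetical list of accepted spellings; a real one would be much longer.
KNOWN_COUNTRIES = ["Brazil", "Colombia", "India", "Thailand"]


def check_country_label(label, known=KNOWN_COUNTRIES, cutoff=0.8):
    """Classify an author-supplied country label.

    Returns ("exact", name) for a case-insensitive match,
    ("review", name) when the label is only close to a known name
    and a person should confirm it, and ("unknown", None) otherwise.
    """
    cleaned = label.strip().lower()
    lowered = {country.lower(): country for country in known}
    if cleaned in lowered:
        return ("exact", lowered[cleaned])
    close = difflib.get_close_matches(cleaned, list(lowered), n=1, cutoff=cutoff)
    if close:
        return ("review", lowered[close[0]])
    return ("unknown", None)


if __name__ == "__main__":
    for raw in ["Colombia", "Columbia", " brazil ", "Xyz"]:
        print(repr(raw), "->", check_country_label(raw))
```

The point of the "review" category is that nothing is changed automatically: a near match like “Columbia” is surfaced for checking instead of being rewritten, which keeps the error risk where a person can see it.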
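One simple way to read "frequency within a region weighed against overall frequency" is as a share: of all the times a substitution was observed, what fraction came from each region? The sketch below computes that share in Python. It is one possible formulation, not necessarily the exact calculation I performed, and the substitution names and counts are invented for the example.

```python
from collections import Counter


def regional_share(records):
    """Compute, per substitution, the fraction of observations in each region.

    records is an iterable of (substitution, region) pairs, one pair per
    observed sequence carrying that substitution.
    Returns {substitution: {region: share}} with shares summing to 1.
    """
    totals = Counter()
    by_region = Counter()
    for substitution, region in records:
        totals[substitution] += 1
        by_region[(substitution, region)] += 1
    shares = {}
    for (substitution, region), count in by_region.items():
        shares.setdefault(substitution, {})[region] = count / totals[substitution]
    return shares


if __name__ == "__main__":
    # Invented observations, purely for illustration.
    observations = (
        [("K24R", "Americas")] * 3
        + [("K24R", "Asia")]
        + [("V88I", "Asia")] * 2
    )
    for substitution, by_region in regional_share(observations).items():
        print(substitution, by_region)
```

A substitution whose share is close to 1 for a single region is the kind of strongly predictive polymorphism described above.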
Armed with this information I could start engineering a test, which is where I’ve been essentially for the last two-and-a-half weeks. I will describe what this has meant in my next post.