Week 5: Cutting Dead Weight for New Missions
April 1, 2024
Huh? What? Who are you? Not another statistical model, right? Because if you are, I am going to lose it! Oh, it’s that time of week again: the blog. Well, I have a story to tell, but it definitely isn’t the prettiest.
The Final Struggle (Against the Australian Shark Incident Factors Data)
For the past two weeks, I’ve been attempting to crack the enigmatic relationship between a select group of factors picked from the Australian Shark Incident Database. So far, I’ve been most successful at interpreting the results from Multiple Correspondence Analysis (MCA) on the data, though that did have its own caveats. Namely, it was difficult to understand what each “component” (a percentage grouping of various factors) meant and how to draw conclusions from these components. However, I believe I’ve since figured that out, while eliminating numerous other analysis methods.
As a reminder, last week, I was tinkering with Multivariate Analysis and (K-Means) Cluster Analysis. I haven’t been able to get either working any better, and I have added two new clustering models to my accumulating graveyard: the Gaussian Mixture Model (GMM) and the Spectral Clustering model.
GMM relies on its namesake, Gaussian distributions, more commonly known as normal or bell-curve distributions. While K-Means Clustering is efficient at finding spherical clusters, GMM can find clusters of more flexible shapes (stretched or tilted ellipses, for example), assuming that the data follows a Gaussian distribution. All of this sounds extremely appealing, but when applied to my data, it runs into several problems. First, the Australian Shark Incident Database deals with categorical data, and second, even after one-hot encoding and standardizing the data, one cannot simply assume that the data follows a normal distribution. As a result, I was left with the rather unsatisfactory graph below that clearly didn’t say much.
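To make the second problem concrete, here is a minimal sketch in plain Python. The records and column names are made up for illustration and are not taken from the database, and the `one_hot` and `standardize` helpers are my own stand-ins rather than the functions I actually ran. The point is only that a one-hot column still holds just two distinct values after standardizing, which looks nothing like a bell curve.

```python
from statistics import mean, pstdev

# Hypothetical incident records; these values are illustrative only.
incidents = [
    {"location": "river", "species": "bull shark"},
    {"location": "beach", "species": "white shark"},
    {"location": "beach", "species": "tiger shark"},
    {"location": "river", "species": "bull shark"},
    {"location": "reef", "species": "white shark"},
]

def one_hot(records):
    """Turn each (column, value) pair into its own 0/1 indicator column."""
    pairs = sorted({(key, value) for record in records for key, value in record.items()})
    return {
        f"{key}={value}": [1.0 if record.get(key) == value else 0.0 for record in records]
        for key, value in pairs
    }

def standardize(values):
    """Rescale a column to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

encoded = one_hot(incidents)
standardized = {name: standardize(column) for name, column in encoded.items()}

# Count the distinct values in each standardized column.
distinct_counts = {
    name: len({round(x, 9) for x in column}) for name, column in standardized.items()
}
for name, count in distinct_counts.items():
    print(name, count)
```

Every column reports two distinct values, so standardizing only shifts and stretches the 0/1 indicator; it does not make it any more Gaussian.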
On the other hand, Spectral Clustering is intended to work with high-dimensional and complicated datasets, specializing in dimension reduction. High dimensionality was definitely a problem in my case, but this model reduces dimensions by “predicting” how specifically two factors relate to one another, which would neglect everything else I was trying to analyze. Consequently, this model was a dead end as well.
Back in Week 3, I discussed trying Multiple Factor Analysis (MFA), as it seemed to be a more accurate version of MCA: it groups the categorical values of each factor into their columns, observing the relationships between both individual values and columns instead of treating each value as its own category. I returned to this model hoping for good news, but unfortunately, this time I was stopped by a mixture of my lack of experience in the topic, a lack of documentation for the Python module, and some incompatibility in the database.
Finally, this brings me back to the beginning of this journey: MCA. If you want a more in-depth rundown of the model, refer back to Week 3’s blog, but I’ll quickly summarize it here. MCA uses the multiple categorical value types I’ve listed in the Australian Shark Incident Database to form a certain number of components (in this case an arbitrary 83 to account for all variance in the model), which are combinations of these values. Each incident factor and case then has its own percentage contribution to each component. A contribution is considered significant if it exceeds 100% divided by the number of rows or columns (depending on what is being analyzed), which is the share each would have if they all contributed equally.
Now adding on what I learnt this week, I believe I can use significant percent contributions for individual factors in components to figure out which factors tend to relate to one another. This form of analysis would already be much better than simply looking at the MCA plot of two components one by one and measuring how far points are from each other, and Component 0 is the reason I think it makes sense. This component has a 13% contribution from river locations and an 8% contribution from bull sharks (which are both significant because 100%/83 factors = ~1.2048%). This sort of relationship would make sense for a component that is attempting to correlate freshwater shark incidents, which, though uncommon, would tend to involve bull sharks, which can survive for extended periods in freshwater. Additionally, while I had believed that knowing row contributions for the dataset was rather pointless, I now see that they can be quite important for understanding which cases are influencing the percentage contributions of each factor for each component.
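As a rough sketch of that rule in Python: the river and bull shark percentages are the ones quoted above, while the two `hypothetical_factor` entries and both function names are placeholders I made up for illustration, not real database values or part of any MCA library.

```python
def significance_threshold(n_items):
    """Percent contribution each item would have if all contributed equally."""
    return 100.0 / n_items

def significant_factors(contributions, n_items):
    """Keep only the factors whose contribution exceeds the equal-share threshold."""
    threshold = significance_threshold(n_items)
    return {name: pct for name, pct in contributions.items() if pct > threshold}

# Percent contributions to Component 0.
# The first two values come from the text; the last two are hypothetical placeholders.
component_0 = {
    "location=river": 13.0,
    "species=bull shark": 8.0,
    "hypothetical_factor_a": 0.9,
    "hypothetical_factor_b": 0.4,
}

threshold = significance_threshold(83)
kept = significant_factors(component_0, 83)

print(round(threshold, 4))  # 1.2048
print(sorted(kept, key=kept.get, reverse=True))
```

Running the same filter over every component, and over rows instead of columns, would give a list of the factors (or cases) worth reading together for each one.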
One more tidbit I deciphered was that the percent variance listed for each component is the “explained variance”: how much of the total variance in the original dataset that (principal) component explains. Thus, because the earlier components have higher eigenvalues and explain more variance, they are more significant to the dataset.
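The link between eigenvalues and explained variance can be sketched in a few lines, assuming each component’s share is its eigenvalue divided by the sum of all eigenvalues. The eigenvalues below are invented for illustration and are not the ones from my MCA output.

```python
def explained_variance_percent(eigenvalues):
    """Share of total variance (in percent) attributed to each component."""
    total = sum(eigenvalues)
    return [100.0 * value / total for value in eigenvalues]

# Hypothetical eigenvalues, ordered from the first component onward.
eigenvalues = [0.50, 0.30, 0.15, 0.05]

percents = explained_variance_percent(eigenvalues)

# Running total: how much variance the first k components explain together.
cumulative = []
running = 0.0
for pct in percents:
    running += pct
    cumulative.append(running)

for index, (pct, total) in enumerate(zip(percents, cumulative)):
    print(f"Component {index}: {pct:.1f}% (cumulative {total:.1f}%)")
```

The cumulative column is a handy way to decide how many of the early components are worth interpreting before the rest stop adding much.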
These discoveries and trials probably mark the end of my saga with different methods of analyzing the Australian Shark Incident Database’s factors. I will likely be returning to MCA later on in my project to draw some further conclusions from this data and to research why some of the patterns I observe occur, but for now, I’ll be continuing a different mission I started this week.
Constructing a Timeline
One of the main goals of this Senior Project is to connect significant increases or decreases in Australian White Shark catches, a proxy for their population numbers, to relevant media of the time, whether that be newspapers, audio recordings, movies, government or organizational rhetoric, etc. This week marked the start of this attempt, and I began by referring back to the color-coded significance graphs I created and improved upon in Weeks 3 and 4. I used the noted “significant” years to pinpoint periods of time that I wanted to research further. Currently, I’ve made it up to the 1960s, combing through Trove (an archive of Australian media) for newspapers that mentioned the word “shark” and noting the general sentiment, wording, context, and other miscellaneous information about the reporting of these animals. Overall, and rather predictably, the vast majority of these articles were written around shark attacks, sightings, or catches, with only a slim minority being about regulation and “counter-information” to combat the fears people of the time already had of sharks in the ocean. However, I still have many more decades, articles, and mediums of information to go through next week, so I’ll be making a note of any differences.
That concludes this week’s blog. I did have one tiny update related to the emails I sent out last week for interviews. I heard back from one conservation group: the Shark Conservation Fund (SCF). SCF works with researchers and volunteers to fund Marine Protected and Conserved Areas and encourage policy development or reform related to conservation and fisheries. While they weren’t able to schedule an interview, they were able to answer some of my questions, and I’m extremely excited to incorporate this information into my project. Thanks for cruising by, and I hope you drift by again next week when I’m hopefully well-versed in Australian shark journalism!