Week 3: Trouble in Statistical Waters
March 18, 2024
It wouldn’t be a compelling story without a little conflict, now would it?
Hello, adventurous diver! I’m pleased that you’d like to join me on my journey through the caverns of statistics, programming, and random Internet forums. While last week I explored the wonders and horrors of shark films, this week I experienced the horrors of trying to analyze shark populations, so strap in because we’re about to be whirling through some numbers that I can only hope to understand and improve.
The Sunlight Zone: Clear Surface to Dark Depths
First, let me introduce you to the goal of the week (and, likely, of next week as well now). To delve deeper into the significance of how the media represents sharks and how that representation affects White Shark populations, I will be working on two things: first, determining which factors, among those tracked and consistently filled in by the Australian Shark Incident Database, tend to be associated with shark attacks; and second, identifying which years are notable (a large increase or decline) in the number of sharks caught by the New South Wales (NSW) and Queensland (QLD) Shark Control Programs and the Game-Fishing Club of South Australia (GFCSA), using data from “A Review of the Biology and Status of White Sharks in Australian Waters.”
With these objectives in mind, I’ll run you through my mindset going into Week 3. For anyone who has taken and remembers AP Statistics, you should be pretty familiar with z-tests and t-tests. However, for everyone else who actually could put down their calculators or forgot the course material after the AP exam, I’ll include a brief summary here. Z-tests and t-tests are statistical “tests” (shocker, I know) that measure how much an observed value deviates from what you’d expect under a normal (bell-curve) distribution, which in turn tells you how “significant” that deviation is. Z-tests are generally used for larger sample sizes (greater than 30) and when the population standard deviation is known, while t-tests are used for smaller sample sizes (fewer than 30) and when the standard deviation has to be estimated from the sample itself.
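To make that a little more concrete, here’s a minimal sketch (not part of my actual analysis) of what the two tests look like in Python with scipy, using a completely made-up sample of yearly shark counts:

```python
import numpy as np
from scipy import stats

# Hypothetical sample: sharks caught per year at one location (made-up numbers)
catches = np.array([12, 15, 9, 14, 18, 11, 16, 13, 10, 17])

# t-test: is the mean catch different from a hypothesized mean of 15?
# Used here because the sample is small and the spread is estimated from the data.
t_stat, t_pval = stats.ttest_1samp(catches, popmean=15)
print(f"t = {t_stat:.2f}, p = {t_pval:.3f}")

# z-test: same question, but pretending we somehow know the true standard deviation.
known_sigma = 3.0  # assumed (known) population standard deviation
z_stat = (catches.mean() - 15) / (known_sigma / np.sqrt(len(catches)))
z_pval = 2 * stats.norm.sf(abs(z_stat))  # two-sided p-value
print(f"z = {z_stat:.2f}, p = {z_pval:.3f}")
```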
At first glance, these tests seem perfect for the “noting significant years” part of this week, but the problem I quickly realized was that the “number of sharks caught” is just a single count per year and has (ideally) no standard deviation, unless I assumed some arbitrary error in these organizations’ counting methods. You may have also noticed that these tests have no way of accounting for a variety of factors, most of which are categorical or qualitative values, or their relationship to shark incidents. As a result, I needed to look beyond my classroom analysis techniques and do a little improvisation.
Twilight Zone: Finding the Current
Any good research requires collaboration, so, confused about where to look next, I asked some friends for guidance. After a while I ended up with some “decent” methodologies for determining which years were “significant” and, more or less, for analyzing the factors behind shark incidents (thanks, Sidhant?).
For noting crucial years, I used a rolling average of the previous five years of White Sharks caught in the observed program to determine whether a specific year deviated from the prior trend. For example, if we were looking at the 1955-1956 season in the NSW Shark Control Program, the years taken into consideration for the rolling average would be 1950-1955, giving an average of 15. Then, I created a range spanning an arbitrary 30% above and below this value, that is to say 1.3 times 15 (19.5) and 0.7 times 15 (10.5). Because that season’s catch fell within this range, I deemed the year “insignificant,” following the same general trend in sharks caught as the years before. On the other hand, if we look at 1957-1958, with 10 White Sharks caught that year, a rolling average of 17.2 for 1952-1957, and a range of 12.04-22.36, we can see that this data point shows a “significant” decline in sharks caught for that one year.
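If you’d like to see the idea in code, here’s a rough pandas sketch of the rolling-average check. The column names and numbers are placeholders rather than my actual dataset, and the minimum-window choice is just one way to handle the earliest seasons:

```python
import pandas as pd

# Placeholder data: one row per season, e.g. counts from a shark control program
df = pd.DataFrame({
    "season": ["1953-54", "1954-55", "1955-56", "1956-57", "1957-58", "1958-59"],
    "caught": [14, 16, 15, 20, 10, 12],
})

# Rolling mean of the previous five seasons (shift(1) excludes the current season);
# min_periods=3 lets early seasons use a shorter window instead of being skipped
df["rolling_avg"] = df["caught"].shift(1).rolling(window=5, min_periods=3).mean()

# Flag seasons falling outside +/-30% of that trailing average
df["significant_increase"] = df["caught"] > 1.3 * df["rolling_avg"]
df["significant_decline"] = df["caught"] < 0.7 * df["rolling_avg"]
print(df)
```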
I continued this method for the rest of NSW’s, QLD’s, and GFCSA’s data, and you can take a look at the new color-coded graphs through this link: https://docs.google.com/presentation/d/1g-7EowtLYajwYeP9DA2dLqW8S3O6qOdfgeL8F5L6qxo/edit?usp=sharing. (Significant declines are noted in yellow, and significant increases are noted in red.) This method isn’t the cleanest way of flagging significance and involves a bit of arbitrariness, and I’m still investigating better ways of analyzing this data, but the progress is definitely useful for starting to form some preliminary conclusions and for identifying time periods whose media coverage might be worth studying.
For relating various factors to shark incidents, I had more trouble. I was pointed to Correspondence Analysis, which is fundamentally a method of condensing a variety of variables into “components” and determining how each variable relates to these components and, thereby, to the other variables. The specific flavor I looked into was Multiple Correspondence Analysis (MCA), which is useful for analyzing categorical data, though it is somewhat inefficient because it treats every possible value of an identifying column as its own category through a method called “one-hot encoding.”
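To illustrate what that one-hot encoding step actually does, here’s a tiny made-up example with pandas: each categorical column gets blown out into one 0/1 indicator column per unique value, which is why the number of effective categories grows so quickly.

```python
import pandas as pd

# Made-up incident records with two categorical factors
incidents = pd.DataFrame({
    "shark_species": ["White Shark", "Tiger Shark", "White Shark"],
    "victim_activity": ["Boarding", "Swimming", "Diving"],
})

# One-hot encoding: every unique value becomes its own indicator column, e.g.
# shark_species_White Shark, shark_species_Tiger Shark, victim_activity_Boarding, ...
encoded = pd.get_dummies(incidents)
print(encoded)
```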
Now, doing my best to explain a topic I’m not fully sure of, I’ll describe my use of MCA to the best of my current understanding. To start, I created the model with another (slightly less) arbitrary 83 components. The reason I chose 83 is that this number of components accounted for all of the eigenvalues (solutions) and variance (spread) in my data. These components integrate the numerous factors I filtered down from the Australian Shark Incident Database to varying degrees: some factors contributing 0% and others up to 25%. In theory, all of my data points could be graphed in an 83-dimensional space, but that makes about as much sense to you and me as a blue whale does to a krill, so I’ve graphed “Component 0” and “Component 1” on a 2-D plane.
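For anyone curious, here is roughly what that setup looks like, stripped down to a sketch. The filename is a placeholder, and since prince’s API seems to shift between versions, treat this as one plausible way to wire it up rather than a guaranteed recipe:

```python
import pandas as pd
import prince
import matplotlib.pyplot as plt

# Placeholder: the categorical columns I filtered down from the incident database
df = pd.read_csv("shark_incidents_filtered.csv", dtype=str)  # hypothetical filename

# Fit MCA with enough components to capture all of the inertia (83 in my case)
mca = prince.MCA(n_components=83).fit(df)

# Coordinates of rows (individual incidents) and columns (factor values)
rows = mca.row_coordinates(df)     # one point per incident
cols = mca.column_coordinates(df)  # one point per category, e.g. "Clothing_Black"

# Plot Components 0 and 1 on a 2-D plane
plt.scatter(rows[0], rows[1], s=5, label="incidents (rows)")
plt.scatter(cols[0], cols[1], s=10, label="factor values (columns)")
plt.xlabel("Component 0")
plt.ylabel("Component 1")
plt.legend()
plt.show()
```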
Taking a look at the graph above, you will definitely be overwhelmed at first glance (and this is the graph zoomed in). However, all these points are showing is the “correspondence” of each factor or incident to one another: blue dots are “columns,” or the identifying factors of a shark incident, and orange dots are “rows,” or each individual incident. The first graph tells us that a lot of cases are “related” to the clothing color being black, the victim boarding, or the season being spring. Another important thing to note with these graphs is the location of the points: they all sit close to the origin at (0.00, 0.00), which means these factors are similar in that they are unrelated, or only weakly connected, to Components 0 and 1. That doesn’t mean these factors aren’t related to any other components, but relative to these two, they are (confusingly) “related by being unrelated.”
Looking at this second plot, we can see that we’re much further from the origin, at least along Component 0, but there are also far fewer points. From this distance, we can draw some better, relative conclusions: purple clothing, a shark length of 1.5 to 1.9 meters, and the blacktip reef shark are related along Component 0 and, somewhat, along Component 1. While I could walk through all of these graphs and their variations, the biggest problem I’ve run into now is interpreting and drawing meaningful conclusions from this data.
This table holds the contributions of each factor to the numerous components included in my MCA model. From what I have researched online, I believe that if a factor’s column contribution is greater than the average inertia per column (100% divided by the number of unique categories, or 100%/88 ≈ 1.14%), then that factor is a “significant” contributor to that component. I’ve made decent progress in this direction, but I think I might be at a dead end, as it’s difficult to say what a “component” really is in a practical sense and to conclude which factors are pivotal or strongly related to shark incidents as a whole.
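Here’s a small sketch of how that threshold check can be automated, assuming a contributions table (in percent) with one row per category and one column per component. The filename and variable names are placeholders; I exported my own table from the MCA results:

```python
import pandas as pd

# Hypothetical export: contribution (%) of each category to each component
contributions = pd.read_csv("mca_column_contributions.csv", index_col=0)

n_categories = len(contributions)       # 88 unique categories in my data
avg_inertia = 100.0 / n_categories      # ~1.14% average contribution per category

# Categories contributing more than average to Component 0 are flagged as "significant"
comp0 = contributions.iloc[:, 0]
significant = comp0[comp0 > avg_inertia].sort_values(ascending=False)
print(significant)
```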
A Ray of Sunlight or Bioluminescence in the Twilight Zone?
So, where do I go from here next week? Well, I’ve done some research, and a process called Multiple Factor Analysis (MFA) seems promising, or at least better suited than MCA, since it groups the different categorical values under the factors they fall under (like White Shark, Tiger Shark, and Bull Shark under Shark Name) instead of treating each of them as a separate, unrelated category. I did try working with it this week, but the Python module I’ve been using (prince) seems to have fairly outdated documentation that’s preventing any of my programs from working, so I might look into a different module if I want to continue down this Correspondence Analysis route.
Instead, I’m leaning towards a method my external advisor pointed out to me called Multivariate Analysis that seems promising, but we’ll just have to see next week in my adventures in the seas of sharks, statistics, and oh-so-much Python. I’m also planning to see if I can polish my yearly shark catch analysis method and email some researchers and/or marine conservationists to hopefully get interviews and greater insight into the media-shark nightmare around the world.
Citations
Malcolm, H., Bruce, B. D., & Stevens, J. D. (2001, September). A Review of the Biology and Status of White Sharks in Australian Waters. CSIRO Marine Research, Hobart. https://publications.csiro.au/rpr/download?pid=procite:1d0d13e5-7a60-4e65-be78-636e6f2dd22e&dsid=DS1.
Meagher, P. (2024). Australian Shark Incident Database [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10476905.