Week 10: Data Collection Wrap Up!!!
April 25, 2025
Welcome to week 10 of my blog! Not a lot has happened this week, but I’m excited to share my last couple of significant findings for my project. For the next few weeks, I will be focusing on wrapping up my findings, making my slides, preparing my presentation, and possibly working on a paper (still finalizing if I am doing this…)
Last week, I shared my findings from comparing the flu infection patterns reported by BERT to those of national flu databases. We observed that while the overall infection patterns were similar, BERT-reported cases tended to appear a couple of weeks before actual cases did. My findings essentially showed that the “cases” reported by BERT may not always indicate actual infection. Instead, they likely reflect early public concern surrounding the disease, leading users to associate their symptoms with the flu or be overly worried about their condition.
From this, I reasoned that something else must have been causing BERT-reported cases to appear earlier than actual confirmed cases. Most likely, users were encountering information online—through social media posts, news reports, conspiracy theories, or public health campaigns warning about potential outbreaks. This early exposure could make people more aware and concerned about their health, prompting them to post about their symptoms before a real surge in infections occurred.
This led me to suspect that Reddit posts reflect not only true infection rates but also broader public concern about the outbreak. To test this idea, I compared the timing of BERT’s influenza reports to the timing of search peaks for flu-related keywords on Google Trends. Since the internet, and Google in particular, is widely accessible and used freely by people across the country and around the world, Google Trends offers a strong indicator of public concern during major events and outbreaks. I selected the keywords “flu”, “cough”, and “influenza” as they were the most commonly searched keywords related to the flu and showed clear seasonal peaks. I hypothesized that if Reddit posts capture both confirmed infections and general concern, then BERT-reported cases and flu-related Google searches would peak around the same time.
Important note! Although Google Trends is great for gauging public concern surrounding an outbreak, it is not as good for case tracking. Around 2008, a project called Google Flu Trends tried to use patterns in flu-related searches to track and predict actual flu cases. However, looking up information about the flu is so easy that even people who weren’t sick were conducting searches, and Google Flu Trends wildly overestimated the actual number of flu cases.
Google Trends plots keyword popularity on a normalized score from 0 to 100, where 100 marks the keyword’s peak popularity within the specified time period. To compare BERT reports against Google Trends, I converted BERT’s reported proportions to the same 0 to 100 scale.
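For anyone curious what this kind of rescaling can look like, here is a minimal sketch in Python. It assumes the conversion divides every value by the series peak so that the peak becomes 100, which mirrors how Google Trends scores are scaled. That is an assumption, not necessarily the exact conversion behind the graph, and the function name and the sample numbers are made up for illustration.

```python
def rescale_to_100(values):
    """Scale a series so that its largest value becomes 100.

    Assumption: peak-based scaling, similar to Google Trends scores.
    A min-max rescaling would be another reasonable choice.
    """
    peak = max(values)
    if peak <= 0:
        raise ValueError("series needs at least one positive value")
    return [100 * v / peak for v in values]


# Hypothetical weekly proportions of posts flagged as flu reports (not real data).
bert_proportions = [0.02, 0.05, 0.10, 0.04]
bert_scaled = rescale_to_100(bert_proportions)
```

After this step the largest weekly value is 100 and every other week is expressed relative to that peak, so the series can share an axis with Google Trends scores.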
(Since the graph is still blurry from my end when I upload: Red = BERT, Blue = “flu”, Green = “cough”, Purple = “influenza”)
We see here that the peak for BERT reported cases and the keywords “influenza” and “flu” are pretty close in time. The peak for “cough” precedes the other 3 peaks by a bit and has slightly different trends overall, but this is likely due to many other non-seasonal diseases having cough as a symptom.
When I conducted a time-adjusted Pearson correlation test, I found that BERT reports correlated strongly with Google Trends searches for “flu” and “influenza,” with coefficients of 0.976 (lag = -1) and 0.970 (lag = -2) respectively. Because the correlation is much stronger and the time difference between the trends is smaller, the data suggest that BERT reports match patterns on Google Trends far better than patterns from true case reports. This supports my hypothesis that Reddit posts reflect both public concern and potential true infection, making them a valuable tool for predicting influenza cases and gauging interest in and preparedness for outbreaks.
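A time-adjusted (lagged) Pearson correlation can be sketched as follows: shift one series against the other by a few time steps, compute the ordinary Pearson coefficient on the overlapping part, and keep the shift that gives the highest value. The code below is only an illustration under stated assumptions. The function names, the sign convention for the lag, and the toy data are all hypothetical, and the convention may differ from the one behind the lag values reported above.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equal-length, non-constant lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def lagged_pearson(x, y, lag):
    """Correlate x[t] with y[t + lag] over the overlapping time steps.

    Assumed convention: a positive lag means y follows x by `lag` steps.
    Both series must have the same length.
    """
    if lag >= 0:
        xs, ys = x[:len(x) - lag], y[lag:]
    else:
        xs, ys = x[-lag:], y[:len(y) + lag]
    return pearson(xs, ys)


def best_lag(x, y, max_lag):
    """Return the lag in [-max_lag, max_lag] with the highest correlation."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda k: lagged_pearson(x, y, k))


# Toy data (not real): series_b is series_a delayed by one time step.
series_a = [1, 3, 7, 3, 1, 0, 0]
series_b = [0, 1, 3, 7, 3, 1, 0]
shift = best_lag(series_a, series_b, max_lag=2)
```

One caveat with this approach: each shift shortens the overlapping window, so with short series it makes sense to keep the range of lags small.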
That being said, I want to emphasize that my Reddit sample is not perfectly representative of the U.S. population, let alone the entire world. A significantly larger sample, taken from more subreddits, would be necessary for a perfectly representative study. Nonetheless, given my available resources, my findings may still have important implications, demonstrating that Reddit can serve as a powerful informal data source with (1) strong predictive power for real influenza infection trends, especially in the U.S. and Canada, and (2) the ability to capture patterns of public concern that can help assess public health preparedness and interest.
Ideally, the next step would be to determine whether the public concern displayed on Reddit is at a “normal” level or is exacerbated by extreme media coverage, so that strategies to halt panic and misinformation can be developed. However, there is currently no established benchmark for how many weeks in advance concern can appear before actual cases without being considered irrational. This benchmark can only be determined after years of data collection and analysis, while accounting for the influence of every public health campaign and piece of news coverage put out. As of now, I cannot complete this step with limited historical data and resources. However, this would be an insightful extension to my project in the future.
As of now, I don’t have any more big findings to share with you all! It has been awesome working on this project over the last few months. I will be working on my slides and paper, and will continue to share progress on those.
Thanks for checking in this week and see you soon!