Week 5: Standard Model Solved
April 10, 2023
This week I got to work tackling the issues in my code. With the help of my external advisor, I was able to fix multiple bugs in my program. I also realized that I had many misconceptions about how the standard model worked, which immediately sent me into more research. By digging through multiple research papers and the PyTorch documentation, I learned more about working with Torchtext. The Torchtext documentation was a lifesaver, as I was able to work out many of my issues just by reading through it.
The link can be found here: https://pytorch.org/text/stable/index.html
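To give a sense of what I was working through, here is a minimal sketch of the Torchtext pieces I ended up leaning on: a basic tokenizer plus a vocabulary built from an iterator. The sample headlines are made up for illustration, and this is not my full data pipeline.

```python
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Hypothetical sample headlines standing in for my dataset.
headlines = [
    "BREAKING: Video shows supporter in stunning rant",
    "China's exports expected to rise, officials say",
]

# Basic English tokenizer bundled with torchtext.
tokenizer = get_tokenizer("basic_english")

def yield_tokens(texts):
    for text in texts:
        yield tokenizer(text)

# Build a vocabulary from the tokenized headlines, reserving <unk> and <pad>.
vocab = build_vocab_from_iterator(yield_tokens(headlines), specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])  # map unseen words to <unk>

# Convert a headline into a list of token indices for the model.
print(vocab(tokenizer("BREAKING video from a supporter")))
```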
After successfully fixing the model and running it, I was able to plot its progress using Weights & Biases (wandb), an experiment-tracking tool, graphing the F1 scores on my true and fake test data.
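For context, logging metrics to wandb boils down to a few calls. The project name, run name, and F1 values below are placeholders rather than my actual run, and I use offline mode here so the sketch runs without an account.

```python
import wandb

# Hypothetical F1 values per evaluation step, just for illustration.
evaluation_results = [(0.62, 0.65), (0.70, 0.72), (0.74, 0.78)]

# Placeholder project/run names; offline mode avoids needing a wandb login.
wandb.init(project="fake-news-detection", name="standard-model", mode="offline")

for step, (true_f1, fake_f1) in enumerate(evaluation_results):
    # Each call adds one point to the line charts in the wandb dashboard.
    wandb.log({"true_test_f1": true_f1, "fake_test_f1": fake_f1}, step=step)

wandb.finish()
```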
Results:
Step is the number of training iterations the model has completed. As seen above, the F1 score on the fake test data was quite good, staying in the 0.6 to 0.8 range. Aside from the dip between roughly steps 300 and 600, the model seemed to do very well.
Once again, to track the progress of the model, I used the F1 score. The F1 score balances precision and recall and is calculated as follows:
F1 = True Positives / (True Positives + ½ (False Positives + False Negatives))
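As a quick sanity check of the formula, here is a small worked example with made-up confusion counts, confirming that it matches the harmonic mean of precision and recall.

```python
def f1_from_counts(tp, fp, fn):
    # F1 = TP / (TP + 0.5 * (FP + FN)), equivalent to the harmonic mean
    # of precision and recall.
    return tp / (tp + 0.5 * (fp + fn))

# Hypothetical confusion counts, just to show the arithmetic.
tp, fp, fn = 80, 15, 25
precision = tp / (tp + fp)   # 80 / 95  ≈ 0.842
recall = tp / (tp + fn)      # 80 / 105 ≈ 0.762
harmonic_mean = 2 * precision * recall / (precision + recall)

print(f1_from_counts(tp, fp, fn))  # 0.8  (80 / (80 + 0.5 * 40))
print(harmonic_mean)               # 0.8, matching the formula above
```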
Data Modeling
In addition, I began running some incidence reports on my data to get a better understanding of how I want the custom model to operate. To do so, I wrote a program to track the incidence of keywords; a sketch of the approach and the results I got are below.
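The counting program itself is straightforward. Here is a rough sketch of the idea using a couple of made-up titles; the punctuation handling and the floor of 1 in the denominator are assumptions for illustration, not necessarily exactly what my script does.

```python
from collections import Counter
import re

# Hypothetical mini-dataset; the real data is the ~38,000-article fake/true news set.
fake_titles = [
    "BREAKING: Supporter Goes On Shocking Rant (VIDEO)",
    "WATCH: Cops Give Stunning Response To Video",
]
true_titles = [
    "China's exports rise as Turkish officials urge overhaul",
]

def word_counts(titles):
    """Count how often each word appears across a list of titles."""
    counts = Counter()
    for title in titles:
        # Drop apostrophes and lowercase before tokenizing, so "China's"
        # becomes "chinas" (an assumption about the preprocessing).
        counts.update(re.findall(r"[a-z]+", title.lower().replace("'", "")))
    return counts

fake_counts = word_counts(fake_titles)
true_counts = word_counts(true_titles)

# Fake-to-true fold ratio; flooring the denominator at 1 (so words absent from
# true news still get a ratio) is my assumption for this sketch.
ratios = {w: fake_counts[w] / max(true_counts[w], 1) for w in fake_counts}

# Words with the largest fake-to-true ratio, analogous to Table 1 below.
for word, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(word, fake_counts[word], true_counts[word], ratio)
```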
The incidence of keywords in fake news titles is up to roughly 270-fold higher than in true news. The study dataset included 38,729 political and world news articles, 46.23% of which were identified as fake news. Fake news headlines contained specific keywords at a significantly higher incidence than true news headlines (Table 1). For example, the word ‘Supporter’ appeared 272 times more often in fake news than in true news; conversely, the word ‘China’s’ appeared 149 times more often in true news than in fake news (Table 2). However, calculating simple incidence rates for words ignores the contextual information within each sentence, which is important for accurately distinguishing between fake and true news. Even so, the higher incidence of these keywords in fake news should help enable detection regardless of whether a model like LSTM or RoBERTa contextualizes the words within each sentence.
Word | Fake Incidence | True Incidence | Difference | Fake To True Ratio |
Supporter | 272 | 1 | 271 | 272.0 |
Video | 8299 | 31 | 8268 | 267.7 |
Sarah | 137 | 1 | 136 | 137.0 |
Rips | 128 | 1 | 127 | 128.0 |
Rant | 120 | 1 | 119 | 120.0 |
Cops | 239 | 2 | 237 | 119.5 |
Stunning | 116 | 1 | 115 | 116.0 |
Perfect | 115 | 1 | 114 | 115.0 |
Shocking | 223 | 2 | 221 | 111.5 |
Breaking | 880 | 8 | 872 | 110.0 |
Table 1. Top 10 words with the largest fold difference between fake and true news.
In an analysis of 38,720 news articles, the incidence of words present in fake and true news was compared and the fold difference was calculated. The top 10 words which have the highest fold difference between fake and true news are displayed.
Word | Fake Incidence | True Incidence | Difference | True To Fake Ratio |
Chinas | 1 | 149 | -148 | 149 |
Urge | 1 | 91 | -90 | 91 |
Turkish | 2 | 155 | -153 | 77.5 |
Reutersipsos | 1 | 76 | -75 | 76 |
Expects | 1 | 74 | -73 | 74 |
Bangladesh | 1 | 72 | -71 | 72 |
Philippine | 1 | 70 | -69 | 70 |
Overhaul | 1 | 69 | -68 | 69 |
Japans | 1 | 66 | -65 | 66 |
Abe | 1 | 65 | -64 | 65 |
Table 2. Top 10 words with the largest fold difference between true and fake news.
In an analysis of 38,720 news articles, the incidence of words present in true and fake news was compared and the fold difference was calculated. The top 10 words which have the highest fold difference between true and fake news are displayed.
More incidence reports organized in Google Sheets can be found at this link: https://docs.google.com/spreadsheets/d/1VSbs5e6lqJt1ezjP_YyGifNIaqygWeaV8P49M51ySr4/edit?usp=sharing
Lastly, I have continued the Swift online course on Codecademy and have been making a lot of progress. Overall, I am glad I was able to handle this week's setbacks well!