Week 5: Standard Model Solved
April 10, 2023
This week I got to work tackling the issues in my code. With the help of my external advisor, I was able to fix multiple bugs in my program. I also realized that I had many misconceptions about how the standard model worked, which immediately sent me into more research. By digging through multiple research papers and the PyTorch documentation, I learned more about working with Torchtext. The Torchtext documentation was a lifesaver, as I was able to work out many of my issues just by reading through it.
The link can be found here: https://pytorch.org/text/stable/index.html
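To give a sense of what I was working through, here is a minimal sketch of the Torchtext pieces I ended up leaning on: a basic tokenizer plus a vocabulary built from an iterator. The sample headlines are made up for illustration, and this is not my full data pipeline.

```python
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Hypothetical sample headlines standing in for my dataset.
headlines = [
    "BREAKING: Video shows supporter in stunning rant",
    "China's exports expected to rise, officials say",
]

# Basic English tokenizer bundled with torchtext.
tokenizer = get_tokenizer("basic_english")

def yield_tokens(texts):
    for text in texts:
        yield tokenizer(text)

# Build a vocabulary from the tokenized headlines, reserving <unk> and <pad>.
vocab = build_vocab_from_iterator(yield_tokens(headlines), specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])  # map unseen words to <unk>

# Convert a headline into a list of token indices for the model.
print(vocab(tokenizer("BREAKING video from a supporter")))
```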
After successfully fixing the model and running it, I was able to plot its progress using Weights & Biases (wandb), an experiment-tracking tool, graphing the F1 scores on my true and fake test data.
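For context, logging metrics to wandb boils down to a few calls. The project name, run name, and F1 values below are placeholders rather than my actual run, and I use offline mode here so the sketch runs without an account.

```python
import wandb

# Hypothetical F1 values per evaluation step, just for illustration.
evaluation_results = [(0.62, 0.65), (0.70, 0.72), (0.74, 0.78)]

# Placeholder project/run names; offline mode avoids needing a wandb login.
wandb.init(project="fake-news-detection", name="standard-model", mode="offline")

for step, (true_f1, fake_f1) in enumerate(evaluation_results):
    # Each call adds one point to the line charts in the wandb dashboard.
    wandb.log({"true_test_f1": true_f1, "fake_test_f1": fake_f1}, step=step)

wandb.finish()
```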
Results:
Step is the number of training iterations the model has completed. As seen above, the F1 score on the fake test data was quite good, staying in the 0.6 to 0.8 range. Aside from the dip between roughly steps 300 and 600, the model seemed to do very well.
Once again, to track the progress of the model, I used the F1 score. The F1 score balances precision and recall and is calculated as follows:
F1 = True Positives / (True Positives + ½ (False Positives + False Negatives))
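As a quick sanity check of the formula, here is a small worked example with made-up confusion counts, confirming that it matches the harmonic mean of precision and recall.

```python
def f1_from_counts(tp, fp, fn):
    # F1 = TP / (TP + 0.5 * (FP + FN)), equivalent to the harmonic mean
    # of precision and recall.
    return tp / (tp + 0.5 * (fp + fn))

# Hypothetical confusion counts, just to show the arithmetic.
tp, fp, fn = 80, 15, 25
precision = tp / (tp + fp)   # 80 / 95  ≈ 0.842
recall = tp / (tp + fn)      # 80 / 105 ≈ 0.762
harmonic_mean = 2 * precision * recall / (precision + recall)

print(f1_from_counts(tp, fp, fn))  # 0.8  (80 / (80 + 0.5 * 40))
print(harmonic_mean)               # 0.8, matching the formula above
```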
Data Modeling
In addition, I began running some incidence reports on my data to get a better understanding of how I want the custom model to operate. To do so, I wrote a program to track the incidence of keywords; a sketch of the approach and the results I got are below.
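The counting program itself is straightforward. Here is a rough sketch of the idea using a couple of made-up titles; the punctuation handling and the floor of 1 in the denominator are assumptions for illustration, not necessarily exactly what my script does.

```python
from collections import Counter
import re

# Hypothetical mini-dataset; the real data is the ~38,000-article fake/true news set.
fake_titles = [
    "BREAKING: Supporter Goes On Shocking Rant (VIDEO)",
    "WATCH: Cops Give Stunning Response To Video",
]
true_titles = [
    "China's exports rise as Turkish officials urge overhaul",
]

def word_counts(titles):
    """Count how often each word appears across a list of titles."""
    counts = Counter()
    for title in titles:
        # Drop apostrophes and lowercase before tokenizing, so "China's"
        # becomes "chinas" (an assumption about the preprocessing).
        counts.update(re.findall(r"[a-z]+", title.lower().replace("'", "")))
    return counts

fake_counts = word_counts(fake_titles)
true_counts = word_counts(true_titles)

# Fake-to-true fold ratio; flooring the denominator at 1 (so words absent from
# true news still get a ratio) is my assumption for this sketch.
ratios = {w: fake_counts[w] / max(true_counts[w], 1) for w in fake_counts}

# Words with the largest fake-to-true ratio, analogous to Table 1 below.
for word, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(word, fake_counts[word], true_counts[word], ratio)
```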
The incidence of keywords in fake news titles is up to roughly 270-fold higher than in true news. The study dataset included 38,729 political and world news articles, 46.23% of which were identified as fake news. Fake news headlines contained specific keywords at a significantly higher incidence than true news headlines (Table 1). For example, the word ‘Supporter’ appeared 272 times more often in fake news than in true news; conversely, the word ‘China’s’ appeared 149 times more often in true news than in fake news (Table 2). However, calculating simple incidence rates for words ignores the contextual information within each sentence, which is important for accurately distinguishing between fake and true news. Even so, the higher incidence of these keywords in fake news should help enable detection regardless of whether a model like LSTM or RoBERTa contextualizes the words within each sentence.
Word | Fake Incidence | True Incidence | Difference | Fake To True Ratio |
Supporter | 272 | 1 | 271 | 272.0 |
Video | 8299 | 31 | 8268 | 267.7 |
Sarah | 137 | 1 | 136 | 137.0 |
Rips | 128 | 1 | 127 | 128.0 |
Rant | 120 | 1 | 119 | 120.0 |
Cops | 239 | 2 | 237 | 119.5 |
Stunning | 116 | 1 | 115 | 116.0 |
Perfect | 115 | 1 | 114 | 115.0 |
Shocking | 223 | 2 | 221 | 111.5 |
Breaking | 880 | 8 | 872 | 110.0 |
Table 1. Top 10 words with the largest fold difference between fake and true news.
In an analysis of 38,720 news articles, the incidence of words present in fake and true news was compared and the fold difference was calculated. The top 10 words which have the highest fold difference between fake and true news are displayed.
Word | Fake Incidence | True Incidence | Difference | True To Fake Ratio |
Chinas | 1 | 149 | -148 | 149 |
Urge | 1 | 91 | -90 | 91 |
Turkish | 2 | 155 | -153 | 77.5 |
Reutersipsos | 1 | 76 | -75 | 76 |
Expects | 1 | 74 | -73 | 74 |
Bangladesh | 1 | 72 | -71 | 72 |
Philippine | 1 | 70 | -69 | 70 |
Overhaul | 1 | 69 | -68 | 69 |
Japans | 1 | 66 | -65 | 66 |
Abe | 1 | 65 | -64 | 65 |
Table 2. Top 10 words with the largest fold difference between true and fake news.
In an analysis of 38,720 news articles, the incidence of words present in true and fake news was compared and the fold difference was calculated. The top 10 words which have the highest fold difference between true and fake news are displayed.
More incidence reports organized in Google Sheets can be found at this link: https://docs.google.com/spreadsheets/d/1VSbs5e6lqJt1ezjP_YyGifNIaqygWeaV8P49M51ySr4/edit?usp=sharing
Lastly, I have continued the Swift online course on Codecademy and have been making a lot of progress. Overall, I am glad I was able to handle this week's setbacks well!