Week 1: Introduction
March 10, 2023
Introduction
Fake news has pervaded almost every corner of the internet, from fabricated articles on the web to falsified information on social media. Fake news is broadly defined as inaccurate or misleading content posing as legitimate news, and it is often pervasive and inflammatory in nature. Groups such as QAnon and InfoWars have spread hatred and fabricated propaganda aimed at specific communities, leaving readers unable to distinguish real news from fake news. The consequences can be serious: loss of confidence in the healthcare system, for example, can result in patients not seeking the treatment they need. Flagging fake news can help prevent the spread of misinformation, but the scale is daunting. According to a 2019 study by Loughborough University’s Online Civic Culture Centre, 42.8% of news shared online was inaccurate or false. Meanwhile, the estimated total U.S. daily newspaper circulation, including digital, was 24.3 million on weekdays and 25.8 million on Sundays in 2020. With this much information communicated every day, it is infeasible for humans to vet it all manually.
The immense volume of false information is a problem that machine learning can help mitigate. If algorithms could check the validity of articles on the internet and flag false content, the risk of deceptive information influencing users would drop significantly. I am eager to be part of the movement to make the internet a safer place for future generations, as well as part of the technological revolution in the field of Artificial Intelligence.
I am interested in Machine Learning, and I am particularly captivated by Natural Language Processing. After reading about controversies rooted in misinformation and the misuse of data, such as QAnon, Cambridge Analytica, and the Facebook privacy litigation, I became very interested in NLP. I have also personally encountered fake news and false advertisements on social media and elsewhere on the internet.
Thus, in my project, I attempt to leverage Machine Learning to classify an article’s headline text as real or fake news. With the help of Dr. Parsa Akbari of Cambridge University and under the guidance of Mrs. Flood, I will build and track two models, RoBERTa and a bidirectional LSTM-RNN, comparing their respective F1 scores to determine which model is better suited to this NLP task. Ultimately, such a model could be deployed on search engines such as Google to flag websites that contain misinformation for later verification by an expert, in hopes of ensuring the validity of online content.
For Week 1, I have been doing research on various techniques already used in Natural Language Processing. So far, I have found five research papers that stood out to me.
Their direct links are listed here:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0250419
https://arxiv.org/pdf/1705.01613.pdf
https://arxiv.org/pdf/2102.04458.pdf
https://www.researchgate.net/publication/363661691_Fake_news_detection_on_Twitter
These papers have given me a good understanding of what has already been done in the NLP field. I learned that many RoBERTa (1) models have been built for similar NLP tasks. This is the classic approach, which I have become familiar with by reading research papers such as this one: https://www.scitepress.org/Papers/2022/108739/108739.pdf
However, I want to explore additional NLP techniques in an effort to increase efficiency. I will therefore construct the classic RoBERTa model and track its performance using the F1 score (2). Next, I will create a custom LSTM (3) model from scratch; because I intend to make it a bidirectional RNN-style LSTM, I believe it may be better suited to text classification. I will also experiment with different tokenization (4) strategies and examine incidence reports.
(1)
Bidirectional Encoder Representations from Transformers (BERT) is designed to help computers use surrounding text to establish context for ambiguous language. BERT is built on the Transformer, a deep learning architecture in which every output element is connected to every input element and the weightings between them are calculated dynamically based on their connection. BERT reads text bidirectionally and is pretrained with two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). MLM hides a word in a sentence and makes the model predict the masked word from its context; NSP makes the model predict whether two sentences logically follow one another or are unrelated.
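To make MLM concrete, here is a minimal sketch using the Hugging Face transformers fill-mask pipeline with the public bert-base-uncased checkpoint (not the model I will train); the example sentence is made up.

```python
# Minimal sketch of Masked Language Modeling: BERT predicts the [MASK]
# token from the surrounding context (requires the `transformers` package).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The senator denied the [MASK] report."):
    print(prediction["token_str"], round(prediction["score"], 3))
```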
RoBERTa is an extension of BERT. It builds on the same architecture but modifies the hyperparameters, uses a different tokenizer, and uses a different pretraining scheme. RoBERTa removes the next-sentence prediction objective and trains with much larger mini-batches and learning rates. Building on BERT’s MLM, RoBERTa uses dynamic masking, so the masked positions change over the course of training. RoBERTa demonstrates how much further self-supervised pretraining can be pushed beyond BERT’s results: it significantly improves performance on various NLP tasks with less reliance on data labeling.
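As a sketch of what constructing the RoBERTa classifier might look like, the snippet below loads a pretrained roberta-base checkpoint with a two-class classification head via Hugging Face transformers; the headline and the real/fake label convention are hypothetical, and the head is untrained until fine-tuning.

```python
# Sketch: pretrained RoBERTa with a binary classification head (real vs. fake).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2  # hypothetical convention: 0 = real, 1 = fake
)

inputs = tokenizer("Scientists announce miracle cure for every disease", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # meaningless until the head is fine-tuned on labeled headlines
```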
(2)
To track the progress of my models, I will be using the F1 score. The F1 score balances precision and recall (it is their harmonic mean), and it is calculated as:
F1 = TP / (TP + ½ (FP + FN)), where TP, FP, and FN are the counts of true positives, false positives, and false negatives.
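As a quick sanity check of the formula, the snippet below computes the F1 score with scikit-learn on a handful of made-up binary labels (1 = fake, 0 = real).

```python
# F1 = TP / (TP + 0.5 * (FP + FN)); here TP = 3, FP = 1, FN = 1, so F1 = 0.75.
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up model predictions

print(f1_score(y_true, y_pred))  # 0.75
```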
(3)
Recurrent Neural Networks (RNNs) connect previous information to the present task through loops in the network. LSTMs are a special type of RNN capable of learning long-term dependencies, something standard RNNs often fail to do in practice. Instead of a single repeating layer, an LSTM cell contains several interacting layers: the cell state carries information forward largely unchanged, sigmoid-controlled gates decide what to forget and what to add, and a tanh layer shapes the cell state before it is multiplied by the output gate, removing unimportant information. The result is that relevant information can be retained over long spans of text, avoiding the long-term dependency problem.
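For reference, a bidirectional LSTM classifier of the kind I plan to build could be sketched in Keras roughly as follows; the vocabulary size, sequence length, and layer widths are placeholder values, not my final architecture.

```python
# Sketch of a bidirectional LSTM headline classifier in Keras
# (placeholder hyperparameters, not tuned values).
import tensorflow as tf

VOCAB_SIZE = 20000  # assumed vocabulary size after tokenization
EMBED_DIM = 128

model = tf.keras.Sequential([
    tf.keras.Input(shape=(40,)),  # assumed max headline length of 40 token ids
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # reads the sequence in both directions
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability that the headline is fake
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```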
(4)
Tokenization is the process of splitting a character sequence into pieces called tokens, often discarding certain characters such as punctuation along the way. Its purpose is to turn raw text into units that mathematical models, such as neural network architectures like long short-term memory networks, can analyze.
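To illustrate, the snippet below contrasts two tokenization strategies on a made-up headline: naive word-level splitting versus RoBERTa’s byte-pair-encoding subword tokenizer from Hugging Face transformers.

```python
# Two tokenization strategies applied to the same headline.
import re
from transformers import AutoTokenizer

headline = "Breaking: officials confirm the viral report was fabricated!"

# Naive word-level tokenization: lowercase, drop punctuation, split on whitespace
word_tokens = re.findall(r"[a-z']+", headline.lower())
print(word_tokens)

# Subword (byte-pair encoding) tokenization as used by RoBERTa
bpe_tokens = AutoTokenizer.from_pretrained("roberta-base").tokenize(headline)
print(bpe_tokens)
```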