Week 3: Aggregating A Dataset
Hello, and welcome back to my blog!
This week, I looked for a dataset of news article headlines labeled “true” or “false”. I searched two of the best-known dataset repositories on the internet: Kaggle and GitHub. After combing through both sources, I found several great datasets.
Ultimately, I chose Kaggle because the user experience was easier and I got comfortable with it quickly. Also, the GitHub links I found led to repositories in which the creator had already built a model on the dataset. Since I wanted to try something new, I didn’t want to be influenced by those pre-built models, which is another reason I went with Kaggle over GitHub.
After narrowing down the candidates on Kaggle, I had to choose between the 4th and 6th links. Both were very good datasets, but I chose the 6th because it included both article headlines AND article text. If, later in the project, I decide to expand my model to read more than just the headlines, I will be able to use the same dataset. So I stuck with the 6th link, “Fake and real news dataset”, which seemed best suited for my project.
In this dataset, the real articles were compiled from Reuters, a credible news agency, and the fake articles were collected from unreliable websites flagged by PolitiFact, a fact-checking organization. The dataset contains articles on a variety of topics, but the focus is mostly on political and world news from 2016-2017, with 38,729 articles in total. Each article has a title, text, type, and published date.
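To give a sense of how a dataset like this can be turned into a single labeled table, here is a minimal sketch in Python with pandas. The real dataset ships as CSV files, which I stand in for here with tiny in-memory snippets; the exact file names, column values, and the `articles` variable are my own illustration, not the actual code from my project.

```python
import io
import pandas as pd

# Tiny stand-ins for the dataset's fake-news and real-news CSV files;
# each article has a title, text, type, and published date.
fake_csv = io.StringIO(
    "title,text,type,date\n"
    "Shocking claim goes viral,Some body text,News,2017-01-01\n"
)
real_csv = io.StringIO(
    "title,text,type,date\n"
    "Senate passes budget bill,Some body text,politicsNews,2017-01-02\n"
)

fake = pd.read_csv(fake_csv)
real = pd.read_csv(real_csv)

# Label each source (0 = fake, 1 = real) and stack them into one table
fake["label"] = 0
real["label"] = 1
articles = pd.concat([fake, real], ignore_index=True)
```

With the two sources stacked and labeled, the model can train on one table instead of juggling separate fake and real files.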
To preprocess the dataset, I filtered each record down to the article title, removed all capitalization and non-alphanumeric characters, and tokenized each word to an integer before training the model. As I stated in the first blog post, tokenization is the process of splitting character sequences into pieces, called tokens, often discarding certain characters such as punctuation along the way. The purpose of tokenization is to let mathematical models, such as neural network architectures like long short-term memory networks, analyze sentences of text.
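The preprocessing steps above can be sketched in plain Python. This is only an illustration of the idea, not my actual pipeline, and the helper names (`preprocess`, `build_vocab`, `encode`) are my own:

```python
import re

def preprocess(title):
    # Lowercase the title and drop non-alphanumeric characters (keeping spaces)
    cleaned = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return cleaned.split()

def build_vocab(token_lists):
    # Map each unique word to an integer id, starting at 1 (0 reserved for padding)
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab

def encode(tokens, vocab):
    # Replace each word with its integer id, skipping out-of-vocabulary words
    return [vocab[t] for t in tokens if t in vocab]

titles = ["Breaking: Markets Rally!", "Markets fall amid breaking news"]
token_lists = [preprocess(t) for t in titles]
vocab = build_vocab(token_lists)
encoded = [encode(toks, vocab) for toks in token_lists]
# "Breaking: Markets Rally!" -> ["breaking", "markets", "rally"] -> [1, 2, 3]
```

Libraries like Keras provide a ready-made tokenizer that does essentially this, but seeing the steps spelled out makes it clear what the model actually receives: lists of integers, not words.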
Thankfully, I did not have to comb through the internet for more data, because the dataset I found was robust and easy to use. I didn’t have to do much sorting or cleaning since it was well-organized, and because it already contained a vast amount of data, I didn’t need to merge it with an additional dataset, which would have been extremely cumbersome and tricky. Since this is my first NLP project, I sincerely hope this dataset will work well with the model I am about to create!