Week 7: Finished Model
April 26, 2023
This week, I created my custom model. In this blog entry, I will detail the different code segments of my model.
Firstly, I had to create a tokenizer. As a refresher, tokenization is the process of splitting character sequences into pieces, called tokens, often discarding certain characters such as punctuation. The purpose of tokenization is to turn raw text into a form that mathematical models, such as neural network architectures like long short-term memory (LSTM) networks, can analyze.
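As a quick illustration of what a tokenizer produces, here is a tiny example using TorchText's built-in basic_english tokenizer; the sample sentence is made up and the output shown in the comment is only approximate:

```python
from torchtext.data.utils import get_tokenizer

# "basic_english" lowercases the text and splits on whitespace and punctuation.
tokenizer = get_tokenizer("basic_english")

print(tokenizer("Fake news spreads fast, doesn't it?"))
# roughly: ['fake', 'news', 'spreads', 'fast', ',', 'doesn', "'", 't', 'it', '?']
```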
I had already done some preliminary tokenization steps, such as cleaning the text, but now I was able to apply the torchtext.datasets pipelines to the AG_NEWS dataset. The AG_NEWS dataset can be found at https://paperswithcode.com/dataset/ag-news, and the documentation for the TorchText code and library can be found at https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html.
The first code file I will explain is classify_english_tokenizer.py. This is the file where I tested the tokenizer. Following the TorchText documentation linked above, I explored the text and label pipelines and built the model's vocabulary. This was done on the AG_NEWS dataset, which was chosen purely for testing purposes. Once I understood how this worked, I ran it and visualized the results on Weights & Biases (wandb).
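My actual file isn't reproduced here, but the vocabulary-building step follows the TorchText tutorial linked above quite closely; a minimal sketch of that pattern (with the pipeline names taken from the tutorial, not from my own code) looks like this:

```python
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import AG_NEWS

tokenizer = get_tokenizer("basic_english")
train_iter = AG_NEWS(split="train")

def yield_tokens(data_iter):
    # Yield the token list for every article in the dataset.
    for _, text in data_iter:
        yield tokenizer(text)

# Build the vocabulary from the training split, reserving an index for unknown words.
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

# Pipelines that turn raw text into token indices and labels into 0-based integers.
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1
```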
As seen in the image, the training cost for this model stagnated. The model was simply being run on the AG_NEWS dataset, and it ultimately gave a high but flat result. This would change once it was run on the real dataset of fake and true news.
Main Model
My main model was built in several steps. Firstly, I had to build a tokenizer; this is what I explained in the previous part of the blog, along with the code that ran the tokenization sequence.
Next, I had to load the tokenized text (not the raw text) in batches of a fixed batch size. Afterward, I moved on to creating the network itself: the class TweetClassifier. A sketch of this part of the code is shown below.
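The exact layer sizes, field names, and loader setup below are placeholders rather than my real values, but the sketch shows the general shape of batching already-tokenized text and of an embedding-plus-LSTM classifier like TweetClassifier:

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

BATCH_SIZE = 64  # placeholder batch size

def collate_batch(batch):
    # Each item is (label, token_id_list); pad the sequences in a batch to the same length.
    labels = torch.tensor([label for label, _ in batch], dtype=torch.long)
    texts = [torch.tensor(ids, dtype=torch.long) for _, ids in batch]
    return labels, pad_sequence(texts, batch_first=True, padding_value=0)

# `tokenized_train` would hold the already-tokenized (label, token ids) pairs.
# train_loader = DataLoader(tokenized_train, batch_size=BATCH_SIZE,
#                           shuffle=True, collate_fn=collate_batch)

class TweetClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        # Assumes index 0 is treated as padding by the collate function above.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)   # hidden: (1, batch, hidden_dim)
        return self.fc(hidden[-1])             # logits: (batch, num_classes)
```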
For the following code segment, I created a for loop in which each sentence is cut to the same length. The program then compares the model's predictions to the true values, calculating the F1-score, precision, recall, and more. An iteration of gradient descent is performed, and the loop closes with training and evaluation. This whole process is repeated for each epoch of the for loop! Below are some evaluations of my model.
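My actual loop isn't shown here, but a hedged sketch of that train-then-evaluate pattern, reusing the placeholder names from the previous snippet (TweetClassifier, vocab, train_loader) and assuming a similarly built val_loader, might look like this; the metrics come from scikit-learn rather than anything specific to my code:

```python
import torch
from torch import nn
from sklearn.metrics import f1_score, precision_score, recall_score

MAX_LEN = 50  # placeholder: every sentence is cut to this length

model = TweetClassifier(vocab_size=len(vocab))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(5):                      # placeholder number of epochs
    model.train()
    for labels, token_ids in train_loader:
        token_ids = token_ids[:, :MAX_LEN]  # cut each sentence to the same length
        optimizer.zero_grad()
        loss = criterion(model(token_ids), labels)
        loss.backward()                     # one iteration of gradient descent per batch
        optimizer.step()

    # Evaluation: compare model predictions to the true values.
    model.eval()
    preds, truths = [], []
    with torch.no_grad():
        for labels, token_ids in val_loader:
            token_ids = token_ids[:, :MAX_LEN]
            preds.extend(model(token_ids).argmax(dim=1).tolist())
            truths.extend(labels.tolist())

    print(f"epoch {epoch}: "
          f"F1={f1_score(truths, preds, average='macro'):.3f} "
          f"precision={precision_score(truths, preds, average='macro'):.3f} "
          f"recall={recall_score(truths, preds, average='macro'):.3f}")
```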
Starting next week, I will analyze the results of my model, make corrections as needed, and explore the applications of my findings. Thank you for reading!