Week 7 - Creating Datasets
April 12, 2024
Hello everybody and welcome back to Week 7 of my Senior Project! This week, I spent most of my time compiling the titles of news articles to feed into my sentiment analysis model. Unlike with my first model (ARIMA), I couldn't find a dataset online that fit my needs for this project. The dataset I need consists of the titles of the news articles about each company I'm analyzing, collected on a daily basis, so I had to create my own datasets. To do so, I created two columns in Google Sheets, labeled "Date" and "News Article." The date format is the same as in the ARIMA model, since I knew the datetime module could read that format. I decided to use just the titles of the news articles because I have over 500 articles per company (since I'm taking the top 3 from each day). The average news article is around 1,000 words, so if I used the full articles, the model would have to analyze over half a million words, which I don't think my laptop can handle. The datasets looked like this (I used Amazon as an example):
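For anyone curious how I plan to read these sheets into Python, here's a minimal sketch. The filename and the CSV export step are just placeholders since I haven't written the actual loading code yet; it assumes the Google Sheet has been downloaded as a CSV with the two columns described above, and uses pandas to parse the "Date" column:

```python
import pandas as pd

# Hypothetical filename -- each company's Google Sheet would be exported as its own CSV.
df = pd.read_csv("amazon_news.csv")

# Parse the "Date" column into datetime objects so it lines up with the ARIMA price data.
# pandas can infer most common date formats, so this should work as long as the column
# matches whatever format the ARIMA dataset used.
df["Date"] = pd.to_datetime(df["Date"])

print(df.head())
```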
Now obviously, there aren't going to be 3 news articles for every single company every single day. There were days where I could only find one or two, or sometimes none at all. For those days I just put a "0," because 0 is a neutral number. When the sentiment analysis model analyzes a news article, it will give the article a score ranging from -1 to 1, with -1 being negative and 1 being positive. Assigning a value of 0 keeps those days entirely neutral, which I feel best represents what happens in real life: if no news articles are published, the opinion of the public can't be changed.
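I haven't built the sentiment analysis model yet (that's next week), but here's a rough sketch of how the "0" placeholder could be handled at scoring time, assuming a VADER-style analyzer whose compound score already falls in the -1 to 1 range. The vaderSentiment package and the score_title helper are just my placeholders for now, not a final design:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def score_title(cell):
    """Return a sentiment score in [-1, 1] for one cell of the 'News Article' column."""
    # Days with no article are stored as "0" in the sheet, so treat them as perfectly neutral.
    if str(cell).strip() == "0":
        return 0.0
    # VADER's compound score ranges from -1 (negative) to 1 (positive).
    return analyzer.polarity_scores(str(cell))["compound"]

print(score_title("Amazon reports record quarterly profit"))  # positive headline
print(score_title("0"))                                       # no-news day -> 0.0
```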
Before this week, I definitely underestimated how much time it would take to compile these datasets. Searching the internet for over 500 news articles and manually copy-pasting their titles into my dataset was extremely time-consuming. While I wasn't able to finish all 8 datasets by today, I want to finish them by Monday morning so I can start developing my sentiment analysis model. One of the datasets that I didn't get to was Sharp's Technology. However, I think this is going to be the one with the most 0's in my data sheet. The other 7 companies that I am analyzing are all extremely big, and therefore have many news articles published about them every single day. Until I researched stocks that went public in 2022, I had never even heard the name "Sharp's Technology." I hypothesize that it's going to be much harder to find a correlation between Sharp's Technology's stock price and the influence of news articles than for the other companies, simply because there are very few news articles about smaller companies. To find out whether this hypothesis is true, you'll have to wait until next week!