Week 8 - Positive or Negative?
April 19, 2024
Hello everybody, and welcome back to Week 8 of my Senior Project! This week, I did many different things, but in summary, I finished creating the datasets and developed my sentiment analysis model. If you remember from last week, I had started creating my datasets; this week, I continued working on them and created the rest. As predicted, the dataset of news articles for Sharp's Technology had the most 0's (meaning there weren't many news articles about this company).
After creating all of my datasets, I needed to develop my sentiment analysis model. However, I realized that in order to compare the results of my sentiment analysis model to the results of my ARIMA model, I would need to save the ARIMA forecasts in a dataset of their own. Therefore, I decided to take a step back and modify my ARIMA code to save all the forecasts. The process was pretty simple: I just had to add one command to create a new dataset with two columns (titled "Date" and "Forecast") and store all the values accordingly. However, I did get some error messages: when the ARIMA model tried to graph the data, the column titles got in the way, since the program couldn't graph "Date" and "Forecast" as they aren't real numbers (obviously). This wasn't a difficult problem to solve, though; I just removed the column titles and manually added them back into the file after the program finished running. An example of the data (from Amazon) is shown below:
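For anyone curious about the code side, here's a rough sketch of what that export step could look like, assuming pandas and statsmodels. The toy price series, the (1, 1, 1) model order, and the file name are all placeholders I made up for illustration, not my exact setup:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Toy closing-price series standing in for the real stock data
# (the actual project uses daily prices from Sep 2023 to Mar 2024).
prices = pd.Series(
    [151.9, 153.1, 152.4, 154.0, 155.2, 154.8, 156.1, 157.3],
    index=pd.date_range("2023-09-01", periods=8, freq="B"),
)

# Fit a simple ARIMA model; the (1, 1, 1) order here is just a placeholder.
model_fit = ARIMA(prices, order=(1, 1, 1)).fit()

# Forecast the next 5 business days.
forecast = model_fit.forecast(steps=5)

# Store the dates and forecasts as a two-column dataset.
results = pd.DataFrame({"Date": forecast.index, "Forecast": forecast.values})

# Writing with header=False leaves the column titles out of the file, so the
# graphing step doesn't try to plot "Date" and "Forecast" as if they were
# numbers; the titles can be added back into the file by hand afterwards.
results.to_csv("amazon_forecasts.csv", index=False, header=False)
```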
Next, it was time to develop my sentiment analysis model. From past projects, I had a lot of experience with the NLTK library, whose SentimentIntensityAnalyzer uses the VADER lexicon to assign a piece of text a sentiment score. Using this algorithm, I was able to assign each news article a sentiment score on a scale from -1 (most negative) to 1 (most positive). However, there was still a pretty big discrepancy between the results of my ARIMA model and the results of my sentiment analysis model: my ARIMA model had one data point per date (the daily forecast), but my sentiment analysis model had three (one sentiment score for each news article that day). Therefore, to keep just one data point per day, I decided to take the average of the sentiment scores of all three news articles and use that as the day's sentiment score. Finally, I wanted a qualitative measure of whether the news was positive or negative, not just a quantitative score, so I added a conditional statement: if the sentiment score was above 0, the day was labeled positive; if it was below 0, negative; and if it was exactly 0, neutral. The final sentiment analysis results (for Amazon) looked like this:
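And here's a minimal sketch of that scoring-and-averaging step, using NLTK's real SentimentIntensityAnalyzer. The helper name and the example headlines are made up for illustration:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

def daily_sentiment(articles):
    """Average the VADER compound scores of one day's articles and label the day."""
    scores = [sia.polarity_scores(text)["compound"] for text in articles]
    average = sum(scores) / len(scores)
    if average > 0:
        label = "positive"
    elif average < 0:
        label = "negative"
    else:
        label = "neutral"
    return average, label

# Made-up headlines standing in for one day's three news articles:
print(daily_sentiment([
    "Amazon reports record quarterly profits and strong growth.",
    "Analysts warn that rising shipping costs could hurt margins.",
    "Amazon announces a new partnership on AI-powered logistics.",
]))
```

The "compound" value is VADER's single normalized score from -1 to 1, which is why it maps so neatly onto the averaging and labeling scheme described above.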
Currently, I am in the process of running the sentiment analysis algorithm on each of my companies. While this data can definitely be used to see the effect news articles have on stock prices, there is one drawback to this sentiment analysis model: I don't really have a way of measuring how accurate it is. When designing the ARIMA model, I was able to take real-life stock prices from September 1, 2023 to March 1, 2024 (the time frame I was analyzing) and compare them to the forecasted prices to determine how accurate the forecasts were. However, since I manually created the news article datasets, nobody has conducted sentiment analysis on them before, so I don't have any test data to verify my results against. I want to spend the coming week running the sentiment analysis algorithm on the remaining companies and researching how I can measure the accuracy of this model (if there is even a way to do so). But will I be able to? Stay tuned until next week to find out!