Week 2 - It’s all about the Data

March 8, 2024

Hello everybody and welcome back to my blog! This week, I’ve been compiling the data I need as input for my ARIMA model. If you remember from my last post, the ARIMA model forecasts stock prices without external influences such as news articles. In this post, I’ll talk about preprocessing and visualizing my training data so that it is ready for the ARIMA model to forecast from.

Acquiring Datasets

My plan at the beginning of this project was to acquire the daily opening stock prices of the 7 companies I chose (Meta, Google, Nvidia, Tesla, Amazon, Apple, Microsoft) from March 1, 2023 to September 1, 2023. I found this information on Yahoo Finance, where I could apply filters such as my desired start and end dates and the interval at which I wanted the data. After setting my preferences and preprocessing the data, I realized that I only had around 129 data points, as there were many dates, such as weekends and holidays, on which the stock market was closed.

I wanted many more data points to give the model because, in general, the more training data it has to learn from, the better its predictions are likely to be. More data allows the model to detect recurring patterns rather than one-time occurrences, which can dominate a smaller dataset. Therefore, I decided to expand my time frame for training data to September 1, 2022 through September 1, 2023. This gave me 252 data points, which was sufficient for the ARIMA model. An example data set (Google) that I used from Yahoo Finance looked like this:
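A side note on the data point counts mentioned above: they can be sanity-checked with a few lines of pandas. This is only a rough sketch. It counts Monday-to-Friday dates with `pd.bdate_range` and does not know about stock market holidays, so its results are an upper bound on the number of trading days. The helper name `count_weekdays` is my own and is not part of the project.

```python
import pandas as pd


def count_weekdays(start: str, end: str) -> int:
    """Count Monday-Friday dates from start to end, inclusive.

    Market holidays are NOT removed, so this is an upper bound on
    the number of trading days in the range.
    """
    return len(pd.bdate_range(start=start, end=end))


# The original six-month window and the expanded one-year window.
six_months = count_weekdays("2023-03-01", "2023-09-01")
one_year = count_weekdays("2022-09-01", "2023-09-01")

print("Weekdays in six-month window:", six_months)
print("Weekdays in one-year window:", one_year)
```

Both weekday counts come out slightly higher than the 129 and 252 data points reported above, and the difference is what market holidays would be expected to remove.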

Preprocessing Data

As shown above, the data from Yahoo Finance contained a lot of unnecessary information and was formatted in a way that the model would not understand. There were also entries for events such as dividend payments, which I had to eliminate because I did not want those values fed into the model. Additionally, I only wanted the date and the opening price of each stock, so I had to eliminate columns such as the highest/lowest price, the closing/adjusted closing price, and the trading volume. To do this, I imported the dataset into Google Sheets and manually edited it to fit the requirements of the ARIMA model. In the end, I was left with two columns: the dates and the opening prices. The preprocessed dataset for Google looked like this:
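As a side note, the manual cleanup in Google Sheets described above could also be scripted. The following is a minimal pandas sketch, not the method used in this project. The column names, the layout of the dividend row, and all of the values in `RAW_CSV` are assumptions made up for illustration, and a real Yahoo Finance export may look different.

```python
import io

import pandas as pd

# Hypothetical sample imitating a downloaded price file; values are made up.
RAW_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
2022-09-01,100.0,101.0,99.0,100.5,100.5,1000
2022-09-02,0.25 Dividend,,,,,
2022-09-06,102.0,103.0,101.0,102.5,102.5,1200
"""


def preprocess(csv_source) -> pd.DataFrame:
    """Keep only the date and opening price, dropping non-price rows."""
    raw = pd.read_csv(csv_source)
    # Entries that are not plain numbers (such as a dividend note) become NaN.
    raw["Open"] = pd.to_numeric(raw["Open"], errors="coerce")
    raw["Date"] = pd.to_datetime(raw["Date"])
    # Drop the rows without a usable opening price, then keep two columns.
    cleaned = raw.dropna(subset=["Open"])[["Date", "Open"]]
    return cleaned.reset_index(drop=True)


clean = preprocess(io.StringIO(RAW_CSV))
print(clean)
```

With a real file, `preprocess("some_file.csv")` would take a path instead of the in-memory sample, and `clean.to_csv(...)` could save the two-column result.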

Visualizing Data

This week, I executed the first part of my ARIMA program by creating a graph of the data values I collected over the past few days. To do so, I imported two Python libraries, pandas and matplotlib. First, I used pandas, specifically its read_csv and datetime functions, to read the input dataset. Then, I used matplotlib to graph the values read by the program. I set the x-axis to be the date and the y-axis to be the stock price. The graph of Google’s stock price produced by my program is shown right above the actual chart for Google’s stock. Because the program only needed to read values from a spreadsheet at this stage rather than forecast them, the two charts look almost exactly the same.
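The reading and graphing steps described above can be sketched roughly as follows. This is an illustration under assumptions, not the exact script from the project. The column names `Date` and `Open`, the function name `plot_open_prices`, and the sample prices are all hypothetical.

```python
import io

import matplotlib

matplotlib.use("Agg")  # draw off-screen so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical two-column dataset (date, opening price); values are made up.
SAMPLE_CSV = """Date,Open
2022-09-01,100.0
2022-09-02,101.5
2022-09-06,99.8
2022-09-07,102.3
"""


def plot_open_prices(csv_source, title: str):
    """Read a date/price CSV and draw the opening price over time."""
    # parse_dates turns the Date column into real datetime values.
    data = pd.read_csv(csv_source, parse_dates=["Date"])
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(data["Date"], data["Open"])
    ax.set_xlabel("Date")
    ax.set_ylabel("Opening price (USD)")
    ax.set_title(title)
    fig.autofmt_xdate()  # tilt the date labels so they do not overlap
    return fig, ax


fig, ax = plot_open_prices(io.StringIO(SAMPLE_CSV), "Sample opening prices")
```

Calling `fig.savefig("open_prices.png")` afterwards writes the chart to an image file. In an interactive session, removing the `matplotlib.use("Agg")` line and calling `plt.show()` displays it in a window instead.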

As shown, Google’s price has fluctuated from September 2022 to September 2023. It dipped towards the end of 2022 to under 90 dollars per share, but rose throughout 2023 and was valued at over 130 dollars per share in September 2023.

My internal advisor, Mrs. Bhattacharya, suggested that I look at smaller companies in comparison to the larger ones that I use in my project. Therefore, I decided to look into some smaller startups that recently went public. I chose to look at the stock of Sharps Technology, a technology company that went public in early 2022. I compiled and graphed the stock prices for Sharps Technology, which are shown below:

Upon graphing this data, I immediately saw differences between this graph and those of bigger companies, such as Google. The biggest was that Sharps Technology’s stock fluctuated much more than Google’s. Some days had extremely large spikes while others had large drops. In contrast, Google’s stock remained relatively stable. I think that it will be much easier for the ARIMA model to forecast values for a relatively stable company that is in its mature growth phase. Since September 2023, Sharps Technology’s stock has unfortunately been unable to recover from the drop it faced. The ARIMA model definitely does not have the capacity to predict this, as the earlier data suggests that the stock price will come back up. Therefore, while it will be interesting to see how the model treats both datasets, I hypothesize that it will be better at predicting the stock prices of larger companies, as their datasets are easier to find patterns in.
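One way to put a number on how much more one stock fluctuates than another is to compare the standard deviation of their day-over-day percentage changes. The sketch below is only an illustration of that idea. The two price series are made up and are not the real Google or Sharps Technology data, and the function name `daily_volatility` is my own.

```python
import pandas as pd


def daily_volatility(prices: pd.Series) -> float:
    """Standard deviation of day-over-day percentage changes."""
    # pct_change leaves the first value empty, so it is dropped first.
    return prices.pct_change().dropna().std()


# Hypothetical opening prices, invented for illustration only.
stable = pd.Series([100.0, 101.0, 100.5, 101.5, 102.0, 101.0])
volatile = pd.Series([2.0, 2.6, 1.9, 2.8, 1.7, 2.4])

print(f"Stable series:   {daily_volatility(stable):.4f}")
print(f"Volatile series: {daily_volatility(volatile):.4f}")
```

A larger value means bigger day-to-day swings relative to the price, which would fit the kind of spikes and drops described above.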

And that’s it for Week 2 of my Senior Project! Thank you so much for reading and following along on my journey so far and I’ll see you guys next week.

View more of Aashvi J.'s posts.
