Week 6: Some Setbacks

April 14, 2023

Hey guys, welcome back to my blog! This week, I hit a couple of setbacks. This past weekend I had a little accident and ended up breaking my hand. I’m doing okay, but I’m in a full arm cast, so right now my sister is typing this. Anyway, here is what I was able to achieve this week: I picked another type of model to train, and I also tested one of the models on different subsets of the initial data set.

Picking The Next Model:

My initial plan was to use a new type of regression, Poisson Regression, but that type of regression predicts counts rather than class labels, so it would not give me a model accuracy to report. To keep my project consistent, I decided to go with a Random Forest model instead. Even though the Random Forest may not produce the highest accuracy, it lets me compare all three models at the end using their accuracy, losses, false positives, and false negatives.
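To make the difference concrete, here is a small sketch of what each kind of model hands back. It is not my project code: the data is randomly generated, and names like `poisson_predictions` are made up for the example. It assumes scikit-learn's `PoissonRegressor` and `RandomForestClassifier`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import PoissonRegressor

# Synthetic stand-in data (NOT the survey data): 3 numeric features, yes/no label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

poisson = PoissonRegressor().fit(X, y)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Poisson regression predicts an expected count: a positive float, not a class.
poisson_predictions = poisson.predict(X[:5])
# The classifier predicts a class label (0 or 1), which is what accuracy needs.
forest_predictions = forest.predict(X[:5])

print("Poisson:", poisson_predictions)
print("Forest: ", forest_predictions)
```

Accuracy is the fraction of predicted labels that match the true labels, so it only applies directly to the second kind of output.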

To create the Random Forest, I followed steps similar to the ones I used for the other models. I split the data set into X and Y values, and then I performed a train test split on the X and Y values, with a test size of 0.2. First, I created and trained one tree and checked its accuracy.
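The steps above can be sketched roughly like this. This is a minimal example and not my exact code: it assumes scikit-learn, uses generated data in place of the survey, and the `random_state` values are only there to make the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the X and Y values (assumption: 500 rows, 8 features).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 20% of the rows for testing, matching the test size of 0.2.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train a single decision tree and check its accuracy on the held-out rows.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
tree_accuracy = accuracy_score(y_test, tree.predict(X_test))

print(f"Single tree accuracy: {tree_accuracy:.2f}")
```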

Then, to improve on the single tree, I created and trained a collection of decision trees, which together make up a random forest. The highest accuracy I got was approximately 71%, which is 20.4% higher than the baseline accuracy.
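Here is a sketch of how a forest can be compared against a baseline. Several things in it are assumptions for illustration: the baseline is taken to be "always predict the most common class" (scikit-learn's `DummyClassifier`), the forest sizes 10, 50, and 100 are arbitrary, and the data is generated rather than the survey data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the survey data).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumed baseline: always predict the most frequent class in the training set.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# Try a few forest sizes (hypothetical choices) and record each test accuracy.
forest_scores = {}
for n_trees in (10, 50, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_train, y_train)
    forest_scores[n_trees] = forest.score(X_test, y_test)

best_n_trees = max(forest_scores, key=forest_scores.get)
best_accuracy = forest_scores[best_n_trees]

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Best forest: {best_n_trees} trees, accuracy {best_accuracy:.2f}")
```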

Testing Subsets Of Data:

When testing subsets of the data, I decided to use the Logistic Regression model since it was the simplest to use. For each test, I dropped one column and then ran the model on the remaining data. Here is a table listing each column I dropped and the resulting model accuracy. As a reminder, the accuracy of the Logistic Regression model was approximately 71% on the entire dataset.

Column Dropped	Model Accuracy
Gender	69%
Age	71%
Family History Of Mental Illness	72%
Remote Work	71%
Number Of Employees	71%
Work In A Tech Company	72%
Supervisor	72%

As you can see, most of these columns did not have much of an effect on the overall model accuracy. I was surprised to see that the accuracy actually went up by about one percentage point after removing certain columns, such as the ‘Supervisor’ and ‘Family History’ columns. The column that had the largest effect on the model accuracy was the ‘Gender’ column: dropping it lowered the accuracy from 71% to 69%. This suggests that gender carries some useful signal for predicting mental health treatment status, although a two-point difference is small.
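The drop-one-column test can be sketched like this. It is an illustration and not my project code: the column names, the made-up `treatment` label, and the helper `accuracy_without` are all hypothetical, and the data is randomly generated. It assumes pandas and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in table with hypothetical, already-numeric columns.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "gender": rng.integers(0, 2, n),
    "age": rng.integers(18, 65, n),
    "family_history": rng.integers(0, 2, n),
    "remote_work": rng.integers(0, 2, n),
})
# Made-up label loosely tied to two of the columns, plus noise.
noise = rng.normal(0, 0.5, n)
treatment = ((df["gender"] + df["family_history"] + noise) > 1).astype(int)


def accuracy_without(column=None):
    """Train Logistic Regression with one column removed (or none) and score it."""
    features = df if column is None else df.drop(columns=[column])
    X_train, X_test, y_train, y_test = train_test_split(
        features, treatment, test_size=0.2, random_state=0
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model.score(X_test, y_test)


# Full-data accuracy first, then one run per dropped column.
results = {"(none)": accuracy_without()}
for column in df.columns:
    results[column] = accuracy_without(column)

for name, accuracy in results.items():
    print(f"{name:15s} {accuracy:.2f}")
```

Using the same `random_state` for every split keeps the test rows identical across runs, so any change in accuracy comes from the dropped column and not from a different split.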

Next week, I will directly compare the models and begin discussing their implications.

Thank you for reading!

Sources:

Open Sourcing Mental Illness, LTD. “OSMI Mental Health In Tech Survey 2016.” Kaggle.com, 2016, www.kaggle.com/datasets/osmi/mental-health-in-tech-2016.

View more of Aashirya R.'s posts.