Week 3: Preparing The Dataset
March 24, 2023
Hi everyone, and welcome back to my blog! This week I decided to start working with the OSMI dataset so that I could start training my models. I began sending out the survey I created last week, and as of now, I have received 41 responses. I’m hoping that I can receive at least 100 within the next few weeks, so that I can begin looking at trends within that data as well.
Moving on to the OSMI Mental Health in Tech Survey dataset, I noticed that I would need to clean it before I could do any initial work on the data.
Cleaning The Dataset:
The first thing I noticed in the dataset was how some individuals put in ages that were not biologically possible, such as negative numbers and numbers over 100. In order to fix this, I decided to clip the age range to between 18 and 70, so that the ages would remain within a realistic range for a tech employee.
Next, I noticed that the gender column was free response, and users had entered many different responses, so I decided to condense these into 3 main categories: male, female, and other. This would also lead to better results when training a model.
I also made sure that all individuals had a treatment reported, since this is the variable I would be looking at when training the model.
Then I dropped columns that had too many null values or were unrelated to my project, including the ‘comments’, ‘state’, ‘work interfere’, and ‘timestamp’ columns. Finally, I noticed that there were a lot of different countries reported by individuals, so I decided to condense that column into ten main countries, with the remaining countries labeled as other.
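For readers who want to try something similar, here is a minimal sketch of the cleaning steps above using pandas. It is not my exact code: the sample rows are made up, and the column names (`Age`, `Gender`, `Country`, `treatment`, `Timestamp`, and so on), the keyword sets, and the helper names `normalize_gender` and `clean` are assumptions for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumed.
raw = pd.DataFrame({
    "Timestamp": ["2014-08-27"] * 7,
    "Age": [29, -5, 34, 329, 41, 23, 75],
    "Gender": ["M", "female", "Male ", "non-binary", "F", "woman", "Nonbinary"],
    "Country": ["United States", "Canada", "Latvia", "United States",
                "Canada", "Canada", "United States"],
    "treatment": ["Yes", "No", "Yes", None, "No", "Yes", "Yes"],
    "comments": [None, None, "n/a", None, None, None, None],
})

# Hypothetical keyword sets; a real free-response column needs a longer list.
MALE = {"m", "male", "man"}
FEMALE = {"f", "female", "woman"}


def normalize_gender(value):
    """Map a free-text gender answer to 'male', 'female', or 'other'."""
    text = str(value).strip().lower()
    if text in MALE:
        return "male"
    if text in FEMALE:
        return "female"
    return "other"


def clean(df, top_n_countries=10):
    """Return a cleaned copy of the survey frame."""
    # Drop sparse or unrelated columns; errors="ignore" skips absent names.
    df = df.drop(
        columns=["comments", "state", "work_interfere", "Timestamp"],
        errors="ignore",
    )
    # Keep only rows that report the target variable.
    df = df.dropna(subset=["treatment"]).copy()
    # clip() pulls out-of-range ages to the nearest bound;
    # dropping those rows instead would be a reasonable alternative.
    df["Age"] = df["Age"].clip(lower=18, upper=70)
    df["Gender"] = df["Gender"].map(normalize_gender)
    # Keep the most frequent countries and label everything else "Other".
    top = df["Country"].value_counts().nlargest(top_n_countries).index
    df["Country"] = df["Country"].where(df["Country"].isin(top), "Other")
    return df


# top_n_countries=2 only because the toy data has three countries.
cleaned = clean(raw, top_n_countries=2)
print(cleaned)
```

One design choice worth noting: clipping keeps every respondent but rewrites impossible ages to 18 or 70, while filtering would shrink the dataset. Which is better depends on how many rows are affected.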
Finding The Baseline Accuracy:
In order to have a starting point for my model, I needed to find the baseline accuracy: the accuracy a model would get by always predicting the most common outcome. Since slightly more than half of the individuals in the dataset have received treatment for their mental health, the baseline is the percent of individuals who received treatment, which is 50.6%.
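As a rough illustration of that calculation, the sketch below computes the accuracy of always guessing the most common label. The labels are made up (they are not the survey's counts), and `baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Made-up labels for illustration, not the real dataset.
labels = ["Yes"] * 5 + ["No"] * 3
label, acc = baseline_accuracy(labels)
print(f"Always predicting {label!r} scores {acc:.1%}")
# prints: Always predicting 'Yes' scores 62.5%
```

Any trained model should beat this number to be considered useful.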
Looking At Trends In The Dataset:
After cleaning the dataset, I wanted to examine some of the trends present within the data. The first thing I wanted to look at was the number of individuals who received treatment versus the ones who did not receive treatment. As shown by this chart, a little over 50 percent of individuals in the dataset did receive treatment for their mental health.
Next, I wanted to look at the relationship between treatment and gender. The segmented bar chart shows the share of employees who received treatment, grouped by gender. Around 45 percent of males, 68 percent of females, and 77 percent of those who identify as other received treatment. This indicates that, in this dataset, females and those who identify as other are more likely than males to have received treatment.
In order to examine the relationship between company size and employee mental health, I also decided to create a bar chart showing the size of the companies these individuals work for. The chart shows that most of these individuals work in smaller companies with fewer than 500 employees.
I also decided to create a plot showing the relationship between age and mental health. However, this plot showed no definitive trends.
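The percentages behind a segmented bar chart like this one can be computed with a normalized cross-tabulation. Here is a small sketch with made-up rows; the column names and the helper `treatment_rate_by` are assumptions, and the numbers it produces are not the survey's.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Gender": ["male", "male", "male", "male",
               "female", "female", "other", "other"],
    "treatment": ["Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No"],
})


def treatment_rate_by(frame, column):
    """Share of 'Yes' and 'No' treatment answers within each group of `column`."""
    # normalize="index" makes every row (group) sum to 1.
    return pd.crosstab(frame[column], frame["treatment"], normalize="index")


rates = treatment_rate_by(df, "Gender")
treated_share = rates["Yes"]
print(treated_share)

# With matplotlib installed, rates.plot(kind="bar", stacked=True)
# would draw the segmented bars.
```

The same helper works for any grouping column, such as a remote-work flag.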
Finally, I created a segmented bar chart to examine the correlation between working remotely and receiving treatment. Around two-thirds of the employees working remotely received treatment, while around a third of the employees not working remotely received treatment. This could indicate that working remotely takes a larger toll on the employees’ mental health, causing them to seek treatment.
I will continue to look at other trends in this data, and when I have the responses to my survey, I will compare those as well. Next week, I will begin creating and training my machine learning models.
Thank you for reading!
Sources:
- Open Sourcing Mental Illness, LTD. “OSMI Mental Health in Tech Survey 2016.” Kaggle.com, 2016, www.kaggle.com/datasets/osmi/mental-health-in-tech-2016.