Week 4: Starting On The Models
March 31, 2023
Hi everyone, welcome back to my blog! This week I began creating my machine learning models. I had already built the first one, a logistic regression model, so this week I started working on a neural network.
Logistic Regression Model:
The first model I created was a logistic regression model. Logistic regression is a statistical method that predicts a binary outcome, such as yes or no, from earlier observations in the data set. Before fitting the model, I needed to one-hot encode the categorical data. One-hot encoding removes a categorical variable and adds a new binary variable for each of its unique values. I one-hot encoded the ‘Gender’ and ‘Country’ columns, so every value in each new column is a 1 or a 0.
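To make the encoding step concrete, here is a minimal sketch that uses only the Python standard library. The helper `one_hot_encode` and the three sample rows are hypothetical and only illustrate the idea; with a data frame, a library call such as `pandas.get_dummies` would normally do the same job.

```python
def one_hot_encode(rows, column):
    """Replace `column` in every row with one 0/1 column per unique category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample rows; the real survey has many more columns and categories.
rows = [
    {"Age": 31, "Gender": "Female", "Country": "Canada"},
    {"Age": 27, "Gender": "Male", "Country": "United States"},
    {"Age": 45, "Gender": "Female", "Country": "United States"},
]

for column in ("Gender", "Country"):
    rows = one_hot_encode(rows, column)

print(rows[0])
```

Each original column disappears and is replaced by one indicator column per category, so a column with many categories (like a country column) can add many new columns.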
Next, I split the X and Y values, with Y being the treatment column and X being the rest of the columns.
Then, I performed a train/test split on the X and Y values with a test size of 0.2.
Finally, I defined and fit a logistic regression model.
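The three steps above can be sketched as follows. The post does not say which library was used (scikit-learn's `train_test_split` and `LogisticRegression` are a common choice), so this version uses only the standard library to show the mechanics. The function names, the toy rows, and the learning rate and epoch count are all assumptions made for illustration.

```python
import math
import random


def split_train_test(X, y, test_size=0.2, seed=0):
    """Shuffle the row indices and hold out a `test_size` fraction for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([X[i] for i in train_idx], [X[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(X, y, lr=0.5, epochs=5000):
    """Fit weights and a bias by batch gradient descent on the mean log loss."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_features):
                grad_w[j] += error * xi[j]
            grad_b += error
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, X):
    """Predict 1 when the estimated probability is at least 0.5, else 0."""
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
            for xi in X]


# Hypothetical toy rows: two 0/1 features followed by the treatment label.
rows = [[0, 0, 0], [0, 1, 0], [1, 0, 1], [1, 1, 1]] * 5
X = [row[:-1] for row in rows]   # every column except treatment
y = [row[-1] for row in rows]    # the treatment column

X_train, X_test, y_train, y_test = split_train_test(X, y, test_size=0.2)
w, b = fit_logistic_regression(X_train, y_train)
print(predict(w, b, X_test), y_test)
```

With a test size of 0.2, 20 rows split into 16 for training and 4 for testing. The toy labels are perfectly predictable from the first feature, which is not true of real survey data.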
To find the accuracy of the model, I divided the number of correct predictions by the total number of predictions. The final test accuracy of the model was 71.49%, which is 20.89 percentage points above the baseline accuracy of 50.6%.
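The accuracy calculation can be written out in a few lines. This sketch assumes the baseline is the accuracy of always predicting the most common class, which the post does not state explicitly. The label lists are made up for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Number of correct predictions divided by the total number of predictions."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def baseline_accuracy(y_true):
    """Accuracy of always guessing the most common class (assumed baseline)."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Hypothetical labels, not the real test set.
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

print(accuracy(y_true, y_pred))     # 8 of 10 predictions are correct
print(baseline_accuracy(y_true))    # 6 of 10 labels belong to the larger class

# The same subtraction applied to the figures reported in this post.
reported_gain = round(71.49 - 50.6, 2)
print(reported_gain)                # percentage points above the baseline
```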
The confusion matrix for the logistic regression model shows 26 false positives and 45 false negatives. This means the model predicted that 26 people would get treatment for mental health when they did not, and predicted that 45 people would not get treatment when they did. The false negatives are the more dangerous error, since the company might allocate fewer resources to employees when in reality more are necessary.
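For reference, the four cells of a binary confusion matrix can be counted as in the sketch below, where 1 means "got treatment" and 0 means "did not". The function name and the labels are hypothetical and do not reproduce the real test set.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            counts["tp"] += 1   # predicted treatment, got treatment
        elif predicted == 0 and actual == 0:
            counts["tn"] += 1   # predicted no treatment, got none
        elif predicted == 1 and actual == 0:
            counts["fp"] += 1   # predicted treatment, but got none
        else:
            counts["fn"] += 1   # predicted no treatment, but got it
    return counts


# Hypothetical labels for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]

print(confusion_counts(y_true, y_pred))
```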
Neural Network:
A neural network is an artificial intelligence method that teaches computers to process data through a network of connected nodes, inspired by the interconnected neurons in the human brain. To train a neural network on the data, I repeated many of the steps I took for the logistic regression model. I first split the X and Y values, with Y being the treatment column and X being the rest of the columns, and I used one-hot encoding again. Unlike for the logistic regression model, though, I also normalized each column in X to be between 0 and 1. Putting all of the variables on the same scale helps keep training stable.
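Scaling a column to the range 0 to 1 is usually done with min-max normalization: subtract the column's minimum, then divide by its range. The sketch below assumes that is the method meant here. The helper name and the sample ages are hypothetical.

```python
def min_max_normalize(column):
    """Rescale a list of numbers so the smallest maps to 0 and the largest to 1."""
    lowest, highest = min(column), max(column)
    if highest == lowest:
        # A constant column carries no information; map it to all zeros.
        return [0.0 for _ in column]
    return [(value - lowest) / (highest - lowest) for value in column]


# Hypothetical ages for illustration.
ages = [18, 25, 32, 60]
scaled_ages = min_max_normalize(ages)
print(scaled_ages)
```

One-hot encoded columns already contain only 0 and 1, so this step leaves them unchanged and mainly affects numeric columns.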
Then, I performed a train/test split on the X and Y values with a test size of 0.2. I decided to create a simple neural network with a few layers. After some testing, this architecture gave the highest accuracy of the ones I tried.
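The post does not list the layer sizes or the library used, so the following is only a rough sketch of what a small network for a yes/no target can look like. It assumes one hidden layer of three tanh units, a sigmoid output, and plain stochastic gradient descent. All names, sizes, and settings here are hypothetical, and in practice a library such as Keras or PyTorch would handle these details.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_inputs, n_hidden, seed=0):
    """Create small random weights for one hidden layer and one output unit."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(net, x):
    """Return the hidden activations and the predicted probability for one row."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["W1"], net["b1"])]
    output = sigmoid(sum(w * h for w, h in zip(net["W2"], hidden)) + net["b2"])
    return hidden, output


def log_loss(net, X, y):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for x, target in zip(X, y):
        p = min(max(forward(net, x)[1], 1e-12), 1 - 1e-12)
        total -= math.log(p) if target == 1 else math.log(1 - p)
    return total / len(X)


def train(net, X, y, lr=0.1, epochs=500):
    """Update the weights one row at a time using backpropagation."""
    for _ in range(epochs):
        for x, target in zip(X, y):
            hidden, p = forward(net, x)
            d_out = p - target  # gradient of the loss at the output pre-activation
            for j, h in enumerate(hidden):
                # tanh'(z) = 1 - tanh(z)^2; computed before W2[j] is updated
                d_hidden = d_out * net["W2"][j] * (1 - h * h)
                net["W2"][j] -= lr * d_out * h
                for i, xi in enumerate(x):
                    net["W1"][j][i] -= lr * d_hidden * xi
                net["b1"][j] -= lr * d_hidden
            net["b2"] -= lr * d_out
    return net


# Hypothetical toy data: the label simply copies the first feature.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

net = init_network(n_inputs=2, n_hidden=3)
loss_before = log_loss(net, X, y)
train(net, X, y)
loss_after = log_loss(net, X, y)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Changing `n_hidden`, or stacking more hidden layers, is the kind of architecture change that can be compared by test accuracy.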
The test accuracy for this model was approximately 71.9%, which is 21.3 percentage points above the baseline accuracy.
Next week, I will train multiple neural network architectures with different numbers of layers and nodes in order to find the highest possible accuracy.
Thank you for reading!
Sources:
- Open Sourcing Mental Illness, LTD. “OSMI Mental Health in Tech Survey 2016.” Kaggle.com, 2016, www.kaggle.com/datasets/osmi/mental-health-in-tech-2016.
- Lawton, George, et al. “Logistic Regression.” Business Analytics, TechTarget, 2022, www.techtarget.com/searchbusinessanalytics/definition/logistic-regression.
- Brownlee, Jason. “Ordinal and One-Hot Encodings for Categorical Data – MachineLearningMastery.com.” MachineLearningMastery.com, 11 June 2020, machinelearningmastery.com/one-hot-encoding-for-categorical-data/.
- “What Is a Neural Network? – Artificial Neural Network Explained – AWS.” Amazon Web Services, Inc., 2023, aws.amazon.com/what-is/neural-network/#:~:text=A%20neural%20network%20is%20a,that%20resembles%20the%20human%20brain.