Week 8-9: Creating The Model
May 1, 2023
Hey guys, welcome back to my project!
For the past few weeks I have been focusing on creating four different binary classification models (Logistic Regression, Random Forest, Naive Bayes, and XGBoost).
Below is a brief rundown of each model and how it performed, followed by a training sketch; further down, in the Context section, I'll go over the customer data the models were built on.
Brief Rundown of the Models:
- Logistic Regression
  - How it works: estimates how likely an event is to occur (in this case, churn) from a given set of independent variables (here, the customer's personal information).
  - Pros: easy to implement, interpret, and train.
  - Cons: constructs linear decision boundaries, so it can't solve non-linear problems.
  - Accuracy: 80%
- Random Forest
  - How it works: builds many decision trees to solve the problem and takes the majority vote across those trees.
  - Pros: can be used for both classification and regression tasks, and is resistant to outliers to a certain degree, which also helps explain this model's higher accuracy compared to the other models.
  - Cons: computationally intensive for large datasets like the one I used for this project.
  - Accuracy: 80%
- Naive Bayes
  - How it works: a supervised learning algorithm that applies Bayes' theorem with the "naive" assumption that every feature is independent of the others.
  - Pros: easy to implement, very straightforward, and very fast.
  - Cons: the assumption that all features are completely independent of each other rarely holds in practice, which can hurt the model's accuracy.
  - Accuracy: 75%
- XGBoost
  - How it works: like Random Forest, it uses a decision tree algorithm, but it predicts the target variable by combining the estimates of a set of smaller models (boosting).
  - Pros: highly efficient, flexible, and portable.
  - Cons: potential for overfitting, excessive memory usage, and overcomplexity.
  - Accuracy: 78%
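To make the comparison concrete, here is the training sketch mentioned above. It's a minimal sketch, assuming scikit-learn and xgboost, and assuming `X` and `y` are the preprocessed feature matrix and churn labels produced by the steps described in the Context section below:

```python
# Minimal training/evaluation sketch -- X and y are assumed to be the
# preprocessed features and churn labels (1 = churned, 0 = didn't churn).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}

for name, model in models.items():
    model.fit(X_train, y_train)    # train on the training split
    preds = model.predict(X_test)  # predict churn on the held-out split
    print(f"{name}: {accuracy_score(y_test, preds):.0%}")
```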
Context:
While working on the models, I found out that the dataset I downloaded wasn't fully preprocessed, which was an issue that had to be addressed. I fixed this problem by checking the entire dataset for invalid entries and removing them so that my code would run properly.
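Here is a minimal sketch of what that cleanup can look like, assuming pandas, and assuming the invalid entries are blank strings in the TotalCharges column (the filename is hypothetical); instead of an explicit loop over every row, this version uses pandas' vectorized coercion:

```python
import pandas as pd

df = pd.read_csv("telco_churn.csv")  # hypothetical filename

# Blank/invalid TotalCharges entries become NaN instead of breaking later code.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Remove the rows with invalid data so the models can train cleanly.
df = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)
```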
The dataset consists of over 7000 customers along with their personal attributes, which are shown in the picture below:
However, only a few of these attributes (tenure, MonthlyCharges, TotalCharges) are numerical; the majority are categorical, which is a problem because the models require numerical input in order to function. To address this, I converted the categorical data into numerical values, returning floating point values or integers depending on each customer's data. For example, if the customer is male we return 1, and if the customer is female we return 0. This way, categorical data the models can't read directly is represented as numerical data they can understand.
I then continued processing the data by creating a new column called "gender_clean", which holds the newly converted values, completely cleansed of invalid data. I repeated this method for nearly every other column in the dataset.
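A minimal sketch of that conversion, assuming pandas and assuming the raw values are literally "Male" and "Female":

```python
def convert_gender(value):
    # Return 1 for male customers and 0 for female customers.
    return 1 if value == "Male" else 0

# Store the converted values in the new "gender_clean" column.
df["gender_clean"] = df["gender"].apply(convert_gender)
```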
However, if a column has more than two possible values, such as "MultipleLines" (which can be "No", "Yes", or "No phone service"), we need a different method. Unlike columns such as "gender", we can't simply assign 0 and 1 to the options "No" and "Yes". We also can't just assign the value 2 to "No phone service": that would impose an artificial ordering, and having no phone service says much more about a customer's financial situation than simply having one line instead of several.
Instead, we fix this problem by creating two new columns in the dataset: "MultipleLines_no_phone_service" and "MultipleLines_yes", still using only the binary values 0 and 1. The method "convert_multiple_lines_1" returns 1 if the option is "No phone service" and 0 if it is "Yes" or "No"; we run it on the original "MultipleLines" column and store the results in "MultipleLines_no_phone_service". The second method, "convert_multiple_lines_2", returns 1 if the option is "Yes" and 0 if it is "No" or "No phone service"; we run it on "MultipleLines" as well and store the results in "MultipleLines_yes".
This way, we can tell what type of phone service each customer has using only binary values. For example, if both "MultipleLines_yes" and "MultipleLines_no_phone_service" are 0, the customer has exactly one phone line. If either column has the value 1, the customer falls into that category. The same operation was applied to the other columns with more than two possible values.
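Here is a sketch of the two converters; the exact function bodies are my reconstruction, but they match the behavior laid out above:

```python
def convert_multiple_lines_1(value):
    # 1 if the customer has no phone service at all, else 0.
    return 1 if value == "No phone service" else 0

def convert_multiple_lines_2(value):
    # 1 if the customer has multiple lines; 0 for "No" or "No phone service".
    return 1 if value == "Yes" else 0

df["MultipleLines_no_phone_service"] = df["MultipleLines"].apply(convert_multiple_lines_1)
df["MultipleLines_yes"] = df["MultipleLines"].apply(convert_multiple_lines_2)
```

For what it's worth, pandas' built-in pd.get_dummies can produce this same one-column-per-category encoding in a single call.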
Improvements that could be made:
Furthermore, the training data itself is imbalanced: the majority of customers in it didn't churn, which may lower the accuracy of the models.
In the image shown above, 3583 people didn't churn while only 1339 churned (0 means the customer didn't churn, 1 means they churned), highlighting the imbalance in the data.
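One quick way to see the imbalance, plus one common mitigation, assuming scikit-learn (the column name "Churn_clean" follows the naming convention from earlier and is hypothetical):

```python
# Count how many customers fall into each class (0 = didn't churn, 1 = churned).
print(df["Churn_clean"].value_counts())

from sklearn.linear_model import LogisticRegression

# class_weight="balanced" weights each class inversely to its frequency,
# so the minority (churn) class isn't drowned out during training.
model = LogisticRegression(max_iter=1000, class_weight="balanced")
```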
Another improvement that could be made is identifying the potential outliers within my dataset. As you can see from the boxplot below, the data is very unevenly distributed: the third quartile (Q3) sits very far from the maximum, while the interquartile range (IQR = Q3 − Q1) is quite small. Since the upper whisker of a boxplot is drawn at Q3 + 1.5 × IQR, that boundary falls well below the maximum here, so many of the largest values end up flagged as potential outliers, which may affect the model's accuracy.
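Here is a minimal sketch of that whisker calculation, assuming pandas and using TotalCharges purely for illustration (the boxplot's column isn't named in the post):

```python
q1 = df["TotalCharges"].quantile(0.25)  # first quartile
q3 = df["TotalCharges"].quantile(0.75)  # third quartile
iqr = q3 - q1                           # interquartile range

upper_whisker = q3 + 1.5 * iqr  # standard boxplot upper fence
outliers = df[df["TotalCharges"] > upper_whisker]
print(f"{len(outliers)} values sit above the upper whisker ({upper_whisker:.2f})")
```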
In my next and final post the following week, I will talk about the specifics of each model that was created: how to interpret each of the four models, how the F1-score was calculated, and what each model's accuracy means.