Week 8: It’s Definitely Not Normal to Not Normalize
May 14, 2025
Hi Guys! I’ve just wrapped up week 8 of my internship, and this week was incredibly productive! The week started with me playing around with my models some more, testing them with different data sets to get a better understanding of how they’ll perform! After last week’s strange struggles with normalization, I played around with the normalization technique and trained/tested the model on different sets of data to see how the performance would vary with each test! I mentioned this briefly in the last post too, but it turns out that LDA has a pretty high ‘variance’, a term in the machine learning world that indicates how closely a model’s performance is tied to its training dataset! The higher the variance of your model, the lower your ‘bias’, the term for the natural error you can expect from simplifying complex real-world relationships into simpler models. Au contraire, the lower the variance of your model, the higher the bias! This bias-variance tradeoff is a pretty important concept in machine learning and relates very closely to the problem of over/underfitting!
Typically, classification models like LDA have a higher variance, since they define their decision boundaries very closely around the data they’ve been shown. Meaning that if the training data changes, those decision boundaries can change pretty significantly! I saw this a lot when I tested my LDA model with different training/test data sets, especially since our datasets are smaller than typical classification datasets, which can contain upwards of 1000 data points! For example, if the model was only trained on binary mixtures, it would be pretty good at identifying binary mixtures with similar compositions to the ones it had seen before. However, it struggled A LOT with classifying tri-part mixtures, even if the percentage of the target chemical was something it had seen before. It also started confusing its binary mixture classifications as well.
When I normalized the data, it seemed to make the ROC curves that I was using to assess its performance worse, which wasn’t normal, so I met with Brendan to see if we could figure out a way to fix the problem! For the tests in which I normalized the data, the model was trained on binary mixtures and tested on tri-part mixtures, which might have been the reason it looked like normalization made it worse. Future note to self: if you’re going to make an adjustment to your method (like normalizing the data), keep the test/training sets constant before and after so you can see the actual effect of normalizing! We found that normalizing the data and including a significant amount of tri-part mixtures (at least 3 mixtures, i.e. 9 tri-part spectra) in the training data set seemed to really help with the overall classification skills. The area under the ROC curve went from 0.5 to around 0.8 for each of the targets, which is a big improvement! Although, LDA’s high dependency on its training set may make it harder to work with for on-field detection, when it’s bound to come across cutting agents and percentages it’s never seen before!
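In case it helps to picture what I mean by “keep the test/training sets constant before and after,” here’s a rough sketch of that kind of comparison in scikit-learn. The data here is made-up stand-in data (not our actual IR/Raman spectra), so the numbers won’t match mine, but the structure is the same: fix one split, then evaluate the ROC AUC with and without normalization.

```python
# Sketch: compare LDA ROC AUC with vs. without normalization on a FIXED train/test split.
# Uses synthetic stand-in data, not our real spectra.
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Stand-in "spectra": 200 samples x 100 features, binary label (target chemical present or not)
X, y = make_classification(n_samples=200, n_features=100, n_informative=20, random_state=0)

# Keep the split constant so normalization is the only thing that changes between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for normalize in (False, True):
    if normalize:
        scaler = StandardScaler().fit(X_train)              # fit scaling on training data only
        X_tr, X_te = scaler.transform(X_train), scaler.transform(X_test)
    else:
        X_tr, X_te = X_train, X_test

    lda = LinearDiscriminantAnalysis().fit(X_tr, y_train)
    scores = lda.decision_function(X_te)                     # continuous scores for the ROC curve
    print(f"normalized={normalize}: ROC AUC = {roc_auc_score(y_test, scores):.2f}")
```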
I also tested out another model called Support Vector Regression (SVR), which is kind of like linear regression, but it introduces this concept of a ‘soft margin’: basically giving the model some set amount of slack to get some observations wrong in order to better fit the rest of the observations as a whole! Basically, the ‘loss’ that this algorithm is trying to minimize looks a little different than in simple linear regression, and it depends pretty closely on how much slack the researcher decides to allow. When coding it in Python using the scikit-learn package, the penalty on that slack is controlled by the parameter C, and it’s usually best to find this parameter using cross-validation techniques (which scikit-learn also has many functions for). SVR can be implemented with different kernel functions too (like linear, polynomial, etc.). However, for this week I mostly explored linear SVR, and my results were certainly interesting! SVR seemed to perform better for mixtures that it had never come across before, which is something the other models I’ve tried definitely struggled with! So far, SVR seems to have the most promising results of all the models I’ve been playing around with (it has the best ROC curves by far), so I’m excited to pitch it to our team!
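And here’s a rough sketch of what tuning that C parameter with cross-validation can look like in scikit-learn. Again, the data is a made-up stand-in rather than our real spectra, and the C values in the grid are just placeholder guesses:

```python
# Sketch: pick linear SVR's slack penalty C via cross-validation (synthetic stand-in data).
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVR
from sklearn.model_selection import GridSearchCV

# Stand-in data: features -> percentage of the target chemical in the mixture
X, y = make_regression(n_samples=150, n_features=100, n_informative=20, noise=5.0, random_state=0)

# Put scaling and LinearSVR in a pipeline so the scaler is re-fit on each cross-validation fold
pipe = make_pipeline(StandardScaler(), LinearSVR(max_iter=10000))

# Search over C: higher C = less tolerance for points falling outside the margin
grid = GridSearchCV(pipe, {"linearsvr__C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)

print("best C:", grid.best_params_["linearsvr__C"])
print("cross-validated R^2:", round(grid.best_score_, 2))
```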
Throughout the week I’ve developed a pretty good balance between lab work and computer work! I spend Mondays working on code, and then Tuesdays and Wednesdays working in the lab to get samples! I’m a triple threat in the lab now: prepping, running samples on the IR, and running them on the Raman! Honestly, I’ve come to really love my time in the lab! I feel super productive when I’m running through samples!
As I move forward, I’m probably going to start putting together a more coherent, presentation-worthy summary of all the model testing and development I’ve been doing for my final presentation!
That’s all for now!
See you next week!
Natasha