Week 2: Navigating the Waters of Clustering and Label Accuracy
March 9, 2024
Welcome to the latest update on my senior project, where the journey through data analysis continues. This week’s focus has been on further exploring K-means clustering and how it holds up against various representations of our data set.
In the previous installment, I clustered water usage data into seven categories. My aim was to observe whether a more granular approach might yield clearer distinctions between clusters. After experimenting with cluster counts ranging all the way from 1 to 300, I turned my attention back to the sweet spot of seven clusters.
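A sweep like the one described above is typically done by fitting K-means at each cluster count and tracking the within-cluster sum of squares (inertia). The sketch below is a minimal, hypothetical version using scikit-learn on synthetic stand-in data; the real water-usage features are not shown in the post, and the sweep is truncated well short of 300 for brevity.

```python
# Hypothetical sketch of a cluster-count sweep; X is a placeholder
# for the real per-event water-usage feature matrix.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # stand-in usage features

inertias = {}
for k in range(1, 21):  # the post sweeps k from 1 up to 300
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

# Inertia shrinks as k grows; the usual heuristic is to pick the
# "elbow" where the marginal improvement flattens out.
print(min(inertias), max(inertias))
```

Plotting `inertias` against `k` gives the familiar elbow curve that helps justify settling on a value like seven.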
Interestingly, while the majority of these clusters were well-defined, two overlapped significantly. This observation was confirmed by a stacked bar plot, which should have shown clear segmentation of water-appliance usage within each cluster but instead revealed inconsistencies (Images 1 & 2).
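One common way to build a stacked bar plot like that is to cross-tabulate each point's cluster assignment against its appliance label. The sketch below is illustrative only: the feature matrix, label names, and cluster count are stand-ins, not the project's actual data.

```python
# Hypothetical sketch: label composition per cluster via a cross-tab.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))  # stand-in usage features
labels = rng.choice(["shower", "faucet", "toilet"], size=300)  # stand-in labels

clusters = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X)
counts = pd.crosstab(pd.Series(clusters, name="cluster"),
                     pd.Series(labels, name="appliance"))

# counts.plot(kind="bar", stacked=True)  # renders the stacked bar chart
print(counts)
```

If the clusters cleanly separated appliances, each row of `counts` would be dominated by a single column; overlapping clusters show up as rows with mixed label counts.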
This brings us to a crucial point: the integrity of our labels. The data was manually annotated based on individuals' accounts of their water usage. The margin for human error here cannot be ignored; it could be significant enough to skew our model's performance.
Upon inspecting the labels more closely, I discovered a discrepancy. The variability in the data was considerable, suggesting that the features identified by the supervised model might be too specific to certain appliances, rendering the labels ineffective for reliable model training.
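One quick way to quantify that kind of label variability is to compare each label's within-group feature spread against the spread of the data set as a whole. The feature names and distributions below are made-up placeholders, not the project's real measurements.

```python
# Hypothetical check: how spread out is each label relative to the
# whole data set? Ratios near 1 mean the label barely narrows things down.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "duration_s": rng.gamma(2.0, 30.0, size=300),  # stand-in feature
    "flow_lpm":   rng.gamma(2.0, 4.0,  size=300),  # stand-in feature
    "label":      rng.choice(["shower", "faucet", "toilet"], size=300),
})

within = df.groupby("label").std(numeric_only=True)   # per-label spread
overall = df.drop(columns="label").std()              # whole-data-set spread
ratio = within / overall

print(ratio.round(2))
```

Labels whose ratios sit close to 1.0 across the features are doing little to distinguish appliances, which is exactly the symptom that would undermine supervised training on those annotations.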
To address this, I’ve taken a more hands-on approach to data collection, personally ensuring the accuracy of the labels. This new, meticulously gathered data will serve as the foundation for subsequent model evaluations.
The images included show that while the separation of some clusters appears evident at a glance, the underlying distribution of labels tells a different story, a reminder of the challenges we face in this field (Images 1 & 2).
Moving forward, I’ll be harnessing this refined data set to test the robustness of my models. Stay tuned as we strive for more precision in our clustering algorithms and a deeper understanding of our data.
And that’s the current snapshot of my project: equal parts challenging and exhilarating. Here’s to clearer insights in the weeks to come!
Comments
swethabhattacharya says
Hello Tiya, now that the pictures are showing, it makes more sense. Please provide more information about what K-means clustering is and how it works. ~ Mrs. B