Week 7: Unraveling K-Means Clustering: Analyzing Time and Water Meter Readings
April 11, 2024
I worked on analyzing the results of k-means with 7 clusters for two models: one using both time and water meter readings, and one using just the water meter readings. As a reminder, the dataset contains water meter readings for each time an appliance was (hypothetically) used. This process was automated, so there is no guarantee it was done with 100% accuracy, although the analysis showed it performed well. Since time could be another indicator of the appliance, I tested k-means with and without it to see which model performs better.
Although the goal was to further analyze last week's k-means results, I found a big error. The size of the data was too large to reduce into 2 dimensions, which caused a big problem with the plots and also hinted at a failure within the k-means algorithm itself.
To debug the problem, I experimented. First, I tested the plots with all the data, including time, which has a size of torch.Size([10132, 1536, 7]). However, it was suggested that time and gallons together might be pushing the model in the wrong direction, so I also tested the model without time, focusing just on gallons. The features might be more stable this way: torch.Size([10132, 1536, 1]).
The plots are shown below:
When I saw that the two plots were the same, I knew something was wrong, since the points should have at least moved: the features are definitely not the same. So I began to review each step and noticed something odd about the transformation of my tensor from 3D to 2D. Keep in mind that the k-means algorithm can only take 2D data, so my data has to be reshaped. However, I was reshaping it to torch.Size([15562752, 1]), when it should instead have been torch.Size([10132, 1536]).
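To make the reshape mistake concrete, here is a minimal sketch on a tiny stand-in array. It uses NumPy rather than torch, and the small sizes (4 samples, 6 readings, 1 feature) are made up for illustration; they are not my real data. The idea is the same: flattening everything into one column turns every single reading into its own "sample", while keeping the first dimension gives one row per sample.

```python
import numpy as np

# Hypothetical small stand-in for the gallons-only tensor of shape (10132, 1536, 1):
# 4 samples, 6 readings per sample, 1 feature per reading.
n_samples, n_steps, n_features = 4, 6, 1
data = np.arange(n_samples * n_steps * n_features, dtype=float).reshape(
    n_samples, n_steps, n_features
)

# Buggy flattening: every individual reading becomes its own row.
wrong = data.reshape(-1, 1)

# Intended flattening: one row per sample, one column per (reading, feature) pair.
right = data.reshape(n_samples, -1)

print(wrong.shape)  # (24, 1)
print(right.shape)  # (4, 6)
```

With the real sizes, 10132 * 1536 = 15562752, which is exactly the first dimension of the wrong shape. The same row-per-sample reshape applied to the 7-feature data would give 1536 * 7 = 10752 columns.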
Upon fixing this issue, I noticed my plots and clusters changed accordingly. The left plot includes time and gallons, the right plot includes only gallons.
This method of using PCA with k-means was performed earlier to ensure the PyCharm model was correct. Visually, it is the same plot inverted, which means the algorithms in my other application still work, and I can revert to only one platform. However, for ease in this project, I will continue to use Google Colab to display the various plots (sklearn is easy to use there).
On to the analysis of my two corrected k-means plots: the plot with time seems to be more curved, which suggests time massively affects the model. But there is no way to prove this without looking at the individual data points within the clusters and testing whether or not they are in the same group.
During this analysis, I came across yet another problem. My original data looked nothing like my new data, which had undergone min-max normalization and been reshaped from 3D to 2D. Recovering the original values would require a lot of meticulous work, and I could not afford a single mistake. Instead, I decided my workaround would be the indices: I would get the indices of the data points within the clusters and then take those same indices from the original set of data in order to analyze it. Until next time, when more plots and graphs from these endeavors will be shown.
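Here is a rough sketch of the index workaround, again on random data with made-up sizes. The variable names (original, scaled, members) are just for this example. The key point is that reshaping and normalizing do not reorder the rows, so row i of the clustered data still corresponds to sample i of the original 3D data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the raw 3D data: (samples, readings, features).
rng = np.random.default_rng(0)
original = rng.random((50, 12, 1)) * 100

# Same preprocessing idea as before: one row per sample, then min-max per column.
flat = original.reshape(len(original), -1)
mins, maxs = flat.min(axis=0), flat.max(axis=0)
scaled = (flat - mins) / (maxs - mins)

labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(scaled)

# Indices of the samples assigned to one cluster (cluster 0 chosen arbitrarily).
cluster_id = 0
idx = np.flatnonzero(labels == cluster_id)

# Use those same indices on the untouched original data.
members = original[idx]
print(members.shape[1:])  # (12, 1)
```

This avoids having to undo the normalization and reshape by hand: the original array is never modified, only indexed.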
Comments
Cindy Z. says
The graphs look amazing!