Week 10: Developing a Better Clustering Strategy for Water Usage Analysis
May 4, 2024
This week, I made significant progress on the unsupervised part of my project. As I mentioned last week, one of my biggest challenges was how noisy the data was. The readings were chaotic enough that it was extremely difficult to tell which appliances, or even how many, were running. As a first step, I reduced the raw data to a set of features that a model could use to identify the appliances.
After analyzing the dataset and researching how water experts determine which appliances are in use, I narrowed down my features to the following:
– Duration
– Total water consumption
– Mean flow rate
– Maximum flow rate
– Minimum flow rate
– Flow rate variance
– Flow rate changes
– Number of spikes
– Duration of high flow
– Time of day
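As a rough illustration, most of these features can be computed from a single usage event's flow-rate readings. The sketch below is not the project's actual code: the function name, the units (litres per minute, one reading per second), and the `high_flow` and `spike_jump` thresholds are all assumptions made for the example, and "flow rate changes" is interpreted here as the number of readings that differ from the previous one. Time of day is left out because it comes from the event's timestamp rather than from the readings.

```python
from statistics import mean, pvariance


def extract_features(flow, sample_seconds=1.0, high_flow=6.0, spike_jump=2.0):
    """Summarise one water-usage event as a dict of features.

    flow: flow-rate readings in litres/minute at a fixed sampling interval
          (assumed units).
    high_flow, spike_jump: illustrative thresholds, not values from the project.
    """
    if not flow:
        raise ValueError("flow must contain at least one reading")

    # Change between consecutive readings.
    diffs = [b - a for a, b in zip(flow, flow[1:])]

    return {
        "duration_s": len(flow) * sample_seconds,
        # litres/minute * seconds / 60 -> litres
        "total_volume_l": sum(flow) * sample_seconds / 60.0,
        "mean_flow": mean(flow),
        "max_flow": max(flow),
        "min_flow": min(flow),
        "flow_variance": pvariance(flow),
        # Number of readings that differ from the previous reading.
        "flow_changes": sum(1 for d in diffs if d != 0),
        # A "spike" here is a sudden rise of at least spike_jump.
        "n_spikes": sum(1 for d in diffs if d >= spike_jump),
        "high_flow_duration_s": sum(1 for f in flow if f >= high_flow) * sample_seconds,
    }


# A made-up event: flow ramps up, holds, and ramps down.
event = [0.0, 4.0, 8.0, 8.0, 4.0, 0.0]
features = extract_features(event)
```

Each event then becomes one row of numbers, which is the form a clustering model expects.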
I initially considered removing ‘time of day’ because it was stored as text rather than as a number, but I decided to keep it for the sake of completeness and re-evaluate its impact on the model’s complexity later.
Of these features, I expected that the mean flow rate would play a key role in my model’s ability to identify specific appliances. It tends to remain consistent for a given appliance, regardless of when it’s used. The main source of confusion occurs when multiple appliances run concurrently or within a similar timeframe. This is where I began developing my models.
I knew I would eventually rely on GMM (Gaussian Mixture Model), which produces soft clusters: each data point gets a probability of belonging to every cluster, so it can belong to more than one. Even so, I first tried K-means clustering to test the “appliance theory.” I found that K-means didn’t handle outliers or heavy overlap well, because it forces every point into exactly one cluster. This led me to switch to DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which is robust against noise and outliers and focuses on dense regions where water usage patterns are consistent.
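The difference in how the two algorithms treat outliers can be seen on a toy example. This is a minimal sketch using scikit-learn (an assumption, since the post does not name a library) on synthetic one-dimensional "mean flow rate" values. The data, `eps`, and `min_samples` values are invented for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)

# Two synthetic "appliances" with tight mean flow rates, plus one extreme reading.
low = rng.normal(loc=2.0, scale=0.1, size=(30, 1))
high = rng.normal(loc=8.0, scale=0.1, size=(30, 1))
outlier = np.array([[30.0]])
X = np.vstack([low, high, outlier])

# K-means has no notion of noise: every point, including the outlier,
# is assigned one of the k cluster labels.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN labels points that are not in any dense region as -1 (noise).
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print("K-means label of outlier:", kmeans_labels[-1])
print("DBSCAN label of outlier:", dbscan_labels[-1])
```

Here DBSCAN sets the extreme reading aside as noise, while K-means has to fold it into one of its two clusters, which distorts that cluster's center.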
DBSCAN requires two parameters: EPS, the maximum distance between two points for them to be treated as neighbors, and min_samples, the minimum number of points that must lie within EPS of a point (counting the point itself) for it to be treated as part of a dense region. Given that appliances should share some commonalities, the min_samples parameter wasn’t as critical in my context. To determine the optimal EPS, I plotted the sorted nearest-neighbor distances and identified the elbow in the curve at 1.5, which became my chosen value for this parameter.
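One common way to produce such a curve is a k-distance plot: compute each point's distance to its k-th nearest neighbor, sort those distances, and look for the point where the curve bends sharply upward. The sketch below assumes scikit-learn and uses synthetic data. The helper name `k_distances` and the choice `k = 5` are hypothetical, not taken from the project.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_distances(X, k):
    """Return each point's distance to its k-th nearest neighbour, sorted ascending."""
    # Ask for k + 1 neighbours because, when querying the training data,
    # each point is returned as its own nearest neighbour (distance 0).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])


rng = np.random.default_rng(0)

# Synthetic feature rows: two dense groups plus a few scattered points.
dense_a = rng.normal(loc=0.0, scale=0.3, size=(50, 2))
dense_b = rng.normal(loc=5.0, scale=0.3, size=(50, 2))
scattered = rng.uniform(low=-10.0, high=15.0, size=(5, 2))
X = np.vstack([dense_a, dense_b, scattered])

curve = k_distances(X, k=5)

# Plotting `curve` against its index (for example with matplotlib) gives the
# k-distance plot; the elbow is a candidate value for EPS.
print(curve[:3], curve[-3:])
```

Points in dense regions sit on the flat part of the curve, and the scattered points produce the steep tail, so the elbow marks roughly where "dense" ends and "noise" begins.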
The results from the clustering models were promising. The clustering based on “mean flow rate” showed more diversity, which suggests it is a useful feature for telling different water appliances apart. However, I noticed that many data points could plausibly belong to more than one cluster, so I might need to revisit GMM for further analysis.
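GMM addresses exactly this situation by reporting a membership probability for every cluster rather than a single label. The following is a minimal sketch, assuming scikit-learn and two overlapping synthetic groups of mean flow rates. The 0.9 cutoff for calling a point "ambiguous" is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two synthetic appliances whose mean flow rates overlap.
a = rng.normal(loc=2.0, scale=0.5, size=(100, 1))
b = rng.normal(loc=4.0, scale=0.5, size=(100, 1))
X = np.vstack([a, b])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# One row per point, one column per cluster; each row sums to 1.
probs = gmm.predict_proba(X)

# Points where no single cluster is a clear winner.
ambiguous = probs.max(axis=1) < 0.9
print("ambiguous points:", int(ambiguous.sum()), "of", len(X))
```

Points flagged as ambiguous are the ones a hard clustering would have to assign arbitrarily. With GMM they can be kept and reported as shared between clusters.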
To validate my approach, I tested my models on newly collected data from simpler water appliances. The plots were encouraging, especially for “mean flow rate,” which again showed improved diversity in clustering. The outcome was similar to the results from DBSCAN, with some points possibly belonging to two or more clusters. This indicated that the models were flexible enough to accommodate various scenarios of water appliance usage.
My goal for next week is to continue refining my models and validating my clustering approach with this new dataset. I also aim to test my models on additional data and explore whether keeping ‘time of day’ in the feature set adds value. With more controlled conditions and further analysis, I hope to gain deeper insights into identifying and categorizing various water appliances.