Week 4: Unveiling Water Wastage: A Deep Dive into Leak Detection Through Data Analysis
March 22, 2024
This week brought some unexpected challenges in the form of errors and outliers in my dataset. I had recently processed my data into groups, where a group ends once the water meter reading stops changing for a minute or longer. While analyzing the groups and working on the padding phase, I made a startling discovery: the largest group contained 1536 water meter readings.
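The grouping step can be sketched roughly as follows. This is only an illustration, not the exact code from the project: it assumes readings arrive as sorted `(timestamp_in_seconds, meter_value)` pairs, and the function name `group_readings` and the `idle_gap_s` parameter are hypothetical.

```python
def group_readings(readings, idle_gap_s=60):
    """Split (timestamp_s, meter_value) readings into usage groups.

    Assumption: a reading counts as "usage" only if the meter value changed
    since the previous reading. A new group starts when the value has not
    changed for idle_gap_s seconds or more.
    """
    groups = []
    current = []
    last_change_t = None
    prev_value = None
    for t, value in readings:
        if prev_value is not None and value != prev_value:
            # The meter moved; check how long it had been flat before this.
            idle_long_enough = (
                last_change_t is not None and t - last_change_t >= idle_gap_s
            )
            if idle_long_enough and current:
                groups.append(current)
                current = []
            current.append((t, value))
            last_change_t = t
        prev_value = value
    if current:
        groups.append(current)
    return groups


# Made-up readings: water runs, stops for 90 seconds, then runs again.
readings = [(0, 100), (10, 101), (20, 102), (30, 102),
            (100, 102), (110, 103), (120, 104)]
groups = group_readings(readings)
group_sizes = [len(g) for g in groups]
```

With this layout, the count of readings in each group is simply `len(g)`, which is the number that reached 1536 for the largest group in the real data.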
At first I dismissed this figure as an outlier and considered removing it, but then I found several other groups with similarly large counts. After a lot of analysis and debugging, I traced them to a significant leak in our bathroom dating back to 2023. The leak stood out clearly in my water meter reading groups as continuous water usage lasting more than 12 hours. Although handling leaks was not a priority in the early stages of my project, this discovery showed me that I need to account for them.
This raised a question about how to process the data: padding the groups or using time series databases. The latter is appealing, but it requires a new generator, which complicates incorporating fresh data. Because the task is unsupervised, padding with zeros might in theory have little effect on the results. My first approach to identifying leaks used a simple count metric: any group with more than 100 readings was assumed to be a leak. This proved inadequate, because frequent meter readings can inflate the count without indicating prolonged water use.
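To make the padding and the count metric concrete, here is a small sketch. It assumes each group is just a list of per-reading values; the helper names `pad_groups` and `flag_by_count` and the sample numbers are made up for illustration.

```python
def pad_groups(groups, pad_value=0.0):
    """Right-pad every group with pad_value to the length of the longest group."""
    target = max(len(g) for g in groups)
    return [list(g) + [pad_value] * (target - len(g)) for g in groups]


def flag_by_count(groups, max_readings=100):
    """Count metric: flag a group as a suspected leak if it has more than
    max_readings readings (100 is the threshold mentioned above)."""
    return [len(g) > max_readings for g in groups]


# Hypothetical groups: two short uses and one very long one.
groups = [[1.0, 2.0, 1.5], [0.5] * 5, [0.2] * 150]
padded = pad_groups(groups)
flags = flag_by_count(groups)
```

After padding, every group has the same length (150 in this toy case), which is what a fixed-size clustering input needs. The count flag only looks at how many readings a group has, so a meter that reports very often can trip it even when the water ran only briefly.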
Switching to a timer metric, which flags any water usage lasting more than 60 minutes as a leak, gave a more nuanced picture. Still, some long stretches of water use are legitimate, such as lengthy showers, so I am now considering two approaches: applying k-means clustering with padding up to 1536, or removing all detected leaks first and then applying k-means clustering with padding up to 356. The experiment should show whether the clustering can identify leaks on its own or whether a preliminary leak-removal phase is needed for effective unsupervised learning. Beyond the technical challenges of data analysis, this work also highlights the environmental problem of water wasted through unnoticed leaks.
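The two planned experiments could look something like the sketch below. Everything here is an assumption made for illustration: the groups are toy data stored as `(timestamp_in_seconds, flow)` pairs, the pad lengths are 400 and 30 rather than 1536 and 356, and `kmeans` is a bare-bones version of Lloyd's algorithm with a deterministic farthest-first start, standing in for whatever library implementation the project ends up using.

```python
def flag_by_duration(groups, max_minutes=60):
    """Timer metric: flag groups lasting longer than max_minutes."""
    return [(g[-1][0] - g[0][0]) / 60 > max_minutes for g in groups]


def pad_flows(groups, pad_value=0.0):
    """Keep only the flow values and zero-pad to the longest group."""
    target = max(len(g) for g in groups)
    return [[f for _, f in g] + [pad_value] * (target - len(g)) for g in groups]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Minimal k-means (Lloyd's algorithm) returning one label per point."""
    # Farthest-first initialisation keeps this sketch deterministic.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def make_group(start_s, n, step_s, flow):
    """Build a toy group of n readings spaced step_s seconds apart."""
    return [(start_s + i * step_s, flow) for i in range(n)]


groups = [
    make_group(0, 6, 10, 2.0),       # under a minute
    make_group(1000, 8, 10, 2.0),    # about a minute
    make_group(5000, 30, 10, 1.0),   # about five minutes
    make_group(9000, 400, 10, 0.3),  # over an hour, leak-like
]
leak_flags = flag_by_duration(groups)

# Approach 1: cluster everything, padded to the longest group.
vectors_all = pad_flows(groups)
labels_all = kmeans(vectors_all, k=2)

# Approach 2: drop flagged groups first, then pad and cluster the rest.
kept = [g for g, is_leak in zip(groups, leak_flags) if not is_leak]
vectors_kept = pad_flows(kept)
labels_kept = kmeans(vectors_kept, k=2)
```

Comparing `labels_all` with `leak_flags` shows whether the clustering separates the long group by itself, while `labels_kept` shows how the remaining groups cluster once the suspected leaks are gone and the padding is much shorter. Real data will be far messier than this toy case, so the result here says nothing about which approach will work better.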
Comments
Aadya G. says
Great post, Tiya! I’m excited to see which method works better for you.
Cindy Z. says
I like how much detail you put into narrating your discovery! Could you add pictures of what the 1536 meter readings looked like in your data?