Week 6: Navigating through a series of plots
April 9, 2024
This week was dedicated to tackling an error in my code, particularly concerning my water meter data, which is stored as a 3D array of shape (10132, 1536, 7). Here 10132 is the number of inputs, and each input has size 1536×7: 1536 is the number of water meter readings per input, and 7 is the number of components per reading, the time fields plus the actual meter value: "month, day, year, hour, minute, second, VALUE". As mentioned in an earlier post, I padded the values to fit this size. The hurdle arose during k-means clustering, which required Principal Component Analysis (PCA) to reduce the non-2D data for visualization.
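For context, here is a minimal sketch of the reshaping step involved; the array is a placeholder and the variable names are my own for illustration, but the shapes match the data described above:

import numpy as np

# Hypothetical placeholder for the padded water meter data:
# 10132 inputs, each with 1536 readings of 7 components
# ("month, day, year, hour, minute, second, VALUE").
data = np.zeros((10132, 1536, 7))

# k-means expects a 2D array of shape (n_samples, n_features),
# so each 1536x7 input is flattened into one long feature vector.
flat = data.reshape(data.shape[0], -1)
print(flat.shape)  # (10132, 10752)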
For those not familiar with PCA, it’s a technique used in data analysis and machine learning to simplify complex datasets by reducing their dimensionality. Essentially, it finds the directions (or principal components) that capture the most variance in the data and projects the data onto these new axes. This can be particularly useful for visualizing high-dimensional data in a lower-dimensional space.
Working on a MacBook with an M1 processor posed a challenge: the sklearn library, commonly used for this task, could not be easily installed. At the time, some of sklearn's compiled dependencies were not fully compatible with the M1's ARM-based architecture.
To address this obstacle, my initial strategy was to implement PCA manually. However, I encountered discrepancies in results across different methods, which left me uncertain about the accuracy of the visual output.
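For reference, a manual PCA can be written in a few lines of NumPy by centering the data and projecting it onto its top singular directions. This is a generic sketch of the technique, not my exact implementation:

import numpy as np

def pca_manual(X, n_components=2):
    """Project X (n_samples, n_features) onto its top principal components."""
    # Center each feature at zero mean.
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions,
    # ordered by how much variance they capture.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Project the centered data onto the first n_components directions.
    return X_centered @ Vt[:n_components].T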
Seeking a reliable solution, I transitioned to Google Colab, which is an online platform that provides free access to computing resources, including GPUs and TPUs, along with pre-installed libraries like sklearn. By leveraging sklearn’s PCA and k-means functions in this environment, I was able to overcome the limitations imposed by my local setup.
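A minimal sketch of this pipeline in sklearn looks like the following; the cluster count and the random placeholder data are assumptions for illustration, not my actual settings:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Placeholder for the flattened data: 10132 samples, 10752 features each.
flat = np.random.rand(10132, 10752)

# Cluster in the original feature space; n_clusters=4 is an arbitrary example.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(flat)

# Reduce to 2 principal components purely for visualization.
coords = PCA(n_components=2).fit_transform(flat)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()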
Surprisingly, this endeavor produced yet another perplexing plot that was very obviously not accurate (image depicted below). While it would be lovely to have a plot this clean, I know my data is not this perfect; no data is. Digging deeper into the issue, I realized the underlying problem: the dimensionality of the data was too high. Each 1536×7 input flattens to 10752 features, and reducing all of them to just 2 principal components discards most of the variance in the data. That extreme compression is what produced the obscure, inaccurate visuals.
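One way to confirm this kind of problem is to check how much variance the first two components actually retain; a small sum here means a 2D plot discards most of the data's structure. The variable name flat carries over from the sketches above:

from sklearn.decomposition import PCA

pca = PCA(n_components=2).fit(flat)
# Fraction of total variance captured by the 2 plotted components.
print(pca.explained_variance_ratio_.sum())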
Consequently, I’ve opted to forgo the visual representation of the k-means model and instead focus on individually analyzing each data point within its respective cluster in the upcoming week. This shift in approach will allow for a more granular examination of the data and may uncover insights that were obscured by the challenges encountered thus far.
Comments
Cindy Z. says
It was nice to learn about why you decided to use Google Colab to run PCA! What made you realize that your M1 chip was not compatible? Was it an error you ran into downloading it, or was it from researching its compatibility?