Week 6: Dimensionality Reduction Approaches For AutoEncoder Representations
Welcome to Week 6 of my Senior Project Blog! This week, I will explore dimensionality reduction for the layer weights of Denoising Autoencoder (DAE) models that were built in our lab for Chromosome-22.
As we saw in prior weeks, most of the 256 DAE models built in our lab for the tile fragments of Chromosome-22 did not have a bottleneck layer or hourglass architecture. They were most likely sparse autoencoders, in which the number of nodes in the hidden layers stays the same across the encoder and decoder segments. So I wanted to examine the weights of the last encoder layer across various models and see whether dimensionality reduction approaches could yield a more compact representation.
Dimensionality reduction refers to the process of reducing the number of attributes in a dataset while retaining as much of the variation in the original dataset as possible. Its benefits include shorter training time, lower computational cost, and often better performance of machine learning algorithms. In particular, when a model has many features (inputs), the data points tend to lie far apart from one another in the high-dimensional space, which makes it a challenge to train models on such data. This is referred to as the curse of dimensionality, and dimensionality reduction can help mitigate it in some scenarios. It can also help avoid overfitting when the data has many features, and it is very useful for visualizing data in 2 or 3 dimensions. Dimensionality reduction techniques can also address multicollinearity, where inputs are highly correlated with one another, by combining them into a set of uncorrelated variables. Factor analysis, itself a form of dimensionality reduction, identifies latent (unmeasured) variables called factors that can be inferred from other variables in the dataset.
In particular, for my DAE-based imputation, I was curious whether dimensionality reduction could help remove noise from the data, identify the most important features, and drop redundant ones in order to improve model accuracy and performance.
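To make the curse of dimensionality a little more concrete, here is a small sketch in plain Python. It draws random points in a unit cube and measures how far apart they are as the number of dimensions grows. The function name distance_stats, the number of points, and the dimensions tried are my own illustrative choices, not something from our lab's pipeline.

```python
import math
import random


def distance_stats(dim, n_points=60, seed=0):
    """Return (mean distance, relative spread) for random points in [0, 1]^dim.

    Relative spread is (max - min) / mean over all pairwise distances; a small
    value means every point is roughly equally far from every other point.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]

    distances = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    mean = sum(distances) / len(distances)
    return mean, (max(distances) - min(distances)) / mean


# Compare a low-, medium-, and high-dimensional cube (illustrative sizes).
stats = {dim: distance_stats(dim) for dim in (2, 20, 200)}
for dim, (mean, spread) in stats.items():
    print(f"dim={dim:4d}  mean distance={mean:6.2f}  relative spread={spread:5.2f}")
```

As the dimension goes up, the average distance between points grows while the relative spread shrinks, so "near" and "far" neighbors become harder to tell apart. That is one reason distance-based methods struggle on data with many features.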
Some of the most commonly used dimensionality reduction methods include linear methods like Principal Component Analysis (PCA) and methods usually grouped as non-linear, such as t-distributed Stochastic Neighbor Embedding (t-SNE), Multidimensional Scaling (MDS), and Uniform Manifold Approximation and Projection (UMAP). PCA transforms a set of correlated input variables into a smaller number of uncorrelated variables called principal components while retaining as much of the variation in the original dataset as possible. When the data has non-linear structure, as is common in real-world applications, linear methods like PCA often do not perform well for dimensionality reduction. MDS reduces the dimensionality of the data while trying to preserve the distances between observations.
t-SNE begins by converting the high-dimensional Euclidean distances between data points into conditional probabilities that represent similarities, and then minimizes the mismatch (the Kullback-Leibler divergence) between these similarities in the high-dimensional and low-dimensional spaces. It adapts to the underlying data, performing different transformations on different regions. It has a tuneable parameter called “perplexity,” which balances attention between local and global characteristics of the data and roughly amounts to a guess about the number of close neighbors each point has. The perplexity value has a complex effect on the resulting pictures.
For my explorations with the Chromosome-22 DAE models, I used the UMAP technique.
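To see why t-SNE's perplexity behaves like a neighbor count, here is a minimal sketch of the conditional probabilities for a single point. The distance values and the two bandwidths (sigma) are made up for illustration. In real t-SNE the user sets the perplexity and the algorithm searches for the sigma that produces it; this sketch goes the other way around to keep things short.

```python
import math


def conditional_probabilities(distances_sq, sigma):
    """Gaussian similarities p(j|i) from squared distances to the other points."""
    scores = [math.exp(-d2 / (2.0 * sigma ** 2)) for d2 in distances_sq]
    total = sum(scores)
    return [s / total for s in scores]


def perplexity(probabilities):
    """Perplexity 2^H, where H is the Shannon entropy (in bits) of p(j|i)."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0.0)
    return 2.0 ** entropy


# Squared distances from one point i to five other points (made-up values):
# three close neighbors and two distant ones.
distances_sq = [0.5, 1.0, 1.5, 9.0, 16.0]

narrow = conditional_probabilities(distances_sq, sigma=0.5)
wide = conditional_probabilities(distances_sq, sigma=5.0)

print(f"sigma=0.5 -> perplexity {perplexity(narrow):.2f}")
print(f"sigma=5.0 -> perplexity {perplexity(wide):.2f}")
```

With a narrow bandwidth only the closest points receive noticeable probability, so the perplexity is small. With a wide bandwidth the probability is shared almost evenly and the perplexity approaches the total number of other points.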
To understand what UMAP is and how it works, we first need to understand the meaning behind its 4 words: Uniform, Manifold, Approximation, and Projection.
Projection: This means we are reducing a high-dimensional entity onto a plane, a curved surface, or a line by projecting its points.
Approximation: UMAP assumes we only have a finite set of data points, not the entire set that makes up the “manifold”, and approximates the manifold from the data available.
Manifold: This is a topological space that locally resembles Euclidean space near each point. A topological space is, mathematically, a geometrical space in which closeness is defined but cannot necessarily be measured by a numeric distance: a set of points, along with a set of neighborhoods for each point, that satisfies a concept of closeness. Euclidean space is a space in any finite number of dimensions in which points are designated by coordinates and the distance between two points is given by a distance formula. One-dimensional manifolds include lines and circles, while 2-D manifolds include planes, spheres, and tori.
Uniform: UMAP assumes that data samples are uniformly (evenly) distributed across the manifold. Since real data is rarely spread evenly, this implies that the notion of distance varies across the manifold: the space itself is treated as warped (stretched or shrunk) wherever the data appear sparser or denser.
From the above, we can understand that UMAP is a non-linear dimensionality reduction technique that assumes the available data samples are uniformly distributed across a manifold, and that the manifold can be approximated from these samples and projected to a lower-dimensional space!
The UMAP algorithm has 2 major stages:
1. Manifold learning in high-dimensional space: UMAP starts by finding the nearest neighbors of each point using the Nearest-Neighbor-Descent algorithm of Dong et al. It then constructs a graph by connecting each point to its nearest neighbors, while ensuring that the manifold structure we are trying to learn does not leave many points unconnected.
2. Determination of low-dimensional representation of manifold: In the second stage, we construct the low-dimensional representation. Here, distances are taken to be standard Euclidean distances with respect to a global coordinate system, and a minimum-distance setting controls how tightly points may be packed so that they do not sit on top of each other in the lower-dimensional embedding. The algorithm then minimizes a cost function called Cross-Entropy (CE) between the edge weights of the high-dimensional graph and those of the low-dimensional representation, using an iterative stochastic gradient descent process to arrive at the final layout.
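The two stages above can be sketched for a single point in plain Python. This is a simplified illustration based on the formulas in the UMAP paper (McInnes et al.), not the actual umap-learn code. The function names, the example distances, and the placeholder values a = b = 1 are my own assumptions.

```python
import math


def smooth_knn_weights(knn_distances, n_iter=64):
    """Stage 1 for one point: turn sorted k-NN distances into edge weights.

    rho is the distance to the nearest neighbor, so that neighbor always gets
    weight 1 and the point is never left disconnected. sigma is found by
    bisection so that the weights sum to log2(k), as in the UMAP paper.
    """
    rho = knn_distances[0]
    target = math.log2(len(knn_distances))

    def weights(sigma):
        return [math.exp(-max(0.0, d - rho) / sigma) for d in knn_distances]

    lo, hi = 1e-6, 1e6
    for _ in range(n_iter):
        mid = math.sqrt(lo * hi)  # bisect on a log scale
        if sum(weights(mid)) > target:
            hi = mid
        else:
            lo = mid
    return weights(math.sqrt(lo * hi))


def fuzzy_union(w_ij, w_ji):
    """Symmetrize two directed weights: chance that at least one edge exists."""
    return w_ij + w_ji - w_ij * w_ji


def low_dim_weight(distance, a=1.0, b=1.0):
    """Stage 2 edge weight between two embedded points: 1 / (1 + a * d^(2b)).

    In the paper, a and b are fitted from the min_dist setting; a = b = 1 is
    only a placeholder here.
    """
    return 1.0 / (1.0 + a * distance ** (2.0 * b))


def cross_entropy(high_weights, low_weights, eps=1e-12):
    """Fuzzy-set cross-entropy between matching high- and low-dim edge weights."""
    total = 0.0
    for w_h, w_l in zip(high_weights, low_weights):
        w_l = min(max(w_l, eps), 1.0 - eps)  # keep the logarithms finite
        if w_h > 0.0:
            total += w_h * math.log(w_h / w_l)  # attractive term
        if w_h < 1.0:
            total += (1.0 - w_h) * math.log((1.0 - w_h) / (1.0 - w_l))  # repulsive term
    return total


# Distances from one point to its 4 nearest neighbors, sorted (made-up values).
high = smooth_knn_weights([0.2, 0.5, 0.9, 2.0])

# Two candidate layouts, given as low-dimensional distances to the same neighbors.
good_layout = [low_dim_weight(d) for d in (0.05, 0.6, 1.0, 3.0)]
bad_layout = [low_dim_weight(d) for d in (3.0, 1.0, 0.6, 0.05)]

print("high-dimensional weights:", [round(w, 3) for w in high])
print("CE, neighbor order preserved:", round(cross_entropy(high, good_layout), 3))
print("CE, neighbor order reversed: ", round(cross_entropy(high, bad_layout), 3))
```

The layout that keeps close neighbors close gets a lower cross-entropy than the one that reverses them. The full algorithm pushes the embedded points around with stochastic gradient descent to drive this cost down across all edges at once.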
I used the UMAP Python implementation, which is designed to be compatible with scikit-learn (it uses the same API and can be added to sklearn pipelines) and works with standard plotting tools (matplotlib and seaborn) to help us visualize the results. In the following code snippet, we first load the final decoder or encoder layer weights we had saved for the Chromosome-22 DAE models as outlined in Week 5. Then, we run UMAP() on them and plot the resulting 2D representation of the embeddings across all the samples.
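As a rough illustration of what such a routine can look like, here is a minimal sketch. It assumes the layer weights were saved as a NumPy array and treats each row as one point to embed. The helper names and the file name in the usage comment are hypothetical and not taken from our lab's code. The values for n_neighbors and min_dist are simply the umap-learn defaults.

```python
import numpy as np


def as_sample_matrix(weights):
    """Return the weights as a 2-D float array with one row per point to embed."""
    matrix = np.asarray(weights, dtype=float)
    if matrix.ndim < 2:
        raise ValueError("expected a weight array with at least 2 dimensions")
    # Flatten any trailing axes so every row becomes a single feature vector.
    return matrix.reshape(matrix.shape[0], -1)


def embed_2d(weights, n_neighbors=15, min_dist=0.1, random_state=42):
    """Project the rows of a weight matrix to 2-D with UMAP."""
    import umap  # umap-learn package; imported here because it is slow to load

    reducer = umap.UMAP(
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        n_components=2,
        random_state=random_state,
    )
    return reducer.fit_transform(as_sample_matrix(weights))


def plot_embedding(embedding, title):
    """Scatter-plot a 2-D embedding, one dot per row."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 5))
    ax.scatter(embedding[:, 0], embedding[:, 1], s=8)
    ax.set_xlabel("UMAP 1")
    ax.set_ylabel("UMAP 2")
    ax.set_title(title)
    return fig


# Example usage (hypothetical file name):
# weights = np.load("chr22_tile_000_encoder_weights.npy")
# embedding = embed_2d(weights)
# plot_embedding(embedding, "Chromosome-22 tile 0").savefig("tile_000_umap.png")
```

Fixing random_state makes the layout reproducible between runs, which helps when comparing many tile models side by side.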
When I ran the above UMAP routine for the 256 DAE tile models for Chromosome-22, the results were mixed. For some tile models, the 2-D projection after UMAP showed clear clustering, with distinct clusters and discernible patterns. For others, there were no clear clusters, as shown below.
In summary, I concluded that UMAP was not completely successful for the task of dimensionality reduction for our lab’s Chromosome-22 tile DAE models.
Now that I have explored dimensionality reduction with the embedding layers of the DAE models for Chromosome-22 tiles, I will next focus on analyzing the CAD models that are built in our lab.
Thank you for reading, and see you next week!
- Coenen, Andy, and Adam Pearce. “Understanding UMAP.” Google People + AI Research (PAIR), Google Inc., https://pair-code.github.io/understanding-umap/.
- Dong, Wei, et al. “Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures.” Proceedings of the 20th International Conference on World Wide Web, 2011, https://doi.org/10.1145/1963405.1963487.
- McInnes, Leland, et al. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” arXiv, 18 Sept. 2020, https://arxiv.org/abs/1802.03426.
- Pramoditha, Rukshan. “11 Dimensionality Reduction Techniques You Should Know in 2021.” Medium, Towards Data Science, 28 Sept. 2021, https://towardsdatascience.com/11-dimensionality-reduction-techniques-you-should-know-in-2021-dcb9500d388b.
- “Uniform Manifold Approximation and Projection for Dimension Reduction.” UMAP, https://umap-learn.readthedocs.io/en/latest/.