Techniques that reduce the number of input features in a dataset while preserving as much meaningful information as possible, making the data easier to visualize and faster to process, and making models trained on it less prone to overfitting.
In Depth
Modern datasets often contain hundreds or thousands of features — pixels in an image, words in a document, gene expressions in a genomic study. Working with so many dimensions creates challenges known as the 'curse of dimensionality': distances between points become less meaningful, models require exponentially more data to generalize, and computation becomes expensive. Dimensionality reduction addresses these problems by projecting data into a lower-dimensional space while retaining the most important patterns and relationships.
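The claim that distances become less meaningful can be checked with a small experiment. The sketch below is illustrative only: the helper name `relative_contrast`, the uniform random data, and the sample sizes are assumptions chosen for the demonstration. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """Return (max_dist - min_dist) / min_dist from one query point
    to n_points uniform random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)

# Compare the contrast between nearest and farthest neighbors as dimensions grow.
contrasts = {d: relative_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

On data like this, the contrast shrinks sharply as the number of dimensions grows: the nearest and farthest neighbors end up at nearly the same distance, which is why nearest-neighbor reasoning degrades in high dimensions.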
Principal Component Analysis (PCA) is the most widely used linear dimensionality reduction technique. It identifies the orthogonal directions (principal components) along which the data varies most and projects the data onto the top few, discarding the components that contribute little variance. For data with non-linear structure, techniques like t-SNE (t-Distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) create low-dimensional representations that preserve local neighborhood structure, which makes them well suited to visualizing clusters in high-dimensional data. They do not reliably preserve global distances, so the spacing between clusters and their apparent sizes in the plot should be interpreted with caution.
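A minimal sketch of PCA follows, assuming NumPy is available and computing the components through a singular value decomposition of the centered data. The function name `pca` and the synthetic dataset are assumptions made for illustration; library implementations add options and numerical safeguards that are omitted here.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    # Center each feature so the components describe variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Squared singular values are proportional to the variance per component.
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    projected = X_centered @ components.T
    return projected, components, ratio

# Synthetic example: 5 observed features driven by 2 underlying factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

Z, components, ratio = pca(X, n_components=2)
print(Z.shape)                          # shape of the reduced data
print(f"variance retained: {ratio.sum():.4f}")
```

Because the synthetic data has only two real degrees of freedom, two components retain almost all of the variance. On real data, the retained fraction is a common guide for choosing how many components to keep.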
Dimensionality reduction serves two related but distinct purposes: feature preprocessing and data visualization. As a preprocessing step, it can improve model performance by removing noise and redundant features, reduce training time, and mitigate overfitting. For visualization, it lets humans see patterns in data that exists in hundreds of dimensions by projecting it into 2D or 3D. Beyond PCA, t-SNE, and UMAP, autoencoders (neural networks that learn to compress data into a small code and then reconstruct it) provide a powerful non-linear alternative that can capture complex, hierarchical relationships.
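The compress-and-reconstruct idea behind autoencoders can be sketched in a few lines of NumPy. This is a deliberately tiny illustration, not a practical implementation: the toy dataset, the single `tanh` encoder layer, the linear decoder, the learning rate, and all variable names are assumptions. Real autoencoders are usually deeper and built with a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): 5 features driven non-linearly by 2 hidden factors.
latent = rng.uniform(-1, 1, size=(256, 2))
X = np.column_stack([
    latent[:, 0],
    latent[:, 1],
    latent[:, 0] * latent[:, 1],
    np.sin(2 * latent[:, 0]),
    latent[:, 1] ** 2,
])

n_features, n_code = X.shape[1], 2
W_enc = 0.1 * rng.normal(size=(n_features, n_code))
b_enc = np.zeros(n_code)
W_dec = 0.1 * rng.normal(size=(n_code, n_features))
b_dec = np.zeros(n_features)

def encode(X):
    """Compress each row of X into a 2-D code."""
    return np.tanh(X @ W_enc + b_enc)

def decode(code):
    """Reconstruct the original features from the code."""
    return code @ W_dec + b_dec

learning_rate = 0.5
losses = []
for step in range(2000):
    code = encode(X)
    X_hat = decode(code)
    err = X_hat - X
    losses.append(float(np.mean(err ** 2)))  # mean squared reconstruction error

    # Backpropagate the reconstruction error through decoder, then encoder.
    grad_out = 2 * err / err.size
    grad_W_dec = code.T @ grad_out
    grad_b_dec = grad_out.sum(axis=0)
    grad_code = grad_out @ W_dec.T
    grad_pre = grad_code * (1 - code ** 2)  # derivative of tanh
    grad_W_enc = X.T @ grad_pre
    grad_b_enc = grad_pre.sum(axis=0)

    # Plain full-batch gradient descent update.
    W_enc -= learning_rate * grad_W_enc
    b_enc -= learning_rate * grad_b_enc
    W_dec -= learning_rate * grad_W_dec
    b_dec -= learning_rate * grad_b_dec

print(f"reconstruction error: {losses[0]:.4f} -> {losses[-1]:.4f}")
print(encode(X).shape)  # the reduced representation
```

After training, `encode(X)` plays the role of the reduced dataset: it can be plotted directly or passed to a downstream model in place of the original features.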
Dimensionality reduction compresses high-dimensional data into fewer meaningful features — fighting the curse of dimensionality and enabling both better models and human-interpretable visualization.