An unsupervised learning technique that groups data points into clusters based on similarity — without any predefined labels — so that points within a cluster are more alike than points in different clusters.
In Depth
Clustering is one of the most widely used unsupervised learning techniques. Given a dataset with no predefined labels, a clustering algorithm discovers natural groupings by analyzing how similar data points are to each other. The algorithm assigns each point to a cluster such that intra-cluster similarity is high and inter-cluster similarity is low. Unlike classification, where categories are known in advance, clustering reveals structure that was previously hidden in the data.
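As a minimal sketch of this idea (assuming scikit-learn; the synthetic data from make_blobs and the choice of 3 clusters are illustrative, not something the algorithm is told about the data's true structure):

```python
# Minimal clustering sketch: group unlabeled points by similarity.
# Assumes scikit-learn; make_blobs and n_clusters=3 are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 unlabeled 2-D points drawn from three hidden groups
# (the true labels are discarded -- clustering never sees them).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K-Means assigns each point to one of 3 clusters with no labels provided.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Each point now carries a cluster ID (0, 1, or 2).
print(np.bincount(labels))  # number of points assigned to each cluster
```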
The most popular clustering algorithms differ fundamentally in how they define and find clusters. K-Means assigns each point to the nearest of K centroids and iterates until convergence — it is fast and simple but assumes spherical, equally-sized clusters. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies dense regions and can find clusters of arbitrary shape while automatically detecting outliers. Hierarchical Clustering builds a tree of nested cluster merges or splits, producing a dendrogram that reveals structure at multiple scales. Gaussian Mixture Models fit a mixture of Gaussian distributions, assigning each point a probability of belonging to every cluster rather than a single hard label (soft membership).
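These shape assumptions matter in practice. The sketch below (again assuming scikit-learn; the eps and min_samples values are illustrative settings, not tuned) contrasts K-Means and DBSCAN on the classic two-moons dataset, where the clusters are dense but decidedly not spherical:

```python
# Contrast K-Means and DBSCAN on non-spherical clusters.
# Assumes scikit-learn; eps/min_samples are illustrative settings.
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: dense clusters of arbitrary shape.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # -1 marks outliers

# Compare each result to the true moons (1.0 = perfect agreement).
print("K-Means ARI:", adjusted_rand_score(y_true, km_labels))  # low: assumes spheres
print("DBSCAN  ARI:", adjusted_rand_score(y_true, db_labels))  # typically near 1.0 here
```

Note that DBSCAN labels stray points as noise (-1) rather than forcing them into a cluster, which is exactly what makes it useful for outlier detection.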
Choosing the right clustering algorithm and number of clusters is both a science and an art. Metrics like the silhouette score, Davies-Bouldin index, and the elbow method help evaluate cluster quality, but domain knowledge is essential for interpreting whether discovered clusters are meaningful. A common pitfall is forcing data into clusters when no natural grouping exists — clustering will always produce groups, but those groups are not always meaningful.
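To make the model-selection step concrete, the sketch below (assuming scikit-learn; the K range of 2 to 7 is an arbitrary search window) sweeps candidate cluster counts and scores each with the silhouette coefficient, one of the metrics mentioned above:

```python
# Pick K by sweeping candidates and scoring cluster quality.
# Assumes scikit-learn; the range 2..7 is an arbitrary search window.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=7)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means tighter,
    # better-separated clusters.
    print(f"K={k}  silhouette={silhouette_score(X, labels):.3f}")
# On this synthetic data the score should peak near the true K (4),
# but on real data a clear peak is not guaranteed, and domain
# knowledge must arbitrate whether the clusters are meaningful.
```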
Clustering reveals hidden structure in unlabeled data by grouping similar points together — a foundational technique for customer segmentation, anomaly detection, and exploratory data analysis.