March 3, 2021


Today we will discuss Clustering.


Clustering algorithms are used for unsupervised learning.

There are no labels in the dataset.

Clustering produces labels for similar groups in the data.

Applications of Clustering

  • segmenting customers
  • identifying patterns that fall outside of known clusters
  • simplifying larger datasets
  • visualizing data


Unlabeled examples are given cluster labels that are inferred entirely from the relationships within the data.

Clustering is unsupervised classification

Clustering produces "new data".


A problem with clustering is that the class labels produced do not have meaning.

Clustering will tell you which groups of examples are closely related, but it is up to you to assign meaning to the labels.

Semi-Supervised Learning

If we begin with unlabeled data, we can use clustering to create class labels.

From there, we could apply a supervised learner such as decision trees to find the most important predictors of these classes.
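This pipeline can be sketched in R. The example below uses the built-in iris data with its Species column ignored, so the data are effectively unlabeled; the choice of 3 clusters and the use of rpart for the decision tree are illustrative assumptions, not the author's specific example.

```r
# Semi-supervised sketch: cluster unlabeled data, then use a supervised
# learner to find the most important predictors of the cluster labels.
library(rpart)  # decision trees; rpart ships with R as a recommended package

features <- iris[, 1:4]  # pretend the data are unlabeled

# Step 1: create class labels with clustering (3 clusters is an assumption).
set.seed(42)
km <- kmeans(scale(features), centers = 3, nstart = 25)
features$cluster <- factor(km$cluster)

# Step 2: fit a decision tree to the new labels and inspect which
# predictors best separate the clusters.
tree <- rpart(cluster ~ ., data = features)
print(tree$variable.importance)
```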

k-means algorithm


k-means is not kNN

The only similarity is that both require you to specify a value for k.

The goal is to minimize the differences within each cluster and to maximize the differences between clusters.

k-means algorithm

The algorithm:

  • starts with k randomly selected centers (centroids).
  • assigns each example to its nearest center, forming an initial set of k clusters.
  • updates the assignments by adjusting the cluster boundaries according to the examples that currently fall into each cluster.
  • repeats the updating and assigning until the changes no longer improve the cluster fit.

When using k-means, it is a good idea to run the algorithm more than once to check the robustness of your findings.

Using distance

As with kNN, k-means treats feature values as coordinates in a multidimensional feature space.

Euclidean distance is used

\(dist(x,y) =\sqrt{\sum_{i=1}^n (x_i-y_i)^2}\)

Using this distance function, we find the distance between each example and each cluster center.

The example is then assigned to the nearest cluster center.
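The assignment step can be written directly from the distance formula above, using only base R. The two centers and the example point are made-up values for illustration.

```r
# Euclidean distance between two numeric vectors.
euclid <- function(x, y) sqrt(sum((x - y)^2))

centers <- rbind(c(0, 0), c(5, 5))  # two cluster centers (illustrative)
example <- c(1, 1)                  # one unlabeled example

# Distance from the example to each center, then pick the nearest.
d <- apply(centers, 1, euclid, y = example)
nearest <- which.min(d)  # (1,1) is closer to (0,0), so cluster 1
```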

Using distance

Because we are again using a distance measure, we need

  • numeric features
  • to normalize the features
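One common way to normalize in R is z-score standardization with scale(), which centers each numeric column and divides by its standard deviation; the small matrix below is illustrative.

```r
# Two columns on very different scales.
x <- matrix(c(1, 2, 3, 100, 200, 300), ncol = 2)

# scale() centers each column at 0 and rescales to unit standard deviation,
# so no single feature dominates the distance calculation.
z <- scale(x)
colMeans(z)      # each column now has mean 0
apply(z, 2, sd)  # and standard deviation 1
```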

Choosing the appropriate number of clusters

We need to choose the number of clusters k carefully, taking care not to over-fit the data.

A rule of thumb is to set k equal to \(\sqrt{n/2}\), where n is the number of examples.

Or use the elbow method

  • homogeneity within clusters is expected to increase as additional clusters are added.
  • heterogeneity within clusters will likewise decrease with more clusters.

Pick k at the elbow.
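A minimal elbow plot in R: run kmeans() over a range of k and plot the total within-cluster sum of squares, then pick the k where the curve bends. The three simulated blobs below are illustrative.

```r
# Three well-separated blobs of 50 points each (n = 150).
set.seed(7)
x <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
           matrix(rnorm(100, mean = 5),  ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))

sqrt(nrow(x) / 2)  # rule-of-thumb starting point for k

# Total within-cluster SS for k = 1..8; it always falls as k grows,
# but the improvement flattens past the "elbow".
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")
```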

Other Clustering methods

There are many algorithms that can be used to cluster data:

  • k-means (kmeans)
  • model-based clustering (Mclust)
  • hierarchical clustering (hclust)
  • density-based clustering (dbscan)
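Of the alternatives above, hierarchical clustering is available in base R; Mclust and dbscan come from the mclust and dbscan packages. A short hclust() sketch on the iris features (an illustrative choice of data):

```r
# Hierarchical clustering: build a distance matrix, grow the tree,
# then cut it into a chosen number of groups.
d  <- dist(scale(iris[, 1:4]))       # Euclidean distances on normalized features
hc <- hclust(d, method = "complete") # complete-linkage tree
groups <- cutree(hc, k = 3)          # cut the tree into 3 clusters
table(groups)                        # cluster sizes
```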

Big Data and Parallel Processing

See the bigmemory and biganalytics packages in R for k-means on very big data using parallel processing.

See Microsoft's version of R, MRO (Microsoft R Open), and the rxKmeans function in the RevoScaleR package.


The author gives an example of clustering teens using social media data.