2024-02-28
Today we will discuss Clustering.
Clustering algorithms are used for unsupervised learning.
There are no labels in the dataset.
Clustering produces labels for similar groups in the data.
Unlabeled examples are given a cluster label that is inferred entirely from the relationships within the data.
Clustering is unsupervised classification.
Clustering produces "new data".
A problem with clustering is that the class labels produced do not have meaning.
Clustering will tell you which groups of examples are closely related but it is up to you to apply meaning to the labels.
If we begin with unlabeled data, we can use clustering to create class labels.
From there, we could apply a supervised learner such as decision trees to find the most important predictors of these classes.
The k-means algorithm is perhaps the most commonly used clustering method.
See the CRAN Task View: Cluster Analysis & Finite Mixture Models for a list of all the packages R has related to Clustering and beyond.
k-means is not kNN.
kNN is a supervised classifier, while k-means is unsupervised; the main thing they share is that you need to specify a k.
The goal is to minimize the differences within each cluster and to maximize the differences between clusters.
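One common way to make this goal precise is the within-cluster sum of squares. The notes do not state a formula, so the notation here is an assumption: \(C_j\) is the set of examples in cluster j and \(\mu_j\) is its center.

```latex
\[
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
\]
```

For a fixed dataset the total sum of squares is constant and splits into a within-cluster part and a between-cluster part, so making the within-cluster part small makes the between-cluster part large.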
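The algorithm:
Choose k initial cluster centers, for example k examples picked at random.
Assign each example to its nearest cluster center.
Move each center to the mean of the examples assigned to it.
Repeat the assignment and update steps until the assignments stop changing.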
When using k-means it is a good idea to run the algorithm more than once to check the robustness of your findings.
As with kNN, k-means treats feature values as coordinates in a multidimensional feature space.
Euclidean distance is used:
\(dist(x,y) = \sqrt{\sum_{i=1}^n (x_i-y_i)^2}\), where n is the number of features.
Using this distance function, we find the distance between each example and each cluster center.
The example is then assigned to the nearest cluster center.
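The distance and assignment steps described above can be sketched in plain Python. This is a minimal illustration, not the course's implementation. The function names, the random choice of starting centers, the handling of empty clusters, and the toy data are all assumptions made for the example. The `best_of` helper reflects the advice to run the algorithm more than once: it keeps the run with the lowest within-cluster sum of squares.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def nearest_center(point, centers):
    """Index of the cluster center closest to point."""
    return min(range(len(centers)), key=lambda j: euclidean(point, centers[j]))


def kmeans(points, k, seed=0, max_iter=100):
    """One k-means run; returns (centers, labels, within-cluster sum of squares)."""
    rng = random.Random(seed)
    # Assumption: start from k examples picked at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest_center(p, centers) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(euclidean(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return centers, labels, wcss


def best_of(points, k, runs=10):
    """Repeat k-means with different seeds and keep the lowest-WCSS run."""
    return min((kmeans(points, k, seed=s) for s in range(runs)), key=lambda r: r[2])


# Made-up toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels, wcss = best_of(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; only which examples share a number carries information.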
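Because we are again using a distance measure, we need numeric features, and they should be rescaled (normalized or standardized) so that no single feature dominates the distance.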
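Distance-based methods are sensitive to feature scale, so one common preparation step is min-max rescaling. The following is a minimal Python sketch under that assumption. The notes do not say which rescaling the course uses, and the function name `min_max_scale` and the toy rows are made up for illustration.

```python
def min_max_scale(rows):
    """Rescale each feature (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant feature has no spread to rescale, so map it to 0.0.
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Made-up toy data: age in years, income in dollars.
rows = [(20, 30000), (30, 50000), (40, 90000)]
scaled = min_max_scale(rows)
print(scaled)
```

Without rescaling, the income column would dominate every distance simply because its numbers are larger.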
We need to choose the number of clusters k carefully: too many clusters will over-fit the data.
A rule of thumb is to set k equal to \(\sqrt{n/2}\), where n is the number of examples in the dataset.
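As a quick worked example of this rule, here is a small Python helper. The name `rule_of_thumb_k` is hypothetical, and rounding to the nearest whole number (with a minimum of 1) is an assumption, since k must be an integer.

```python
import math


def rule_of_thumb_k(n):
    """Suggested number of clusters for n examples: sqrt(n / 2), rounded."""
    return max(1, round(math.sqrt(n / 2)))


print(rule_of_thumb_k(200))   # sqrt(200 / 2) = sqrt(100) = 10
print(rule_of_thumb_k(1000))  # sqrt(500) is about 22.36, which rounds to 22
```

The suggested k grows with the size of the dataset, so for large n it is a starting point rather than a final answer.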
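Or use the elbow method: plot the total within-cluster sum of squares against k.
Pick k at the elbow, the point where adding more clusters stops giving a large improvement.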
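The curve that the elbow method inspects can be computed by running k-means for several values of k and recording the total within-cluster sum of squares. The following is a minimal Python sketch on made-up toy data. The deterministic "farthest first" choice of starting centers is an assumption made so that the example is reproducible; it is not something the notes prescribe.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic start (an assumption): take the first point, then keep
    adding the point that is farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def wcss(points, k, max_iter=100):
    """Run k-means and return the total within-cluster sum of squares."""
    centers = farthest_first(points, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:  # centers stopped moving
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Made-up toy data: three tight groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (1, 9), (2, 8)]
curve = {k: wcss(points, k) for k in range(1, 6)}
for k, total in curve.items():
    print(k, round(total, 2))
```

On this toy data the total falls steeply up to k = 3 and changes little afterwards, so the elbow is at k = 3, matching the three groups that were built into the data.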
There are many algorithms that can be used to cluster data.
See the bigmemory and biganalytics packages in R for k-means on very big data using parallel processing.
The author gives an example of clustering teens using social media data.