--- title: "Clustering" author: "Prof. Eric A. Suess" date: "February 28, 2024" format: revealjs: self-contained: true --- ## Introduction Today we will discuss Clustering. ## Clustering Clustering algorithms are for Unsupervised Learning . There are **no labels** in the dataset. Clustering produces labels for similar groups in the data. ## Applications of Clustering - segmenting customers - identifying patterns that fall outside of known clusters - simplify larger datasets - useful for data visualization ## Clustering Unlabeled examples are given a cluster label and inferred entirely from the relationships within the data. Clustering is **unsupervised classification** Clustering produces ''new data'' ## Clustering A problem with clustering is that the class labels produced *do not have meaning*. Clustering will tell you which groups of examples are closely related but it is up to you to apply meaning to the labels. ## Semi-Supervised Learning If we begin with unlabeled data, we can use **clustering** to create **class labels**. From there, we could apply a **supervised learner** such as **decision trees** to find the most important predictors of these classes. ## k-means algorithm The **k-means algorithm** is perhaps the most commonly used clustering method. See the [CRAN Task View: Cluster Analysis & Finite Mixture Models](https://cran.r-project.org/web/views/Cluster.html) for a list of all the packages R has related to Clustering and beyond. ## k-means algorithm **k-means** is not **kNN** The only similarity is that you need to specify a **k**. The goal is to **minimize** the differences within each cluster and to **maximize** the differences between clusters. ## k-means algorithm The algorithm: - starts with *k* random selected **centers/centroids**. - assigns examples to an initial set of *k* clusters. - it updates the assignments by adjusting the cluster boundaries according to the examples that fall into the cluster. - the process of updating and assigning occurs several times until making changes no longer improves the cluster fit. When using **k-means** it is a good idea to run the algorithm more than once to check the robustness of your findings. ## Using distance As with kNN, k-means treats feature values as coordinates in a multidimensional feature space. **Euclidean distance** is used $dist(x,y) =\sqrt{\sum_{i=1}^n (x_i-y_i)^2}$ Using this **distance function**, we find the distance between each example and each cluster center. The example is then assigned to the nearest cluster center. ## Using distance Because we are again using a distance measure, we need - **numeric features** - to **normalize** the features ## Choosing the appropriate number of clusters We need to balance the number of clusters **k**, try not to over-fit the data. Rule-of-thumb is to set **k** equal to $\sqrt{n/2}$. Or use the **elbow method** - homogeneity within clusters is expected to increase as additional clusters are added. - heterogeneity will decrease with more clusters. Pick **k** at the elbow. ## Other Clustering methods There are many algorithms that can be used to cluster data: - k-means **kmeans** - Model based clustering **Mclust** - Hierarchical clustering **hclust** - Density based clustering **dbscan** ## Big Data and Parallel Processing See the **bigmemory** and **biganalytics** packages in R for **k-means** on *very big data* using *parallel processing*. 
## Example

The author gives an example of clustering teens using *social media data*.
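
## Sketch: cluster labels, then a decision tree

A sketch of the semi-supervised idea from earlier, in the spirit of the teens example. The `teens` data frame here is a simulated stand-in, not the book's actual data, and `rpart` supplies the decision tree:

```r
library(rpart)

# Simulated stand-in for social media interest counts (not the book's data)
set.seed(2024)
teens <- data.frame(sports   = rpois(500, 2),
                    music    = rpois(500, 3),
                    shopping = rpois(500, 1))

# Step 1: cluster the unlabeled data to create class labels
km <- kmeans(scale(teens), centers = 5, nstart = 25)
teens$cluster <- factor(km$cluster)

# Step 2: apply a supervised learner to the new labels
rpart(cluster ~ ., data = teens)  # splits show which features predict the clusters
```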