---
title: "Nearest Neighbors"
author: "Prof. Eric A. Suess"
date: "1/31/2024"
format:
  revealjs:
    self-contained: true
---

## Introduction

We will begin discussing **Classification** using **Nearest Neighbors**.

According to the author, nearest neighbor classifiers classify unlabeled observations/examples by assigning them the class of the most similar labeled observations/examples.

## k-NN algorithm

- The **training dataset** is made up of observations/examples that are classified into several categories, labeled by a nominal variable.
- The **test dataset** contains unlabeled observations/examples.
- k-NN identifies the $k$ records in the training data that are the **"nearest" in similarity** to each test observation/example.
- Each unlabeled test observation/example is assigned to the majority class among its $k$ nearest neighbors.

## Distance

Distance is calculated in the feature space.

- **Euclidean distance**
- **Manhattan distance**

In a dataset with $n$ variables/features, the **Euclidean distance** between observations/examples $p$ and $q$ is computed as follows:

$dist(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + ... + (p_n - q_n)^2}$

## Distance Example

Distance between rows.

> aa <- c(1,1)
> bb <- c(2,2)
> X <- rbind(aa,bb)
> X

```{r}
aa <- c(1,1)
bb <- c(2,2)
X <- rbind(aa,bb)
X
```

## Distance Example

Using the distance function in R.

> dist(X)

```{r}
dist(X)
```

Direct calculation.

> sqrt(sum((aa-bb)^2))

```{r}
sqrt(sum((aa-bb)^2))
```

## Choosing k

The balance between *overfitting* and *underfitting* the *training data* is a problem known as the **bias-variance tradeoff**.

Mean Squared Error

$MSE(\hat{\theta}) = Var(\hat{\theta}) + Bias^2(\hat{\theta})$

$E[(\hat{\theta} - \theta)^2] = E[(\hat{\theta} - E[\hat{\theta}])^2] + (E[\hat{\theta}] - \theta)^2$

## Choosing k

If $k$ is very large, nearly every training observation/example is represented in the *final vote*, so the most common training class always has a majority of the votes. The model would always predict the majority class. **High bias?**

If $k$ is small, potentially a single nearest neighbor determines the *final vote*, so noise may influence the prediction. **High variance?**

The best $k$ value is somewhere in between.

## Preparing the data

- **min-max normalization**

$X_{new} = \frac{X - min(X)}{max(X) - min(X)}$

- **z-score normalization**

$X_{new} = \frac{X - \mu}{\sigma}$

- **dummy coding** for nominal variables/features

## Why is the k-NN algorithm lazy?

Because no abstraction occurs. There is no model, so the method is considered to be a **non-parametric** learning method.

## Example

Diagnosing *breast cancer* with the k-NN algorithm.

Using R...

- loading the data
- reading the data into R
- transforming the data
- training data
- testing data
- training the model
- evaluating the model
- improving the model

The closing slides sketch these steps in R on a stand-in dataset.
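
## Sketch: normalizing features in R

A minimal sketch, not taken from the book's example, of the two rescalings defined earlier; `minmax` is a helper name chosen here, and base R's `scale()` gives the z-score version.

```{r}
#| echo: true
# min-max normalization: rescale a feature to the [0, 1] interval
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

x <- c(2, 5, 9, 14)
minmax(x)

# z-score normalization: subtract the mean, divide by the standard deviation
(x - mean(x)) / sd(x)   # equivalent to as.numeric(scale(x))
```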
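
## Sketch: training and evaluating k-NN in R

The book's example uses the breast cancer data; as a stand-in, this sketch trains `class::knn()` on the built-in `iris` data. The dataset choice, split, and variable names here are illustrative assumptions, not the book's code.

```{r}
#| echo: true
library(class)                                 # provides the knn() classifier

set.seed(1)
idx <- sample(nrow(iris), 100)                 # random split: 100 train, 50 test

train_x <- scale(iris[idx, 1:4])               # z-score normalize training features
test_x  <- scale(iris[-idx, 1:4],              # reuse the training means and sds
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

pred <- knn(train = train_x, test = test_x,
            cl = iris$Species[idx], k = 5)     # majority vote among 5 nearest neighbors
table(predicted = pred, actual = iris$Species[-idx])   # confusion matrix
```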
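
## Sketch: comparing values of k

Continuing the previous sketch, one rough way to see the bias-variance tradeoff is to compare test-set accuracy over a few candidate values of $k$; this is an illustration, not the book's tuning procedure.

```{r}
#| echo: true
ks <- c(1, 3, 5, 11, 21, 51)                   # candidate neighborhood sizes
acc <- sapply(ks, function(k) {
  p <- knn(train_x, test_x, cl = iris$Species[idx], k = k)
  mean(p == iris$Species[-idx])                # proportion classified correctly
})
data.frame(k = ks, accuracy = acc)
```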