2024-01-31
We will begin discussing Classification using Nearest Neighbors.
According to the author, a nearest neighbors classifier assigns an unlabeled observation/example the class of the most similar labeled observations/examples.
Distance is calculated in the feature space.
Euclidean distance
In a data set with \(n\) variables/features, the Euclidean distance between observations/examples \(p\) and \(q\) is computed as follows:
\(dist(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + ... + (p_n - q_n)^2}\)
Distance between rows.
aa <- c(1,1)
bb <- c(2,2)
X <- rbind(aa,bb)
X
[,1] [,2]
aa 1 1
bb 2 2
Using the dist() function in R.
dist(X)
aa
bb 1.414214
Direct calculation.
sqrt(sum((aa-bb)^2))
[1] 1.414214
The balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff.
Mean Squared Error
\(MSE(\hat{\theta}) = Var(\hat{\theta}) + Bias^2(\hat{\theta})\)
\(E[(\hat{\theta} - \theta)^2] = E[(\hat{\theta} - E[\hat{\theta}])^2] + E[(E[\hat{\theta}] - \theta)^2]\)
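A minimal simulation sketch (not from the text; the estimator and the numbers are purely illustrative) to check the decomposition numerically: a deliberately shrunken sample mean of a normal sample serves as \(\hat{\theta}\), and the MSE over repeated samples is compared with the variance plus squared bias.
set.seed(1)
theta <- 5
# 10,000 repeated samples; the estimator 0.9 * mean(...) is intentionally biased
estimates <- replicate(10000, 0.9 * mean(rnorm(20, mean = theta, sd = 2)))
mse  <- mean((estimates - theta)^2)
v    <- mean((estimates - mean(estimates))^2)  # variance of the estimator
bias <- mean(estimates) - theta
c(MSE = mse, Var_plus_Bias2 = v + bias^2)      # the two values should nearly match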
If \(k\) is very large, nearly every training observation/example is represented in the final vote, so the most common training class always holds a majority of the votes and the model would always predict the majority class. High bias?
If \(k\) is small, a single nearest neighbor can determine the final vote, so noise may dominate the prediction. High variance?
The best \(k\) value is somewhere in between.
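A small sketch of the effect of \(k\), assuming the class package (shipped with R) and using the built-in iris data purely as an illustration; it compares held-out accuracy for a few candidate values of \(k\).
library(class)                      # provides knn()
set.seed(1)
idx   <- sample(nrow(iris), 100)    # 100 training rows, 50 held out
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
cl    <- iris$Species[idx]
# Held-out accuracy for a few candidate k values
sapply(c(1, 5, 15, 45), function(k) mean(knn(train, test, cl, k = k) == iris$Species[-idx]))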
min-max normalization
\(X_{new} = \frac{X - min(X)}{max(X) - min(X)}\)
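A one-line helper implementing the formula above (the function name is just illustrative).
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
normalize(c(1, 2, 3, 4, 5))
[1] 0.00 0.25 0.50 0.75 1.00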
z-score normalization
\(X_{new} = \frac{X - \mu}{\sigma}\)
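The direct calculation and the built-in scale() function (note that scale() uses the sample standard deviation, with an \(n-1\) denominator).
x <- c(1, 2, 3, 4, 5)
(x - mean(x)) / sd(x)    # direct z-score calculation
as.vector(scale(x))      # built-in equivalent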
dummy coding for nominal variables/features
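A brief sketch of dummy coding with model.matrix(); the variable and its levels are made up for illustration.
color <- factor(c("red", "green", "blue", "green"))
model.matrix(~ color)       # intercept plus (k - 1) dummy columns
model.matrix(~ color - 1)   # one indicator column per level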
Because no abstraction occurs, there is no model; the method is therefore considered a non-parametric learning method.
Diagnosing breast cancer with the k-NN algorithm.
Using R…