2024-01-31
Breast cancer screening allows the disease to be diagnosed and treated prior to it causing noticeable symptoms. The process of early detection involves examining the breast tissue for abnormal lumps and masses.
If a lump is found, cells are extracted and examined to determine if the mass is malignant or benign.
If machine learning could automate the identification of cancerous cells, that would be beneficial. It would speed up the process.
There might be less human subjectivity.
The UCI Machine Learning Repository is a where many datasets are posted that can be used for machine learning practice.
http://archive.ics.uci.edu/
Breast Cancer Wisconsin Diagnostic dataset
Measurements from digitized images of fine-needle aspirate of a breast mass. The values represent characteristics of the cell nuclei present in the digital image.
569 observations/examples/instances
32 variables/features/attributes
10 different characteristics measured.
Diagnosis M = malignant (cancer), B = benign (no cancer)
ID variable should be excluded. A perfect predictor. Over fitting.
Target usually needs to be a factor in R.
Since k-NN is dependent upon the measurement scale of the input features, normalization is used to transform the data to a common scale.
In R
function( )
lapply( )
as.data.frame( )
How well does the learner perform on a dataset of unlabeled data?
Next 100 measurements in the lab.
Or we can simulate this scenario.
Note the data is already in random order.
For the k-NN algorithm, the training phase involves no model fitting – the process of training a lazy learner like k-NN involves storing the input data in a structured format.
knn( ) function in the class package is used.
A test instance is classified by taking a “vote” among the k-Nearest Neighbors – specifically, this involves assigning the class of the majority of the k neighbors. A tie is broken at random.
The function knn( ) returns a factor vector of predicted classes for each row in the test data frame.
Evaluate how well the predicted classes match with the known values.
CrossTable( ) in the gmodels packages is used.
Prediction involves balancing between FP and FN
The z-score does not restrict between 0 and 1.
The z-score allows for more weight to be given to larger distances.
In this example, the results are not as good.
\(k\) of 1 is too small, increases FP
k-NN does not do any learning.
It simply stores the training data verbatim.
Unlabeled test observations/examples/instances/records are then matched to the most similar observations in the training set using a distance function, and then unlabeled examples are assigned to the label of its neighbors.