In this post, I will walk you through an application of the k-nearest neighbors algorithm to diagnose breast cancer.
In order to follow along, you will need R installed along with a couple of packages:

- `class`, which contains the `knn` function used in this post
- `gmodels`, which provides the `CrossTable` function we will use to evaluate our results
As a reminder, you can install a package in R with the following command:
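For example, installing the `class` package looks like this:

```r
install.packages("class")
```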
We will also need a dataset on which to apply the k-nearest neighbors algorithm. I chose the popular breast cancer dataset, which is publicly available and comes from the UCI Machine Learning Repository.
You can find the code for this post on GitHub.
Without doing any R programming, you can learn a lot from the dataset by reading the document presenting it.
For example, you can learn the names of the different features and what they represent, the total number of examples and how they are split between benign and malignant, etc.
We learn that there are actually 10 numeric features measured for each cell nucleus, and that for each of these features the mean, the standard error, and the largest (or “worst”) values are stored in the dataset.
Another useful nugget of information in this document is that there are no missing attribute values in the dataset, so we won’t have to do much data transformation, which is great.
One thing you will notice when downloading the dataset is that there are no header lines in the CSV file.
Personally, I don’t like naming features in R and since the number of features (32) is manageable, I added the names of the features directly in the CSV file.
Consequently, the first line becomes:
The modified dataset can be found on GitHub.
We are now ready to load the data into R:
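A minimal sketch, assuming the modified file is saved as `wdbc.csv` in the working directory (the file name and the `wdbc` variable name are my choices):

```r
# Load the dataset; the first line of the file now contains the feature names
wdbc <- read.csv("wdbc.csv", stringsAsFactors = FALSE)

# 569 examples, 32 columns (id, diagnosis and 30 numeric features)
str(wdbc)
```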
As you may have noticed from the dataset or the names of the features, there is an id column in the dataset, and since it doesn’t contain any information useful for classification, it should be removed.
We will also need to transform our diagnosis feature, which contains “B” or “M” depending on whether the tumor was benign or malignant, into a factor to be able to use the `knn` function later.
Another thing we will need to do is to randomly permute our examples in order to remove any ordering which might already be present in the dataset. This is important because we will later split our dataset into a training set and a test set: if the dataset were ordered by diagnosis, for example, the test set would be filled with only benign or only malignant tumors, which the algorithm would never have been confronted with during the training phase.
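Those three preparation steps could be sketched as follows (the seed value is an arbitrary choice of mine, only there to make the permutation reproducible):

```r
# Remove the id column (first column), which carries no predictive information
wdbc <- wdbc[-1]

# Turn the diagnosis into a factor so knn can use it as a class label
wdbc$diagnosis <- factor(wdbc$diagnosis, levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))

# Randomly permute the rows to remove any pre-existing ordering
set.seed(1234)
wdbc <- wdbc[sample(nrow(wdbc)), ]
```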
If you have a look at the range of the different mean features with:
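Assuming the ten “mean” features sit in columns 2 to 11 once the id column has been removed:

```r
# Minimum and maximum of each of the ten "mean" features
lapply(wdbc[2:11], range)
```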
You can see that some features have ranges 1,000 times larger than others. This is a problem because the kNN algorithm relies on distance measurements, so it will give more importance to features with larger values.
In order to prevent our algorithm from giving too much importance to a few features, we will need to normalize them. One way to do so is to standardize each feature so that it has zero mean and unit variance. You can check that the mean of every feature is now zero:
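R’s built-in `scale` function standardizes each column to zero mean and unit variance, which is one way to perform this normalization (the column indices assume the id column has been removed):

```r
# Standardize the 30 numeric features (all columns except diagnosis)
wdbc[2:31] <- scale(wdbc[2:31])

# Each feature should now have a mean of (numerically) zero
colMeans(wdbc[2:31])
```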
We will have to split our dataset in order to train our algorithm (training set) and then evaluate its performance (test set).
A good rule of thumb is to take 75% of the dataset for the training and leave the rest as a test set. We learned from the dataset description that there were 569 examples in our dataset, so we’ll take the first 427 examples as our training set and the last 142 as our test set.
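The split could look like this (the variable names are my choices; column 1 is the diagnosis, which we keep out of the feature matrices):

```r
# First 427 examples for training, last 142 for testing
wdbcTrain <- wdbc[1:427, 2:31]
wdbcTest  <- wdbc[428:569, 2:31]
```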
We will also need the diagnosis feature as a separate factor in order to use the `knn` function later in this post.
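Following the same split (the `wdbcTrainLabels` name is my choice, mirroring the `wdbcTestLabels` variable used later in this post):

```r
# Diagnosis labels matching the training/test split above
wdbcTrainLabels <- wdbc$diagnosis[1:427]
wdbcTestLabels  <- wdbc$diagnosis[428:569]
```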
We are now ready to use the `knn` function contained in the `class` package. The function takes four arguments:
- `train`, which is our training set
- `test`, the test set
- `cl`, a factor containing the classes of the training set
- `k`, the number of nearest neighbors to consider in order to predict the class of a new example
This function will give us back a factor containing the predicted class for each example in the test set. Since we have already stored the actual classes of our test set in the `wdbcTestLabels` variable, we will be able to compare both and evaluate the performance of the algorithm.
A good place to start when choosing k is to pick the square root of the number of examples in the training set. Here we have 427 training examples, and the square root of 427 is approximately 21. Keep in mind that you are not stuck with this value: it is often a good idea to fiddle around with k to see if you can get better results, as we will do later in this post.
We are now ready to call the `knn` function:
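A sketch of the call, using the variable names introduced above:

```r
library(class)

# Predict the class of each test example from its 21 nearest neighbors
wdbcPred <- knn(train = wdbcTrain, test = wdbcTest,
                cl = wdbcTrainLabels, k = 21)
```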
You can have a pretty good idea of how your model is performing by computing the percentage of right predictions the algorithm made:
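One way to do this, which matches the `cbind` discussed just below (the `results` name is mine):

```r
# Put predicted and actual classes side by side
results <- cbind(wdbcPred, wdbcTestLabels)

# Percentage of correct predictions
sum(results[, 1] == results[, 2]) / nrow(results) * 100
```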
You will notice that we lose the factor labels when we perform the `cbind` and are left with 1 for benign and 2 for malignant. Here, it doesn’t really matter for the computation of our percentage of right predictions.
I personally get around 96% of right predictions. This result depends on how “lucky” you were with your random permutation. If there are atypical examples contained only in the test set, your model didn’t learn from them and, consequently, they might have been misclassified.
Another way to evaluate the performance of your algorithm is to create a `CrossTable` to check how we misclassified our examples:
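For example, using the `CrossTable` function from the `gmodels` package (disabling the chi-square contributions is my choice, to keep the output readable):

```r
library(gmodels)

# Rows: actual diagnosis, columns: predicted diagnosis
CrossTable(x = wdbcTestLabels, y = wdbcPred, prop.chisq = FALSE)
```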
This is the table I get with my particular permutation; as previously mentioned, your results may differ:
We can see that we predicted 6 examples as benign although they were malignant (false negatives).
Additionally, you can compute other indicators on how your algorithm is performing:
For example, you can measure the precision, which answers the question: of all examples where we predicted the tumor was malignant, what fraction is actually malignant?
Here we see that we have perfect precision since we didn’t have any false positives.
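As a numeric sketch, using the counts from my run (the test set contained 48 malignant examples, 42 of which were predicted as malignant, and no benign example was predicted as malignant):

```r
truePos  <- 42  # malignant tumors correctly predicted as malignant
falsePos <- 0   # benign tumors wrongly predicted as malignant

precision <- truePos / (truePos + falsePos)
precision  # 1
```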
We can also compute the recall:
The recall answers the question: of all examples where the tumor was malignant, what fraction did we correctly detect as being malignant? That’s where our algorithm is not performing as well: we only detected 87.5% of malignant tumors as being malignant, which leaves us with 12.5% of malignant tumors classified as benign.
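With the same counts from my run (42 malignant tumors detected out of 48, so 6 false negatives):

```r
truePos  <- 42  # malignant tumors correctly detected
falseNeg <- 6   # malignant tumors missed (predicted as benign)

recall <- truePos / (truePos + falseNeg)
recall  # 0.875
```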
From the values of both the precision and the recall, we can compute the F1 score, which measures our test’s accuracy and can be thought of as a weighted average of precision and recall:
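With the precision and recall values from my run, the F1 score works out to:

```r
precision <- 1
recall    <- 0.875

# Harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
f1  # approximately 0.933
```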
This score discourages the strategy of classifying every test example as malignant just to avoid false negatives (predicting a tumor as benign although it is malignant): in that case our precision, and thus the F1 score, would greatly decrease, since we would have 94 false positives.
There are several ways you can improve the algorithm’s performance, one of which is to toy around with the value of the parameter k. For example, one could try a few odd numbers around 21 and compute the percentage of misclassified examples as well as the number of false positives and false negatives. It’s important to choose an odd k to avoid ties, where the algorithm has to pick randomly between malignant and benign because each class received the same number of votes. For example, if we chose k = 20 and 10 of the nearest neighbors of an example were malignant while the 10 others were benign, the algorithm would choose randomly between both.
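A possible sketch of this search, reusing the variables from earlier (the range of k values is an arbitrary choice of mine, and the best k will depend on your particular permutation):

```r
library(class)

# Try a few odd values of k around 21 and report the error rate
for (k in seq(15, 27, by = 2)) {
  pred <- knn(train = wdbcTrain, test = wdbcTest,
              cl = wdbcTrainLabels, k = k)
  errRate <- mean(pred != wdbcTestLabels) * 100
  cat("k =", k, "-> error rate:", round(errRate, 2), "%\n")
}
```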
I hope this was an interesting introduction to the k-nearest neighbors algorithm. Do not hesitate to email me if you’re having difficulties running or understanding the code.