Evaluation Metrics

For supervised learning algorithms

Binary Classification

1.   Introduction

Let's imagine a judge in a court making tough choices about who is innocent and who is guilty, based on his beliefs. He hears multiple cases a day in which he needs to make such decisions, and in every case there are, as usually happens, two lawyers: one representing the defendant and one accusing the defendant. The judge decides purely on his impression of each case, which is shaped by the lawyers, that is, by what both lawyers say and what evidence they hold against each other. For some cases, the judge does not know for sure whether his decision is 100% accurate or whether he has made no mistakes during the day, but he obviously tries his best to send guilty people to prison and set innocent people free. Still, sometimes he sends innocent people to prison because he believes they are guilty, although they are not. Similarly, sometimes the judge sets people free believing they are innocent, although they are actually guilty.

 

The judge spent the entire day deciding who was innocent and who was guilty and heard a total of 50 defendants. Let's assume that you are now all-knowing and you know who was actually guilty and who was actually innocent. Let us draw a table that outlines the judge's choices and how good or "accurate" they were.

                                Actual Class
                                Guilty        Innocent
Predicted     Send to prison      18              7
Class         Set free             6             19

In the table, the Predicted Class is what the judge did with a defendant - sent him to prison or set him free. The Actual Class is what was actually true: whether the person was really guilty or innocent.

You now see that there are 18 people whom the judge sent to prison, given that they were indeed guilty. Similarly, there are 19 people the judge set free, given that these people were indeed innocent. However, there are 7 people the judge sent to prison although these people were actually innocent, and 6 people that were set free by the judge although, in reality, they were guilty. So, the judge correctly classified 37 people (18 + 19), and incorrectly classified only 13 people (7 + 6).

 

This table is called a Confusion Matrix - a table that describes the performance of a classification model (or "classifier") on a set of test data for which the true values are known. We can now calculate the accuracy of this "classifier", the judge, using the following formula:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

It turns out that the judge's accuracy is:

$$\text{Accuracy} = \frac{18 + 19}{18 + 19 + 7 + 6} = \frac{37}{50} = 0.74$$

The judge is right in 74% of the cases. That is a relatively low score for someone managing people's lives. We will get back to Accuracy later in this article. Let's now introduce some key terminology for the Confusion Matrix.
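For readers who like to see this in code, here is a minimal sketch (assuming Python with NumPy and scikit-learn installed; the 0/1 arrays below are a hypothetical encoding of the judge's 50 cases) that reproduces the table and the 74% accuracy:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical 0/1 encoding of the judge's 50 cases:
# 1 = guilty / sent to prison, 0 = innocent / set free.
y_actual  = np.array([1] * 18 + [0] * 19 + [0] * 7 + [1] * 6)   # what was actually true
y_decided = np.array([1] * 18 + [0] * 19 + [1] * 7 + [0] * 6)   # what the judge decided

# Rows = actual class (guilty, innocent), columns = the judge's decision (prison, free).
print(confusion_matrix(y_actual, y_decided, labels=[1, 0]))
# [[18  6]
#  [ 7 19]]
print(accuracy_score(y_actual, y_decided))   # 0.74
```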

2.   Confusion Matrix Terms & Accuracy

Now let's go over the terminology of the Confusion Matrix. We have four terms we need to outline:

  • True Positive

  • False Negative

  • True Negative

  • False Positive

 

"True" or "False" is related to if the prediction was made correctly or not. "Positive" or "Negative" is related to if the prediction was classified to "positive" or "negative" class. Going back to our example with the judge, you now see that:

 

  • 18 people whom the judge sent to prison, given that they were actually guilty are called true positives  - "true" because they were correctly classified, and "positive" because they were classified to a positive class ("guilty"). 

  • 19 people the judge set free, given that these people were actually innocent are called true negatives  - "true" because they were correctly classified, and "negative" because they were classified to a negative class ("not guilty"). 

  • 7 people the judge sent to prison although these people were actually innocent are called false positives  - "false" because they were incorrectly classified, and "positive" because they were classified to a positive class ("guilty") by the judge. 

  • 6 people that were set free by the judge although, in reality, they were guilty are called false negatives - "false" because they were incorrectly classified, and "negative" because they were classified to a negative class ("not guilty") by the judge.

 

                                Actual Class
                                Guilty               Innocent
Predicted     Send to prison    True Positives       False Positives
Class         Set free          False Negatives      True Negatives

Again, we know that correctly classified points are True Positives and True Negatives. Thus, the formula for Accuracy score that we calculated above is:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

False Negatives and False Positives have an opposite effect on each other: if we want to minimize false positives, false negatives will be maximized, and vice versa. Let's explain this relationship using the judge's example. Say the judge now wants to avoid sending innocent people to prison at all costs (that is, to minimize false positives). Therefore, if there is even a slight chance, say 1%, that the person is innocent, the judge will set him free, even if the other 99% screams that the person is guilty. Thus, our false positive rate will significantly decrease. But at the same time, the chances of the judge setting a guilty man free become much higher. Thus, our false negative rate will significantly increase. In the end, while minimizing false positives, we maximize false negatives.

 

What if the judge instead wants to avoid, at all costs, having guilty people walking the streets of his city (that is, to minimize false negatives)? Then he is going to send to prison anyone who has even a small chance of being guilty. In this case, there will be a lot of people sitting in prisons, and along with that number, the number of innocent people sent to prison will also increase. In the end, while minimizing false negatives, we maximize false positives.
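As a small illustration of this trade-off, the sketch below (assuming Python with NumPy and scikit-learn; the "guilt scores" are synthetic and purely hypothetical) moves a decision threshold and prints how false positives and false negatives move in opposite directions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)

# Hypothetical "guilt scores": guilty defendants tend to score higher, but the groups overlap.
y_true = np.array([1] * 100 + [0] * 100)
scores = np.concatenate([rng.normal(0.65, 0.15, 100),   # actually guilty
                         rng.normal(0.35, 0.15, 100)])  # actually innocent

for threshold in (0.2, 0.5, 0.8):                 # a lenient, a neutral, and a strict judge
    y_pred = (scores >= threshold).astype(int)    # "send to prison" if score >= threshold
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
# As the threshold rises, FP falls while FN climbs - exactly the trade-off described above.
```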

 

You may have noticed that false positives and false negatives are the errors our classifier makes. These errors also have their own names in statistics: Type I and Type II errors.

 

  • A Type I error is a false positive: it occurs when we reject the null hypothesis (i.e. accept the alternative hypothesis) even though the null hypothesis is actually true. 

  • A Type II error is a false negative: it occurs when we accept (fail to reject) the null hypothesis even though the null hypothesis is actually false. 

 

If you would like to revise Type I and Type II errors, read our explanation in Inferential Statistics section "Categorical Data".

3.   False Positives vs False Negatives

We can now deduce that we are interested in minimizing either false negatives or false positives. Let's provide two examples, one where we favor minimizing each of them:

  1. Judge Case: You are a judge and you decide if a person is guilty or innocent, based on your impression. There is a chance that you can make a wrong decision that a person is guilty and send him to prison, although he was actually innocent (false positive). On the contrary, there is a chance that you can make a wrong decision that a person is innocent and set him free, although he was guilty (false negative). Since, by common sense, it is much more important to avoid sending an innocent person to prison rather than setting a guilty person free, we are interested in minimizing false positives.

  2. Cancer Case: You are a medic and you decide if a person is sick or healthy, based on his test results. There is a chance that you can make a wrong decision that a person is sick and send him to re-test, although he was actually healthy (false positive). On the contrary, there is a chance that you can make a wrong decision that a person is healthy and send him home, although he was sick (false negative). Since, by common sense, it is much more important to avoid sending a sick person home rather than making a healthy person re-take a test, we are interested in minimizing false negatives.

 

Let's draw a confusion matrix for each of the two cases:

Judge Case

                                Actual Class
                                Guilty               Innocent
Predicted     Send to prison    True Positives       False Positives
Class         Set free          False Negatives      True Negatives

Cancer Case

                                Actual Class
                                Sick                 Healthy
Predicted     Send to re-test   True Positives       False Positives
Class         Send home         False Negatives      True Negatives

We always want to have as many True Positives and True Negatives as possible. Thus, if we have a case such as the Judge case, where it is important to minimize False Positives, we will use Precision as an evaluation technique. On the contrary, if we have a case such as the Cancer case, where it is important to minimize False Negatives, we will use Recall as an evaluation technique. Let's explain these techniques in more detail.

4.   Precision & Recall

Whichever error we want to minimize, the corresponding metric always has True Positives in the numerator and the sum of True Positives and either False Positives or False Negatives in the denominator. Our goal is to maximize this ratio (fraction). Let's show what we mean:

Judge Case

                                Actual Class
                                Guilty               Innocent
Predicted     Send to prison    True Positives       False Positives
Class         Set free          False Negatives      True Negatives

Cancer Case

                                Actual Class
                                Sick                 Healthy
Predicted     Send to re-test   True Positives       False Positives
Class         Send home         False Negatives      True Negatives

Precision

Out of all points that were predicted as positive ("send to prison"), how many of them were actually positive ("guilty")?

$$\text{Precision} = \frac{TP}{TP + FP}$$

Our goal is to maximize this fraction, i.e. we would like the numerator and the denominator to be as similar as possible so that the result gets close to 1. This occurs when we have a very small value of False Positives. In other words, by maximizing our Precision, we minimize our False Positives. Therefore, when we are interested in decreasing this error, we use Precision as an evaluation technique for our binary classifier, and the higher the value of the Precision score, the better.

 

Let's assume we have Precision = 0.93. That means that when our classifier predicts a point as positive (e.g. predicts a person as guilty), it is correct 93% of the time. We can also say that, given a positive prediction, 93% is the probability that a Type I error does not occur.

Recall

Out of all points that were actually positive ("sick"), how many of them were correctly predicted as positive ("send to re-test")?

$$\text{Recall} = \frac{TP}{TP + FN}$$

Our goal is to maximize this fraction, i.e. we would like the numerator and the denominator to be as similar as possible so that the result gets close to 1. This occurs when we have a very small value of False Negatives. In other words, by maximizing our Recall, we minimize our False Negatives. Therefore, when we are interested in decreasing this error, we use Recall as an evaluation technique for our binary classifier, and the higher the value of the Recall score, the better.

 

Let's assume we have Recall = 0.96. That means that, out of all points that were actually positive, our classifier correctly predicts the positive class (e.g. predicts a person as sick) 96% of the time. We can also say that 96% is the probability that a Type II error does not occur.
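As a quick sketch (assuming Python with scikit-learn, and reusing the hypothetical judge arrays from the earlier example), both scores can be computed directly:

```python
from sklearn.metrics import precision_score, recall_score

# The judge's 50 decisions again: 1 = guilty / prison, 0 = innocent / free.
y_true = [1] * 18 + [0] * 19 + [0] * 7 + [1] * 6
y_pred = [1] * 18 + [0] * 19 + [1] * 7 + [0] * 6

precision = precision_score(y_true, y_pred)   # TP / (TP + FP) = 18 / (18 + 7) = 0.72
recall    = recall_score(y_true, y_pred)      # TP / (TP + FN) = 18 / (18 + 6) = 0.75
print(precision, recall)
```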

Recall is also known as:

  • Sensitivity

  • The true positive rate

  • Probability of detection

5.   F1-score

In many problems, we prefer giving more weight either to precision or to recall. However, in some cases we also want to seek a balance between the two. This is where the F1-score comes in.

The F1-score is often introduced as the harmonic mean of precision and recall (or positive predictive value and sensitivity), where

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

However, we might also use the F0.5-score or the F2-score, depending on which errors we seek to minimize, as the table below shows:

Metric        Emphasis
Precision     minimizing False Positives (Type I), at the cost of more False Negatives (Type II)
F 0.5         weights Precision more than Recall
F 1           balances Precision and Recall equally
F 2           weights Recall more than Precision
Recall        minimizing False Negatives (Type II), at the cost of more False Positives (Type I)
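A minimal sketch of these scores (assuming Python with scikit-learn; the hypothetical judge arrays from the earlier examples are reused) could look like this:

```python
from sklearn.metrics import f1_score, fbeta_score

# Same hypothetical judge data as in the earlier sketches.
y_true = [1] * 18 + [0] * 19 + [0] * 7 + [1] * 6
y_pred = [1] * 18 + [0] * 19 + [1] * 7 + [0] * 6

print(f1_score(y_true, y_pred))                 # balances precision (0.72) and recall (0.75)
print(fbeta_score(y_true, y_pred, beta=0.5))    # beta < 1: weighs precision more
print(fbeta_score(y_true, y_pred, beta=2))      # beta > 1: weighs recall more
```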

Thus, technically, we can use the formula to calculate any F-β:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
6.   Accuracy vs F-measures

6.1.   Accuracy Pitfalls

We have talked about the idea of accuracy before, but have not yet pinned down what we mean by it. It is intuitively easy, of course: we mean the proportion of correct results that a classifier achieves. If, on a data set, a classifier correctly guesses the label of half of the examples, we say its accuracy is 50%. It seems obvious that the better the accuracy, the better and more useful a classifier is. But is that so?

 

Let’s delve into the possible classification cases.

  • True Positives (TP): number of positive examples, labeled as such.

  • False Positives (FP): number of negative examples, labeled as positive.

  • True Negatives (TN): number of negative examples, labeled as such.

  • False Negatives (FN): number of positive examples, labeled as negative.

 

Now, let’s say we have a classifier trained to do spam filtering, and we got the following results:

                                Actual Class
                                Spam          Not spam
Predicted     Send to trash       10              25
Class         Keep in             15             100

In this case, accuracy = (10 + 100)/(10 + 100 + 25 + 15) = 73.3%. We may be tempted to think our classifier is pretty decent since it classified about 73% of all messages correctly. However, look what happens when we switch it for a "dumb" classifier that always says "not spam" for every e-mail:

                                Actual Class
                                Spam          Not spam
Predicted     Send to trash        0               0
Class         Keep in             25             125

Because the 10 spam e-mails that were previously sent to trash are now kept in the mailbox, they become false negatives and are added to the 15 spam e-mails we were already keeping, giving 25 false negatives in total.

We get accuracy = (0 + 125)/(0 + 125 + 0 + 25) = 83.3%. This looks crazy. We changed our model to a completely useless one, with exactly zero predictive power, and yet, we got an increase in accuracy.

 

 

This is called the accuracy paradox. When TP < FP, then accuracy will always increase when we change a classification rule to always output “negative” category. Conversely, when TN < FN, the same will happen when we change our rule to always output “positive”.
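The two accuracy figures above are easy to verify with a short sketch (assuming Python with scikit-learn; the arrays simply reproduce the counts from the two tables):

```python
from sklearn.metrics import accuracy_score

# 1 = spam, 0 = not spam; counts taken from the two tables above (25 spam, 125 non-spam).
y_true = [1] * 25 + [0] * 125

# Original classifier: catches 10 of the 25 spam e-mails, wrongly trashes 25 legitimate ones.
y_pred = [1] * 10 + [0] * 15 + [1] * 25 + [0] * 100
print(accuracy_score(y_true, y_pred))   # 0.733...

# "Dumb" classifier that always answers "not spam".
y_dumb = [0] * 150
print(accuracy_score(y_true, y_dumb))   # 0.833...
```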

The accuracy paradox can also be seen through a simpler example. Imagine we have no predictors at all and just flip an unfair coin with class probabilities (0.6, 0.4). Accuracy is maximized if we classify everything as the first class and completely ignore the 40% probability that any outcome might belong to the second class.

 

Let's see another example. Imagine we have 1,000,000 bank transactions and we would like to detect the fraudulent ones. Let's assume that our classifier predicted 990,000 of them correctly, which gives us a good 99% accuracy (990,000/1,000,000). However, if we look at the 10,000 incorrectly classified transactions, that is still a lot. Consequently, accuracy has its own pitfalls, especially when it comes to imbalanced data (see our section on "Imbalanced Datasets & how to deal with it").

 

6.2.   Difference between Accuracy & F1-score

We have previously seen that accuracy can be largely driven by a large number of True Negatives, which in most business circumstances we do not focus on much, whereas False Negatives and False Positives usually have business costs (tangible and intangible). Thus the F1-score might be a better measure to use if we need to seek a balance between Precision and Recall AND there is an uneven class distribution (a large number of actual negatives). Nevertheless, note that while Accuracy is pulled towards the majority class, Precision, Recall, and F1-score all focus on the positive class and ignore True Negatives entirely.
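To see the difference in practice, here is a sketch (assuming Python with scikit-learn, and reusing the spam counts from the tables above) in which the "dumb" classifier wins on accuracy but collapses to an F1-score of 0:

```python
from sklearn.metrics import accuracy_score, f1_score

# Same spam data as above: 1 = spam (positive class), 0 = not spam.
y_true  = [1] * 25 + [0] * 125
y_model = [1] * 10 + [0] * 15 + [1] * 25 + [0] * 100   # the original classifier
y_dumb  = [0] * 150                                     # always predicts "not spam"

print(accuracy_score(y_true, y_model), f1_score(y_true, y_model))                   # ~0.73, F1 ~ 0.33
print(accuracy_score(y_true, y_dumb),  f1_score(y_true, y_dumb, zero_division=0))   # ~0.83, F1 = 0.0
```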

There are many other metrics for evaluating binary classification systems, and plots (e.g. ROC) are very helpful too. The point to be made is that you should not take any of them in an isolated way: there is no best way to evaluate any system, but different metrics give us different (and valuable) insights into how a classification model performs.

7.   Precision-Recall & ROC Curve

Two diagnostic tools that help in the interpretation of probabilistic forecasts for binary (two-class) classification problems are the ROC curve and the Precision-Recall curve.

7.1. Precision-Recall Curve

A precision-recall curve is a plot of the precision (y-axis) and the recall (x-axis) for different thresholds, much like the ROC curve. 

 

 

 

 

 

 

The red line (the "no skill" baseline) is defined by the total number of positive cases divided by the total number of positive and negative cases. For a dataset with an equal number of positive and negative cases, this is a horizontal line at 0.5. Points above this line show skill.

A model with perfect skill is depicted as a point at (1.0, 1.0). An ideal model is represented by a curve that bows towards (1.0, 1.0), above the flat red line of no skill.

 

[Figure: Precision-Recall curve - Recall on the x-axis, Precision on the y-axis, with the flat red "no skill" baseline.]
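A precision-recall curve can be computed, for example, with scikit-learn's precision_recall_curve; the sketch below (assuming Python with NumPy and scikit-learn, and purely synthetic labels and probabilities) also prints the no-skill baseline:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

rng = np.random.default_rng(1)

# Hypothetical labels and predicted probabilities for an imbalanced problem (20% positives).
y_true = np.array([1] * 40 + [0] * 160)
y_prob = np.concatenate([rng.beta(4, 2, 40),    # positives: probabilities skewed towards 1
                         rng.beta(2, 4, 160)])  # negatives: probabilities skewed towards 0

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
print(auc(recall, precision))   # area under the precision-recall curve
print(y_true.mean())            # the "no skill" baseline: the share of positive cases (0.2)
```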

7.2.   Receiver Operating Characteristic Curve (ROC)

A useful tool when predicting the probability of a binary outcome is the Receiver Operating Characteristic curve or ROC curve. It is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis) for a number of different candidate threshold values between 0.0 and 1.0.

[Figure: ROC curves (W, X, Y, Z) - False Positive Rate (1 - Specificity) on the x-axis, True Positive Rate (Sensitivity) on the y-axis.]

The area under the ROC curve (AUC) is a measure of how well a parameter can distinguish between two diagnostic groups (guilty/innocent or sick/healthy).

 

Imagine our model is represented by the Z curve. The area under the Z curve is 0.5, which means our classifier can correctly distinguish the two classes 50% of the time. Let's go further and imagine we now have the Y curve. The AUC for Y is 0.65, which means our classifier can correctly distinguish the two classes 65% of the time.

Similarly, if our classifier has the X curve, it correctly distinguishes the two classes 80% of the time, as its AUC is 0.8.

 

Thus, our ideal curve is W, where the classifier correctly distinguishes the classes 100% of the time. To understand this logic, let's explain the true positive rate and the false positive rate in more detail.

The true positive rate (also called Recall, or Sensitivity) is calculated as:

$$TPR = \frac{TP}{TP + FN}$$

It describes how good the model is at predicting the positive class when the actual outcome is positive.

 

 

The false positive rate is calculated as:

$$FPR = \frac{FP}{FP + TN}$$

It is also called the false alarm rate as it summarizes how often a positive class is predicted when the actual outcome is negative. The false positive rate is also the inverted specificity (FPR = 1 - Specificity), where specificity is:

$$\text{Specificity} = \frac{TN}{TN + FP}$$

Put another way, the ROC curve plots the false alarm rate versus the hit rate. Now it becomes clear why the top left corner, where the false positive rate is zero and the true positive rate is 1, represents the ideal point. Thus, with the ROC curve we try to maximize the true positive rate while minimizing the false positive rate.

 

The probabilistic interpretation of the ROC-AUC score is that if you randomly choose a positive case and a negative case, the probability that the positive case outranks the negative case according to the classifier is given by the AUC. Here, the ranking is determined by the classifier's predicted values.
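This interpretation is easy to check numerically. The sketch below (assuming Python with NumPy and scikit-learn, and purely synthetic labels and scores) compares roc_auc_score with the fraction of positive-negative pairs in which the positive case is ranked higher:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)

# Hypothetical labels and classifier scores.
y_true = np.array([1] * 50 + [0] * 150)
scores = np.concatenate([rng.normal(0.7, 0.2, 50),    # positives tend to score higher
                         rng.normal(0.4, 0.2, 150)])  # negatives tend to score lower

auc = roc_auc_score(y_true, scores)

# Probabilistic interpretation: P(random positive is ranked above a random negative).
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairs = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()
print(auc, pairs)   # the two numbers agree
```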

 

8.   Logarithmic Loss (Log-loss)

Sometimes, the output of a classifier is the probability of a class label instead of the label itself.

For example, in logistic regression, the output can be the probability of customer churn, i.e. the probability that the label equals 1.

 

This probability is a value between 0 and 1. Logarithmic loss (also known as log-loss) measures the performance of a classifier when the predicted output is a probability value between 0 and 1 (e.g. Logistic Regression case). 

 

Log-loss quantifies the accuracy of a classifier by penalizing false classifications.

So, for example, if a classifier predicts a probability of 0.13 (meaning it leans towards class 0) when the actual label is 1, that is bad and results in a high log-loss. Similarly, if a classifier predicts a probability of 0.93 (meaning it leans towards class 1) when the actual label is indeed 1, that is good and results in a low log-loss.

Thus, log-loss increases as the predicted probability diverges from the actual label, and decreases as the predicted probability gets closer to the actual label. The goal of our machine learning models is to minimize the log-loss value. A perfect model would have a log-loss of 0.

 

To calculate the log-loss of a binary classifier, we first compute the log-loss for each row using the log-loss equation, which measures how far each prediction is from the actual label. Then, we average the log-loss across all rows of the test set.

    1. Calculating the log-loss for each row using the log-loss equation:

$$-\big(y \cdot \log(\hat{p}) + (1 - y) \cdot \log(1 - \hat{p})\big)$$

where $y$ is the actual label (0 or 1) and $\hat{p}$ is the predicted probability of class 1. You may have noticed that, because the actual label is either 0 or 1, the non-relevant part of the equation drops out, depending on the actual class. It is also easy to see that when there is a considerable difference between the actual class and the predicted probability, the log-loss is high, while a small difference gives a small log-loss.

    2. Calculating the average log-loss across all rows of the test set:

$$\text{LogLoss} = \frac{1}{N}\sum_{i=1}^{N} -\big(y_i \cdot \log(\hat{p}_i) + (1 - y_i) \cdot \log(1 - \hat{p}_i)\big)$$

In our example, the resulting log-loss is 0.258. Raw log-loss values are hard to interpret, but log-loss is still a good metric for comparing models: for any given problem, a lower log-loss value means better predictions.

Thus, putting everything together, our formula for calculating log-loss is the following:

$$\text{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \cdot \log(\hat{p}_i) + (1 - y_i) \cdot \log(1 - \hat{p}_i)\big]$$
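As a sanity check, the formula can be implemented directly and compared against scikit-learn's log_loss (a sketch assuming Python with NumPy and scikit-learn; the labels and churn probabilities below are hypothetical):

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted churn probabilities.
y_true = np.array([1, 1, 0, 0, 1, 0])
y_prob = np.array([0.93, 0.75, 0.20, 0.13, 0.55, 0.40])

# Per-row log-loss: -(y * log(p) + (1 - y) * log(1 - p))
row_loss = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(row_loss.round(3))   # confident & correct rows are near 0, confident & wrong rows blow up

# Average over all rows - this matches what sklearn's log_loss returns.
print(row_loss.mean(), log_loss(y_true, y_prob))
```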

The beautiful thing about this formula is that it is intimately tied to information theory: log-loss is the cross entropy between the distribution of the true labels and the predictions, and it is very closely related to what’s known as the relative entropy, or Kullback–Leibler divergence. Entropy measures the unpredictability of something. Cross entropy incorporates the entropy of the true distribution, plus the extra unpredictability when one assumes a different distribution than the true distribution. So log-loss is an information-theoretic measure to gauge the “extra noise” that comes from using a predictor as opposed to the true labels. By minimizing the cross entropy, we maximize the accuracy of the classifier.

For a true observation of Churn = 1, the range of possible loss values behaves as follows: as the predicted probability approaches 1, the log-loss slowly decreases; as the predicted probability drops towards 0, the log-loss increases rapidly. Thus, log-loss penalizes both types of errors, but especially predictions that are confident and wrong!

For a true observation of Churn = 0, the picture is mirrored: as the predicted probability approaches 0, the log-loss slowly decreases; as the predicted probability rises towards 1, the log-loss increases rapidly. Again, log-loss penalizes both types of errors, but especially predictions that are confident and wrong!