# Class

Let's color the Predicted Class in green and the Actual Class in blue. The predicted class is what the judge did with a defendant: sent him to prison or set him free. The actual class is what was actually true: whether the person was guilty or innocent.

You now see that there are 18 people whom the judge sent to prison, given that they were indeed guilty. Similarly, there are 19 people the judge set free, given that these people were indeed innocent. However, there are 7 people the judge sent to prison although these people were actually innocent, and 6 people that were set free by the judge although, in reality, they were guilty. So, the judge correctly classified 37 people (18 + 19), and incorrectly classified only 13 people (7 + 6).

This table is called a Confusion Matrix - a table that describes the performance of a classification model (or "classifier") on a set of test data for which the true values are known. We can now calculate the accuracy of this "classifier", the judge, using the following formula:

Accuracy = (correctly classified points) / (total number of points)

It turns out that the judge's accuracy is:

Accuracy = (18 + 19) / (18 + 19 + 7 + 6) = 37 / 50 = 0.74

The judge is right in 74% of the cases. That is a relatively low score for someone deciding people's fates. We will come back to Accuracy later in this article. Let's now introduce some key terminology for the Confusion Matrix.
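The arithmetic above is easy to check in code. Here is a minimal Python sketch using the judge's counts (the variable names are our own):

```python
# Judge's confusion-matrix counts from the example above.
tp = 18  # guilty people sent to prison
tn = 19  # innocent people set free
fp = 7   # innocent people sent to prison
fn = 6   # guilty people set free

correct = tp + tn          # 37 correctly classified
total = tp + tn + fp + fn  # 50 people in total

accuracy = correct / total
print(f"Accuracy: {accuracy:.0%}")  # → Accuracy: 74%
```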

2.   Confusion Matrix Terms & Accuracy

Now let's go over the terminology of the Confusion Matrix. There are four terms we need to outline:

• True Positive

• False Negative

• True Negative

• False Positive

"True" or "False" refers to whether the prediction was correct. "Positive" or "Negative" refers to whether the point was classified to the "positive" or the "negative" class. Going back to our example with the judge, you can now see that:

• The 18 people whom the judge sent to prison, given that they were actually guilty, are called true positives  - "true" because they were correctly classified, and "positive" because they were classified to the positive class ("guilty").

• The 19 people the judge set free, given that they were actually innocent, are called true negatives  - "true" because they were correctly classified, and "negative" because they were classified to the negative class ("not guilty").

• The 7 people the judge sent to prison although they were actually innocent are called false positives  - "false" because they were incorrectly classified, and "positive" because they were classified to the positive class ("guilty") by the judge.

• The 6 people set free by the judge although, in reality, they were guilty are called false negatives  - "false" because they were incorrectly classified, and "negative" because they were classified to the negative class ("not guilty") by the judge.
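The four cases above can be captured in a tiny helper that labels a single (actual, predicted) pair; the function name is our own sketch, not a standard API:

```python
def outcome(actual_positive: bool, predicted_positive: bool) -> str:
    """Classify one prediction as TP, FP, TN, or FN."""
    if predicted_positive:
        return "TP" if actual_positive else "FP"
    return "FN" if actual_positive else "TN"

# The judge's decisions, expressed as (actually guilty?, sent to prison?):
print(outcome(True, True))    # guilty, imprisoned   → TP
print(outcome(False, True))   # innocent, imprisoned → FP
print(outcome(False, False))  # innocent, set free   → TN
print(outcome(True, False))   # guilty, set free     → FN
```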

Putting the four terms into the judge's confusion matrix, where the positive class is "is not innocent (is guilty)" and the negative class is "is innocent":

| | Actual: is not innocent (is guilty) | Actual: is innocent |
| --- | --- | --- |
| Predicted: guilty (sent to prison) | True Positive (18) | False Positive (7) |
| Predicted: not guilty (set free) | False Negative (6) | True Negative (19) |
Again, we know that the correctly classified points are True Positives and True Negatives. Thus, the formula for the Accuracy score that we calculated above is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

False Negatives and False Positives have opposite effects on each other: if we want to minimize false positives, false negatives will be maximized, and vice versa. Let's explain this relationship using the judge's example. Say the judge now wants to avoid sending innocent people to prison at all costs (that is, to minimize false positives). Therefore, if there is even a slight chance, say 1%, that the person is innocent, the judge will set him free, even if the other 99% screams that the person is guilty. Thus, our false positive rate will significantly decrease. But at the same time, the chances of the judge setting a guilty man free become much higher. Thus, our false negative rate will significantly increase. In the end, while minimizing false positives, we maximize false negatives.

What if the judge instead wants, at all costs, to avoid having guilty people walking the streets of his city (that is, to minimize false negatives)? Then he is going to send to prison anyone who has even a small chance of being guilty. In this case, there will be a lot of people sitting in prisons, and with this number, the number of innocent people sent to prison will also increase. In the end, while minimizing false negatives, we maximize false positives.
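The trade-off the judge faces can be simulated by moving a decision threshold over predicted "guilt scores". The scores and labels below are purely illustrative numbers of our own:

```python
# Hypothetical "guilt scores" for 10 defendants (1.0 = certainly guilty),
# together with the ground truth.
scores = [0.05, 0.15, 0.35, 0.45, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95]
guilty = [False, False, False, True, False, True, True, True, True, True]

def errors(threshold):
    """Count (false positives, false negatives) at a given threshold."""
    fp = sum(1 for s, g in zip(scores, guilty) if s >= threshold and not g)
    fn = sum(1 for s, g in zip(scores, guilty) if s < threshold and g)
    return fp, fn

# A cautious judge (high threshold) makes few FPs but many FNs;
# a harsh judge (low threshold) does the opposite.
print(errors(0.99))  # → (0, 6)
print(errors(0.01))  # → (4, 0)
```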

You may have gotten the impression that false positives and false negatives are the errors our classifier makes. These errors also have names in statistics: Type I and Type II errors.

• A Type I error is a false positive: it occurs when we reject the null hypothesis (accept the alternative hypothesis), given that the null hypothesis is true.

• A Type II error is a false negative: it occurs when we accept the null hypothesis, given that the null hypothesis is false.

If you would like to revise Type I and Type II errors, read our explanation in the Inferential Statistics section "Categorical Data".

3.   False Positives vs False Negatives

We can now deduce that, depending on the problem, we are interested in minimizing either false negatives or false positives. Let's provide two examples, one in favor of minimizing each:

1. Judge Case: You are a judge and you decide if a person is guilty or innocent, based on your impression. There is a chance that you can make a wrong decision that a person is guilty and send him to prison, although he was actually innocent (false positive). On the contrary, there is a chance that you can make a wrong decision that a person is innocent and set him free, although he was guilty (false negative). Since, by common sense, it is much more important to avoid sending an innocent person to prison rather than setting a guilty person free, we are interested in minimizing false positives.

2. Cancer Case: You are a medic and you decide if a person is sick or healthy, based on his test results. There is a chance that you can make a wrong decision that a person is sick and send him to re-test, although he was actually healthy (false positive). On the contrary, there is a chance that you can make a wrong decision that a person is healthy and send him home, although he was sick (false negative). Since, by common sense, it is much more important to avoid sending a sick person home rather than making a healthy person re-take a test, we are interested in minimizing false negatives.

Let's draw a confusion matrix for each of the two cases:

(Confusion matrix tables for the Judge case, where the positive class is "guilty", and for the Cancer case, where the positive class is "sick".)

We always want to have as many True Positives and True Negatives as possible. Thus, if we have a case such as the Judge case, where it is important to minimize False Positives, we will use Precision as an evaluation technique. On the contrary, if we have a case such as the Cancer case, where it is important to minimize False negatives, we will use Recall as an evaluation technique. Let's explain these techniques more in-detail.

4.   Precision & Recall

Whichever error we want to minimize, we always have True Positives in the numerator and the sum of True Positives and False Positives (or False Negatives) in the denominator. Our goal is to maximize this ratio (fraction). Let's show what we mean:


Precision

Precision = TP / (TP + FP)

Out of all points that were predicted as positive ("send to prison"), how many of them were actually positive ("guilty")?

Our goal is to maximize this fraction, i.e. we would like the numerator and the denominator to be as close as possible so that the result is close to 1. This occurs when we have a very small number of False Positives. In other words, by maximizing our Precision, we minimize our False Positives. Therefore, when we are interested in decreasing this error, we use Precision as an evaluation metric for our binary classifier, and the higher the Precision score, the better.

Let's assume we have Precision = 0.93. That means that when our classifier predicts a point as positive (e.g. predicts a person as guilty), it is correct 93% of the time. We can also say that 93% is the probability that a Type I error does not occur.

Recall

Recall = TP / (TP + FN)

Out of all points that were actually positive ("sick"), how many of them were correctly predicted as positive ("send to re-test")?

Our goal is again to maximize this fraction, i.e. we would like the numerator and the denominator to be as close as possible so that the result is close to 1. This occurs when we have a very small number of False Negatives. In other words, by maximizing our Recall, we minimize our False Negatives. Therefore, when we are interested in decreasing this error, we use Recall as an evaluation metric for our binary classifier, and the higher the Recall score, the better.

Let's assume we have Recall = 0.96. That means that, out of all points that were actually positive, our classifier correctly predicts them as positive (e.g. predicts a person as sick) 96% of the time. We can also say that 96% is the probability that a Type II error does not occur.

Recall is also known as:

• Sensitivity

• The true positive rate

• Probability of detection
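Both metrics are one-liners in code. A sketch using the judge's counts from earlier (TP = 18, FP = 7, FN = 6):

```python
def precision(tp, fp):
    """Of everything predicted positive, what fraction was truly positive?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything actually positive, what fraction did we catch?"""
    return tp / (tp + fn)

print(round(precision(18, 7), 2))  # 18 / 25 → 0.72
print(round(recall(18, 6), 2))     # 18 / 24 → 0.75
```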

5.   F1-score

In many problems, we prefer giving more weight to precision or to recall. However, in some cases, we also want to seek a balance between both of them. This is when F1-score comes in.

The F1-score is often introduced as the harmonic mean of precision and recall (or positive predictive value and sensitivity), where:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

However, we might also use the F0.5-score or the F2-score, depending on which errors we seek to minimize, as the table below shows:

| Score | Weights more | Use when |
| --- | --- | --- |
| F0.5 | Precision | minimizing False Positives (Type I), maximizing False Negatives (Type II) |
| F1 | Precision and Recall equally | seeking a balance between both errors |
| F2 | Recall | minimizing False Negatives (Type II), maximizing False Positives (Type I) |

Thus, technically, we can use the following formula to calculate any Fβ:

Fβ = (1 + β²) * (Precision * Recall) / (β² * Precision + Recall)
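The general Fβ formula is easy to sketch in Python (the helper name and the example precision/recall values are our own):

```python
def f_beta(precision, recall, beta):
    """General F-beta: beta > 1 weights recall more, beta < 1 weights precision more."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.5, 1.0
print(round(f_beta(p, r, 1.0), 3))  # harmonic mean of p and r
print(round(f_beta(p, r, 0.5), 3))  # pulled towards the (lower) precision
print(round(f_beta(p, r, 2.0), 3))  # pulled towards the (higher) recall
```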

6.1.   Accuracy Paradox

Let's delve into the possible classification cases:

• True Positives (TP): number of positive examples, labeled as such.

• False Positives (FP): number of negative examples, labeled as positive.

• True Negatives (TN): number of negative examples, labeled as such.

• False Negatives (FN): number of positive examples, labeled as negative.
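Given parallel lists of actual and predicted labels, all four counts can be tallied in one pass; a minimal sketch with our own toy labels:

```python
def confusion_counts(actual, predicted, positive=1):
    """Count (TP, FP, TN, FN) from parallel lists of labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # → (2, 1, 1, 1)
```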

Now, let's say we have a classifier trained to do spam filtering. For instance, it could have gotten the following results:

| | Actual: spam | Actual: not spam |
| --- | --- | --- |
| Predicted: spam | TP = 10 | FP = 25 |
| Predicted: not spam | FN = 15 | TN = 100 |

That gives accuracy = (10 + 100)/(10 + 25 + 100 + 15) ≈ 73.3%. Now suppose we change the model so that it simply classifies every e-mail as "not spam". Because we now keep the 10 caught spam e-mails in the mailbox, although they are spam, they become false negatives and are added to the 15 spam e-mails that we were already keeping; likewise, the 25 legitimate e-mails that were wrongly flagged are now kept correctly and become true negatives.

We get accuracy = (0 + 125)/(0 + 125 + 0 + 25) = 83.3%. This looks crazy: we changed our model to a completely useless one, with exactly zero predictive power, and yet we got an increase in accuracy.

This is called the accuracy paradox. When TP < FP, then accuracy will always increase when we change a classification rule to always output “negative” category. Conversely, when TN < FN, the same will happen when we change our rule to always output “positive”.

The accuracy paradox can also be seen through a simpler example. Imagine we have no predictors at all and just flip an unfair coin with class probabilities (0.6, 0.4). Accuracy is maximized if we classify everything as the first class, completely ignoring the 40% probability that any outcome might belong to the second class.
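The coin example can be simulated directly; the seed and sample size below are arbitrary choices of ours:

```python
import random

random.seed(0)
# Unfair coin: class "A" with probability 0.6, class "B" with 0.4.
actual = random.choices(["A", "B"], weights=[0.6, 0.4], k=10_000)

# A "classifier" that ignores the input and always predicts the majority class.
predicted = ["A"] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(f"{accuracy:.2%}")  # close to 60%, with zero predictive power
```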

Let's see another example. Imagine we have 1,000,000 bank transactions and we would like to identify the fraudulent ones. Assume our classifier predicted 990,000 of them correctly, which gives a good-looking 99% accuracy (990,000 / 1,000,000). However, the 10,000 incorrectly classified transactions are still a lot. Consequently, accuracy has its pitfalls, especially when it comes to imbalanced data (see our section "Imbalanced Datasets & how to deal with it").

6.2.   Difference between Accuracy & F1-score

We have previously seen that accuracy can be largely driven by a large number of True Negatives, which in most business circumstances we do not focus on much, whereas False Negatives and False Positives usually have business costs (tangible and intangible). Thus, the F1-score might be a better measure to use if we need to seek a balance between Precision and Recall AND there is an uneven class distribution (a large number of Actual Negatives). Nevertheless, note that not only Accuracy, but also Precision, Recall, and the F1-score are all biased towards (focused on) the majority class.
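A small sketch contrasting the two metrics on an imbalanced data set; the counts are our own illustrative numbers:

```python
def metrics(tp, fp, tn, fn):
    """Return (accuracy, F1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1

# Imbalanced data: 990 actual negatives, only 10 actual positives.
# A weak classifier that finds just 2 of the 10 positives:
acc, f1 = metrics(tp=2, fp=5, tn=985, fn=8)
print(round(acc, 3), round(f1, 3))  # accuracy looks great; F1 exposes the weakness
```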

There are many other metrics for evaluating binary classification systems, and plots (e.g. ROC) are very helpful too. The point to be made is that you should not take any of them in an isolated way: there is no best way to evaluate any system, but different metrics give us different (and valuable) insights into how a classification model performs.

# 7.1. Precision-Recall Curve

A precision-recall curve is a plot of the precision (y-axis) and the recall (x-axis) for different thresholds, much like the ROC curve.


A model with perfect skill is depicted as a point at (1.0, 1.0). A skilful model is represented by a curve that bows towards (1.0, 1.0), above the flat red line of no skill.
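Without a plotting library, the points of such a curve can be computed by sweeping a threshold over predicted scores; the scores and labels below are our own toy data:

```python
# Toy scores and ground-truth labels (purely illustrative).
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1]

def point(threshold):
    """One (recall, precision) point of the curve at a given threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return recall, precision

# Lowering the threshold trades precision for recall.
curve = [point(t) for t in (0.95, 0.65, 0.35, 0.05)]
print(curve)
```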

# 7.2.   Receiver Operating Characteristic Curve (ROC)

A useful tool when predicting the probability of a binary outcome is the Receiver Operating Characteristic curve or ROC curve. It is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis) for a number of different candidate threshold values between 0.0 and 1.0.

The true positive rate (also called Recall, or Sensitivity) is calculated as:

TPR = TP / (TP + FN)

It describes how good the model is at predicting the positive class when the actual outcome is positive.

The false positive rate is calculated as:

FPR = FP / (FP + TN)

It is also called the false alarm rate, as it summarizes how often the positive class is predicted when the actual outcome is negative. The false positive rate is also the inverted specificity (FPR = 1 - Specificity), where specificity is:

Specificity = TN / (TN + FP)

Put another way, the ROC curve plots the false alarm rate versus the hit rate. Now it becomes clear why the top left corner, where the false positive rate is zero and the true positive rate is 1, represents an ideal point: a good classifier maximizes the true positive rate while minimizing the false positive rate.
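The same threshold sweep produces ROC points, now as (false positive rate, true positive rate) pairs; again, the data below is purely illustrative:

```python
# Toy scores and ground-truth labels (our own numbers).
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1]

def roc_point(threshold):
    """One ROC point: (false positive rate, true positive rate)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    return fp / (fp + tn), tp / (tp + fn)

# Sweeping the threshold traces the curve from (0, 0) towards (1, 1).
for t in (0.95, 0.65, 0.35, 0.05):
    print(t, roc_point(t))
```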

Some classifiers output not a class label but a probability. For example, in logistic regression, the output can be the probability of customer churn, i.e. of the answer "yes" (the label equal to 1).

This probability is a value between 0 and 1. Logarithmic loss (also known as log-loss) measures the performance of a classifier when the predicted output is a probability value between 0 and 1 (e.g. Logistic Regression case).

Log-loss quantifies the accuracy of a classifier by penalizing false classifications.

So, for example, if a classifier predicts a probability of 0.13 (meaning it leans towards class 0) when the actual label is 1, that is bad and results in a high log-loss. Similarly, if a classifier predicts a probability of 0.93 (meaning it leans towards class 1) when the actual label is indeed 1, that is good and results in a low log-loss.

Thus, log-loss increases as the predicted probability diverges from the actual label, and decreases as the predicted probability gets closer to it. The goal of our machine learning models is to minimize the log-loss value. A perfect model would have a log-loss of 0.

To calculate the log-loss of a binary classifier, we first calculate the log-loss for each row using the log-loss equation, which measures how far each prediction is from the actual label. Then, we calculate the average log-loss across all rows of the test set.

1. Calculating the log-loss for each row using the log-loss equation:

LogLoss = -[y * log(p) + (1 - y) * log(1 - p)]

where y is the actual label (0 or 1) and p is the predicted probability of the positive class.
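Both steps can be sketched in a few lines of Python; the clipping constant eps is our own guard against log(0), and the probabilities and labels are illustrative:

```python
import math

def row_log_loss(y, p, eps=1e-15):
    """Log-loss of a single prediction: -[y*log(p) + (1-y)*log(1-p)]."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Predicted probabilities vs. actual labels for a small test set.
probs = [0.13, 0.93, 0.80, 0.40]
labels = [1, 1, 1, 0]

per_row = [row_log_loss(y, p) for y, p in zip(labels, probs)]
log_loss = sum(per_row) / len(per_row)  # average over all rows
print(round(log_loss, 3))

# A confident wrong prediction (0.13 for label 1) is penalized heavily:
print(round(row_log_loss(1, 0.13), 3))  # ≈ 2.04
print(round(row_log_loss(1, 0.93), 3))  # ≈ 0.073
```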