Supervised Learning: Classification

Logistic Regression

1.   Introduction 

Logistic Regression, despite the "regression" term in its name, is used in classification problems where the dependent (target) variable has two possible outcomes. However, the model can be extended to tackle multiclass classification problems, which we will discuss at the end of this article.

2.   Key Terms 

Odds are used in the Logistic Regression algorithm to model probabilities:

$$\text{odds}(p) = \frac{p}{1-p} \qquad (1)$$

Figure 1:  Odds function

As you can see from formula (1), odds(p) ∈ [0; +∞) given that p ∈ [0; 1]. However, we want our model to take a real-valued number from (-∞; +∞) (as our features can have any values) and output a number in the range [0; 1] that describes a probability. The logistic function (also called the sigmoid) possesses all of these traits. It can be derived as the inverse of the log-odds function, which is also called the logit.

$$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) \qquad (2)$$

Figure 2:  Logit function

We can achieve the required properties by reflecting the logit function about the line y = x. This transformation can be performed by calculating the inverse of expression (2), which is called the logistic function:

In order to calculate it, we should solve the following equation for p:

$$x = \ln\left(\frac{p}{1-p}\right)$$

Thus, the expression for the logistic (sigmoid) function is the following:

$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (3)$$

Figure 3:  Logistic function
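As a quick illustration (not from the original article), the following sketch implements the odds, logit, and sigmoid functions with NumPy and checks numerically that the sigmoid is the inverse of the logit:

```python
import numpy as np

def odds(p):
    # Formula (1): odds corresponding to probability p
    return p / (1 - p)

def logit(p):
    # Formula (2): log-odds (logit) of probability p
    return np.log(p / (1 - p))

def sigmoid(x):
    # Formula (3): logistic (sigmoid) function, the inverse of the logit
    return 1 / (1 + np.exp(-x))

p = np.array([0.1, 0.5, 0.9])
print(odds(p))            # [0.111..., 1.0, 9.0]
print(sigmoid(logit(p)))  # recovers [0.1, 0.5, 0.9]
```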

3.   Model Training

Logistic Regression represents the logit function as a linear combination of the predictors plus an intercept:

$$\text{logit}(p) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \qquad (4)$$

where $x_i$ is the value of the $i$-th predictor and $\beta_i$ is the corresponding coefficient. The coefficients indicate the effect of a one-unit change in a predictor variable on the log odds of "success".

As our training data contains more than one observation, we will denote $\mathbf{x}$ as a column vector of the predictor values for a particular observation (we will also add 1 as its first element to account for the intercept term) and $\boldsymbol{\beta}$ as a column vector of the coefficients $\beta_0, \beta_1, \dots, \beta_n$:

$$\mathbf{x} = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}, \qquad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_n \end{bmatrix}$$

Using this notation, we can rewrite expression (4) as follows:

$$\text{logit}(p) = \boldsymbol{\beta}^{T}\mathbf{x} \qquad (5)$$

If we plug $\boldsymbol{\beta}^{T}\mathbf{x}$ into formula (3), we will get an expression for the probability of the random variable Y (which represents the predicted output) being 1, given the experimental data $\mathbf{x}$ and the model parameters $\boldsymbol{\beta}$:

$$P(Y = 1 \mid \mathbf{x};\, \boldsymbol{\beta}) = \frac{1}{1 + e^{-\boldsymbol{\beta}^{T}\mathbf{x}}} \qquad (6)$$

As we are dealing with a two-class problem, the probability $P(Y = 0 \mid \mathbf{x};\, \boldsymbol{\beta})$ can be expressed as follows:

$$P(Y = 0 \mid \mathbf{x};\, \boldsymbol{\beta}) = 1 - P(Y = 1 \mid \mathbf{x};\, \boldsymbol{\beta}) \qquad (7)$$

We can combine the probabilities used in expressions (6) and (7) into one formula:

$$P(Y = y \mid \mathbf{x};\, \boldsymbol{\beta}) = P(Y = 1 \mid \mathbf{x};\, \boldsymbol{\beta})^{y} \left(1 - P(Y = 1 \mid \mathbf{x};\, \boldsymbol{\beta})\right)^{1-y}, \qquad y \in \{0, 1\} \qquad (8)$$

One can notice that substituting y = 1 into formula (8) recovers expression (6), while substituting y = 0 recovers expression (7).

Our goal is to determine the coefficients $\beta_0, \beta_1, \dots, \beta_n$ from formula (4). The intuition here is that, for any given training observation, we want these coefficients to maximize the probability of observing the correct label. This can be converted into the following formula, called the likelihood (assuming the training observations are independent):

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{N} P(Y = y_i \mid \mathbf{x}_i;\, \boldsymbol{\beta})$$

where N is the number of training observations, $\mathbf{x}_i$ is the predictor vector of the i-th observation, and $y_i$ is its observed label.

This expression can be maximized through various optimization techniques, such as the Newton-Raphson algorithm or gradient descent (which is usually applied to the negative log-likelihood).
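To make the training step concrete, here is a minimal sketch (not the article's own code) of fitting the coefficients by gradient descent on the negative log-likelihood with NumPy; the toy data, learning rate, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iter=5000):
    """Fit the coefficient vector beta by gradient descent on the negative log-likelihood."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend 1 to account for the intercept term
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)              # predicted P(Y = 1 | x; beta) for each observation
        gradient = X.T @ (p - y) / len(y)  # gradient of the negative log-likelihood
        beta -= lr * gradient
    return beta

# Toy data: one feature that separates the two classes reasonably well
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
beta = fit_logistic_regression(X, y)
print(beta)  # [intercept, coefficient of the single feature]
```

With gradient descent the result approximates the maximum-likelihood coefficients; Newton-Raphson typically reaches them in far fewer iterations.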

4.   Making Predictions

Now that we have the vector of model parameters $\boldsymbol{\beta}$, we can calculate the predicted value of the logit function for any new observation $\mathbf{x}_{new}$ (we will use the hat symbol for predicted values):

$$\hat{y}_{logit} = \boldsymbol{\beta}^{T}\mathbf{x}_{new}$$

Then we plug this value into the logistic function in order to determine the probability of the data belonging to Class 1 (True, "Yes", etc.):

$$\hat{p} = \frac{1}{1 + e^{-\boldsymbol{\beta}^{T}\mathbf{x}_{new}}}$$

The last step is to set a threshold T ∈ [0; 1] that will be used in order to make a prediction:

$$\hat{y} = \begin{cases} 1, & \hat{p} \ge T \\ 0, & \hat{p} < T \end{cases}$$

By default the threshold T is set to 0.5, but you can adjust it based on your needs (usually based on the True Positive Rate and False Positive Rate trade-off).

Figure 4:  Making Prediction
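Continuing in the same spirit (again an illustrative sketch rather than the article's code; the coefficient values and new observations are made up), prediction reduces to computing the sigmoid of $\boldsymbol{\beta}^{T}\mathbf{x}_{new}$ and comparing it to the threshold:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict_proba(X_new, beta):
    """Predicted probability of Class 1 for each new observation."""
    X_new = np.column_stack([np.ones(len(X_new)), X_new])  # prepend 1 for the intercept
    return sigmoid(X_new @ beta)

def predict(X_new, beta, threshold=0.5):
    """Class labels obtained by thresholding the predicted probabilities."""
    return (predict_proba(X_new, beta) >= threshold).astype(int)

# beta would come from the training step; these values are purely illustrative
beta = np.array([-4.0, 2.0])                 # [intercept, feature coefficient]
X_new = np.array([[0.8], [2.2], [3.7]])
print(predict_proba(X_new, beta))            # probabilities of Class 1
print(predict(X_new, beta, threshold=0.5))   # 0/1 class predictions
```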

5.   Regularization

Regularization means making the model less complex, which can allow it to generalize better (i.e. avoid overfitting) and perform better on new data.

 

As was mentioned above, the coefficients of logistic regression are usually fitted by maximizing the log-likelihood. As many optimization techniques are aimed at finding the minimum of a function, we can redefine our goal as minimizing the negative log-likelihood:

$$-\log L(\boldsymbol{\beta}) = -\sum_{i=1}^{N} \left[ y_i \log P(Y = 1 \mid \mathbf{x}_i;\, \boldsymbol{\beta}) + (1 - y_i) \log\left(1 - P(Y = 1 \mid \mathbf{x}_i;\, \boldsymbol{\beta})\right) \right]$$

We can penalize the model for having coefficients that are far from zero by adding a regularization term $R(\boldsymbol{\beta})$ multiplied by a parameter $\lambda$, which is called the regularization strength:

$$\text{Cost}(\boldsymbol{\beta}) = -\log L(\boldsymbol{\beta}) + \lambda R(\boldsymbol{\beta})$$

The two most popular regularizations are L1 and L2:

$$\text{L1:} \quad R(\boldsymbol{\beta}) = \lVert \boldsymbol{\beta} \rVert_1 = \sum_{j=1}^{n} \lvert \beta_j \rvert$$

$$\text{L2:} \quad R(\boldsymbol{\beta}) = \frac{1}{2} \lVert \boldsymbol{\beta} \rVert_2^2 = \frac{1}{2} \sum_{j=1}^{n} \beta_j^2$$

The factor ½ in L2 regularization is used to simplify the derivative calculations. Through λ we can control the impact of the regularization term: higher values of λ lead to smaller coefficients (i.e. more regularization), but values that are too high can lead to underfitting. In the scikit-learn package L2 regularization is used by default. Instead of the regularization strength λ, its inverse is used: the C parameter (the default is C=1.0). In contrast to λ, smaller values of C lead to smaller coefficients, and values that are too small can lead to underfitting.
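For illustration (not from the original article), the snippet below shows how the C parameter controls the size of the fitted coefficients in scikit-learn; the synthetic data and the chosen C values are arbitrary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

for C in [0.01, 1.0, 100.0]:
    # penalty="l2" is the scikit-learn default; smaller C means stronger regularization
    model = LogisticRegression(penalty="l2", C=C).fit(X, y)
    print(f"C={C}: coefficients = {model.coef_.round(3)}")
```

You should see the coefficients shrink toward zero as C decreases.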


It is important to normalize the data before performing regularized logistic regression, so that the regularization penalty affects all of the coefficients in a similar manner.
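One convenient way to do this (again, just an assumed sketch) is to place a StandardScaler in front of the classifier in a scikit-learn Pipeline, so the scaling learned on the training data is reapplied consistently at prediction time:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data for demonstration purposes only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Scale features to zero mean and unit variance before fitting the regularized model
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
pipeline.fit(X, y)
print(pipeline.predict(X[:5]))  # class predictions for the first five observations
```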

6.   Logistic Regression For Multinomial Problems

Logistic regression can be generalized to handle problems with more than two possible outcomes. The most popular approach is called ”One-vs-Rest” logistic regression where we split our multinomial problem with M classes into M binary classification problems (see Figure 5).

Figure 5:  "One-vs-Rest" Logistic Regression. A 3-class classification problem (Apple, Orange, Plum) is split into 3 binary classification problems: Apple vs. Not Apple, Orange vs. Not Orange, and Plum vs. Not Plum.

In this case we generate a separate coefficient vector $\boldsymbol{\beta}^{(m)}$ for each binary classification problem (basically, we train M separate Logistic Regression models). When we have to classify a new observation, we calculate the probabilities of the data belonging to each class (which are the outputs of our models) and select the class that has the highest probability.
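As an illustrative sketch (the fruit labels mirror Figure 5; the feature values are made up), scikit-learn's OneVsRestClassifier trains one binary Logistic Regression model per class and selects the class with the highest predicted probability:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: two made-up features (e.g. weight and a color score) for three fruit classes
X = np.array([[150, 0.2], [160, 0.3], [140, 0.8], [130, 0.9], [40, 0.5], [45, 0.6]])
y = np.array(["Apple", "Apple", "Orange", "Orange", "Plum", "Plum"])

# Trains one binary Logistic Regression model per class ("class" vs "rest")
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict([[145, 0.25]]))        # predicted class for a new observation
print(ovr.predict_proba([[145, 0.25]]))  # per-class probabilities (the highest one wins)
```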

7.   Logistic Regression in Python

View/download a template of Logistic Regression located in a git repository here.