Supervised Learning: Lasso & Ridge Regression
1. Introduction to Lasso Regularization Term (L1)
LASSO - Least Absolute Shrinkage and Selection Operator - was first formulated by Robert Tibshirani in 1996. It is a powerful method that performs two main tasks: regularization and feature selection.
Let's look at lasso regularization in linear models, where the OLS (Ordinary Least Squares) objective is combined with a regularization term.
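The lasso objective adds an L1 penalty, weighted by a tuning parameter α, to the OLS loss:
$$\underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{fit training data well (OLS)}} + \underbrace{\alpha \sum_{j=1}^{p} |\beta_j|}_{\text{keep parameters small (L1 penalty)}}$$
Here ŷ_i is the model's prediction for observation i and β_j are the model coefficients. The L1 penalty is also called the penalty term or regularisation term.
The objective is a trade-off between fitting the training data well and keeping parameters small.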
The LASSO method puts a constraint on the sum of the absolute values of the model parameters: the sum has to be less than or equal to a fixed value (the upper bound t):
$$\sum_{j=1}^{p} |\beta_j| \le t$$
To do so, the method applies a shrinking (regularization) process in which it penalizes the coefficients of the regression variables, shrinking some of them to zero. During the feature selection process, the variables that still have a non-zero coefficient after shrinking are selected to be part of the model. The goal of this process is to minimize the prediction error.
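One way to see how coefficients end up at exactly zero is the soft-thresholding rule. The sketch below assumes orthonormal features, in which case each lasso coefficient is the OLS coefficient moved towards zero by a fixed amount, and set to zero if it would cross zero. The name `soft_threshold`, the threshold value and the OLS coefficients are made up for illustration.

```python
def soft_threshold(z, threshold):
    """Shrink z towards zero by `threshold`; return 0.0 if |z| <= threshold."""
    if z > threshold:
        return z - threshold
    if z < -threshold:
        return z + threshold
    return 0.0


# Hypothetical OLS coefficients for four features (illustrative values only).
ols_coefficients = [3.0, -0.5, 0.2, -2.0]

# Shrinkage step: every coefficient moves towards zero by the same amount.
shrunk = [soft_threshold(b, 1.0) for b in ols_coefficients]

# Selection step: keep the features whose coefficient is still non-zero.
selected = [j for j, b in enumerate(shrunk) if b != 0.0]

print(shrunk)    # [2.0, 0.0, 0.0, -1.0]
print(selected)  # [0, 3]
```

Features 1 and 2 drop out of the model, while features 0 and 3 stay with smaller coefficients.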
2. Parameter α
In practice, the tuning parameter α, which controls the strength of the penalty, is of great importance. When α is sufficiently large, coefficients are forced to be exactly zero, which reduces dimensionality. The larger α is, the more coefficients are shrunk to zero. On the other hand, if α = 0, we have just an OLS (Ordinary Least Squares) regression.
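The effect of α can be checked with a small coordinate-descent solver. This is a minimal sketch, not a production implementation: it assumes the objective is the sum of squared errors plus α times the sum of absolute coefficients, with no intercept, and it uses a tiny made-up dataset with orthogonal columns so the result is easy to verify by hand. The function names are hypothetical.

```python
def soft_threshold(z, threshold):
    """Shrink z towards zero by `threshold`; return 0.0 if |z| <= threshold."""
    if z > threshold:
        return z - threshold
    if z < -threshold:
        return z + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=100):
    """Minimize sum_i (y_i - x_i . beta)^2 + alpha * sum_j |beta_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            norm_sq = sum(X[i][j] ** 2 for i in range(n))
            # One-dimensional lasso solution for coefficient j.
            beta[j] = soft_threshold(rho, alpha / 2.0) / norm_sq
    return beta


# Made-up data: y = 3*x1 + 1*x2 + 0.1*x3 with mutually orthogonal columns.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [4.1, 1.9, -2.1, -3.9]

alphas = [0.0, 2.0, 10.0, 30.0]
zero_counts = []
for alpha in alphas:
    beta = lasso_coordinate_descent(X, y, alpha)
    zero_counts.append(sum(1 for b in beta if b == 0.0))

print(zero_counts)  # [0, 1, 2, 3]
```

With α = 0 the solver returns the OLS coefficients; each larger α removes one more feature. Library implementations may scale the objective differently (scikit-learn's `Lasso`, for example, divides the squared-error term by 2n), so α values are not directly comparable across implementations.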
There are many advantages of using the LASSO method.
First of all, it can provide very good prediction accuracy, because shrinking and removing coefficients can reduce variance without a substantial increase in bias. This is especially useful when you have a small number of observations and a large number of features. In terms of the tuning parameter α, bias increases and variance decreases as α increases, so a trade-off between bias and variance has to be found.
Moreover, the LASSO helps to increase model interpretability by eliminating irrelevant variables that are not associated with the response variable; this also reduces overfitting. This is the point of most interest here, because the focus is on the feature selection task.
4. Introduction to Lasso Regression
Lasso with linear models is called Lasso Regression. It is the model that describes the relationship between response variable Y and explanatory variables X. In the case of one explanatory variable, Lasso Regression is called Simple Lasso Regression while the case with two or more explanatory variables is called Multiple Lasso Regression.
Lasso Regression shares the assumptions of Linear Regression, such as:
The response variable is normally distributed (conditional on the explanatory variables)
There is a linear relationship between the response variable and the explanatory variables
The random errors are normally distributed, have constant (equal) variances at any point in X, and are independent
To read more about Linear Regression assumptions, go to Linear Regression.
5. The Model
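The LASSO minimizes the sum of squared errors, with an upper bound on the sum of the absolute values of the model parameters. The lasso estimate is defined by the solution to the L1 optimization problem:
$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t$$
where t is the upper bound for the sum of the absolute values of the coefficients, n is the number of observations and p is the number of explanatory variables (the intercept is omitted for brevity). This optimization problem is equivalent to the penalized parameter estimation
$$\hat{\beta}(\alpha) = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \alpha \sum_{j=1}^{p} |\beta_j| \Big\}$$
where α ≥ 0 is the parameter that controls the strength of the penalty: the larger the value of α, the greater the amount of shrinkage.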
The relationship between α and the upper bound t is inverse. We already know that α controls the strength of the penalty: when α is large enough, coefficients are forced to be exactly zero, and when α = 0, we have just the OLS (Ordinary Least Squares) estimate.
When t is close to 0, say 0.00001 (meaning that the sum of the absolute values of all coefficients must be less than 0.00001), this corresponds to a very large α, as the coefficients are forced to be essentially 0. Conversely, as t goes to infinity (meaning that the sum of the absolute values of the coefficients is effectively unconstrained), α becomes 0, as there is no need to shrink the coefficients, so the problem becomes ordinary least squares.
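The inverse relationship can be made concrete by computing the bound t that a given α implies, namely the sum of absolute values of the fitted coefficients. The sketch below assumes orthonormal features and a penalized objective of the form "sum of squared errors plus α times the L1 norm", under which each coefficient is the OLS coefficient shrunk by α/2. The OLS values and the name `implied_bound` are hypothetical.

```python
# Hypothetical OLS coefficients (orthonormal features assumed).
ols = [3.0, -1.0, 0.5]


def implied_bound(alpha):
    """Return t = sum_j |beta_j(alpha)| for the lasso solution at this alpha."""
    return sum(max(abs(b) - alpha / 2.0, 0.0) for b in ols)


alphas = [0.0, 1.0, 2.0, 4.0, 8.0]
bounds = [implied_bound(a) for a in alphas]

print(bounds)  # [4.5, 3.0, 2.0, 1.0, 0.0]
```

At α = 0 the bound equals the L1 norm of the OLS solution, so any larger t leaves the problem unconstrained; as α grows, the implied t shrinks to 0.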
6. Lasso Regression in Python
View/download a template of Lasso Regression located in a git repository here.
1. Introduction to Ridge Regularization Term (L2)
Ridge Regression uses the OLS method, but with one difference: it has a regularization term (also known as the L2 penalty or penalty term). This regularization term tries to keep the parameters small and acts as a penalty on models with many large feature weights. Therefore, if Ridge Regression finds two possible linear models that predict the training data values equally well, it chooses the one with the smaller overall sum of squared feature weights. Thus, Ridge Regression tries to fit the training data well by using OLS, and to keep the parameters small by using a regularization term.
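The ridge objective adds an L2 penalty, weighted by a tuning parameter α, to the OLS loss:
$$\underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{\text{fit training data well}} + \underbrace{\alpha \sum_{j=1}^{p} \beta_j^2}_{\text{keep parameters small (L2 penalty)}}$$
Here ŷ_i is the model's prediction for observation i and β_j are the model coefficients. The L2 penalty is also called the penalty term or regularisation term.
The objective is a trade-off between fitting the training data well and keeping parameters small.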
Ridge Regression allows you to regularize coefficients. This means that the estimated coefficients are pushed towards 0, to make them work better on new data-sets ("optimized for prediction"). This allows you to use complex models and avoid over-fitting at the same time.
For Ridge Regression you have to set α ("alpha"), a so-called "meta-parameter" (or "regularization parameter") that defines how aggressively regularization is performed. Alpha simply defines the regularization strength and is usually chosen by cross-validation.
If α is too large, the coefficients are shrunk too strongly towards zero, causing underfitting.
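The effect of α on a ridge fit is easiest to see with a single feature and no intercept, where minimizing the sum of squared errors plus α times the squared slope has the closed form slope = Σxy / (Σx² + α). The sketch below uses made-up data that follows y = 2x exactly; the function name is hypothetical.

```python
def ridge_slope(x, y, alpha):
    """Closed-form ridge slope for one feature and no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # exactly y = 2x

alphas = [0.0, 30.0, 270.0, 1e6]
slopes = [ridge_slope(x, y, a) for a in alphas]

for a, s in zip(alphas, slopes):
    print(f"alpha={a:>9}: slope={s:.5f}")
```

With α = 0 the slope is the OLS value 2; with α = 30 it is already halved, and for very large α it approaches 0 without ever reaching it, so the model underfits data that is perfectly linear.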
Regularization works especially well when you have a relatively small amount of training data compared to the number of features in a model. It becomes less important as the amount of training data increases.
2. Feature Normalization
Feature scaling is very important in Ridge Regression: input variables with different scales will have different contributions to L2 penalty. Transform input features so that L2 penalty is applied more fairly to all features (without weighting some more than others just because of the difference in scales).
Fit the scaler using the training set, then apply the same scaler to transform the test set.
Do not scale the training and test sets using different scalers: this could lead to a random skew in the data.
Note that the resulting model and the transformed features may be harder to interpret.
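The fit-on-train, apply-to-test rule above can be sketched with plain standardization (subtract the mean, divide by the standard deviation). The helper names `fit_scaler` and `transform` and the data are hypothetical; the sketch does not handle constant columns, whose standard deviation is zero.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Learn per-column mean and standard deviation from the training rows."""
    columns = list(zip(*rows))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]


def transform(rows, means, stds):
    """Standardize rows using statistics learned elsewhere (the training set)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Two features on very different scales (made-up values).
train = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
test = [[4.0, 4000.0]]

means, stds = fit_scaler(train)                 # fit on the training set only
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)      # reuse the same statistics

print(test_scaled)
```

After scaling, both features of the test row take the same value, so the L2 penalty treats them equally. scikit-learn's `StandardScaler` follows the same fit/transform pattern.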
3. Ridge Regression in Python
View/download a template of Ridge Regression located in a git repository here.
Ridge VS. Lasso
In this section, the difference between Lasso and Ridge regression models is outlined. We assume you are familiar with both Ridge and Lasso regression as described above.
Ridge regression is an extension of linear regression. It’s basically a regularized linear regression model. The α parameter is a scalar that has to be tuned as well, using a method called cross-validation.
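As a rough illustration of tuning α by cross-validation, the sketch below runs leave-one-out cross-validation over a small grid of candidate values for a one-feature ridge model without intercept. The data, the grid and the function names are made up; a real workflow would more likely use k-fold cross-validation from a library.

```python
def ridge_slope(x, y, alpha):
    """Closed-form ridge slope for one feature and no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)


def loo_cv_error(x, y, alpha):
    """Mean squared prediction error over leave-one-out splits."""
    total = 0.0
    for i in range(len(x)):
        # Hold out observation i and fit on the rest.
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope = ridge_slope(x_train, y_train, alpha)
        total += (y[i] - slope * x[i]) ** 2
    return total / len(x)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]  # made-up noisy data

alphas = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_errors = {a: loo_cv_error(x, y, a) for a in alphas}

# Pick the candidate with the lowest held-out error.
best_alpha = min(cv_errors, key=cv_errors.get)
print(best_alpha, cv_errors[best_alpha])
```

The chosen α is the one that predicts held-out points best, not the one that fits the training data best, which is why it guards against both overfitting and underfitting.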
An extremely important fact we need to notice about ridge regression is that it enforces the β coefficients to be lower, but it does not enforce them to be zero. That is, it will not get rid of irrelevant features but rather minimize their impact on the trained model.
In Lasso regression, the only difference from Ridge regression is that the regularization term uses absolute values instead of squares. But this difference has a huge impact on the trade-off we’ve discussed before. The Lasso method overcomes the disadvantage of Ridge regression by not only penalizing high values of the coefficients β but actually setting them to zero if they are not relevant. Therefore, you might end up with fewer features included in the model than you started with, which is a huge advantage.
Keep in mind that Ridge regression can't zero out coefficients; thus, you either end up including all the coefficients in the model or none of them. In contrast, the LASSO does both parameter shrinkage and variable selection automatically. If some of your covariates are highly correlated, you may want to look at the Elastic Net instead of the LASSO.
2. Why Lasso Shrinks Coefficients
The main difference between ridge and lasso regression is the shape of their constraint regions. Ridge regression uses the L2 norm for its constraint. For the P = 2 case (where P is the number of regressors), the constraint region is a disk bounded by a circle. Lasso uses the L1 norm for its constraint. For the P = 2 case, the constraint region is a diamond.
The elliptical contour plot in the figure represents the sum-of-squares error term. The Lasso estimate minimizes the sum of squares while satisfying its "diamond" constraint. The Ridge estimate minimizes the sum of squares while satisfying its "circle" constraint.
Thus, the optimal point is the point where the smallest ellipse touches the L1 or L2 constraint region. That point gives the minimum of the sum of squares among all coefficient values that satisfy the constraint, and hence the Lasso or Ridge estimate.
For the LASSO method the constraint region is a diamond, so it has corners. Because of these corners, there is a high probability that the optimum (minimum) point falls on a corner of the diamond. For the P = 2 case, an optimal point on a corner means that one of the coefficient estimates is exactly zero (β_j = 0).
For the RIDGE method the constraint region is a disk, so it has no corners; the minimum point will generally lie away from the axes, and the coefficients are almost never exactly zero.
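The geometric argument can be checked numerically for P = 2. The sketch below is a simplification under stated assumptions: it uses circular rather than elliptical error contours, a hypothetical unconstrained optimum at (2, 0.5) and a constraint size t = 1, and it scans only the first-quadrant boundary of each region, which is where the constrained optimum lies for this choice of optimum.

```python
import math

center = (2.0, 0.5)  # hypothetical unconstrained (OLS) optimum
t = 1.0              # hypothetical constraint size


def rss(b1, b2):
    """Stand-in for the sum of squared errors: circular contours around `center`."""
    return (b1 - center[0]) ** 2 + (b2 - center[1]) ** 2


steps = 1000

# First-quadrant edge of the diamond |b1| + |b2| = t.
diamond = [(t * k / steps, t - t * k / steps) for k in range(steps + 1)]

# First-quadrant arc of the circle b1^2 + b2^2 = t^2.
circle = [
    (t * math.cos(0.5 * math.pi * k / steps), t * math.sin(0.5 * math.pi * k / steps))
    for k in range(steps + 1)
]

lasso_point = min(diamond, key=lambda b: rss(*b))
ridge_point = min(circle, key=lambda b: rss(*b))

print("lasso:", lasso_point)
print("ridge:", ridge_point)
```

The best point on the diamond is the corner (1.0, 0.0), so the second coefficient is exactly zero, while the best point on the circle keeps both coefficients non-zero.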
Elastic Net is a method that includes both Lasso and Ridge.
The LASSO method has some limitations:
In a small-n-large-p dataset (high-dimensional data with few examples), the LASSO selects at most n variables before it saturates.
If there is a group of highly correlated variables, the LASSO tends to select one variable from the group and ignore the others.
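To overcome these limitations, the elastic net adds a quadratic part to the L1 penalty; used alone, this quadratic part gives ridge regression (also known as Tikhonov regularization or L2). The estimates from the elastic net method are defined by
$$\hat{\beta}^{\text{enet}} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \alpha_1 \sum_{j=1}^{p} |\beta_j| + \alpha_2 \sum_{j=1}^{p} \beta_j^2 \Big\}$$
where α_1 ≥ 0 and α_2 ≥ 0 weight the L1 and L2 parts of the penalty.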
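A minimal coordinate-descent sketch of the elastic net is given below. It assumes the objective is the sum of squared errors plus an L1 part weighted by `alpha_l1` and a quadratic part weighted by `alpha_l2` (both names are hypothetical), with no intercept, and it runs on a small made-up dataset with orthogonal columns so the numbers can be checked by hand.

```python
def soft_threshold(z, threshold):
    """Shrink z towards zero by `threshold`; return 0.0 if |z| <= threshold."""
    if z > threshold:
        return z - threshold
    if z < -threshold:
        return z + threshold
    return 0.0


def elastic_net(X, y, alpha_l1, alpha_l2, n_sweeps=100):
    """Minimize sum_i (y_i - x_i . beta)^2 + alpha_l1 * sum_j |beta_j|
    + alpha_l2 * sum_j beta_j^2 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with the contribution of feature j removed.
            resid = [
                y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                for i in range(n)
            ]
            rho = sum(X[i][j] * resid[i] for i in range(n))
            norm_sq = sum(X[i][j] ** 2 for i in range(n))
            # L1 part thresholds, quadratic part enlarges the denominator.
            beta[j] = soft_threshold(rho, alpha_l1 / 2.0) / (norm_sq + alpha_l2)
    return beta


# Made-up data: y = 3*x1 + 1*x2 + 0.1*x3 with mutually orthogonal columns.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [4.1, 1.9, -2.1, -3.9]

lasso_only = elastic_net(X, y, alpha_l1=2.0, alpha_l2=0.0)
ridge_only = elastic_net(X, y, alpha_l1=0.0, alpha_l2=4.0)
both = elastic_net(X, y, alpha_l1=2.0, alpha_l2=4.0)

print("lasso only :", lasso_only)
print("ridge only :", ridge_only)
print("elastic net:", both)
```

Setting `alpha_l2` to 0 recovers the lasso and setting `alpha_l1` to 0 recovers ridge; with both active, the weakest feature is still removed while the remaining coefficients are shrunk further. scikit-learn's `ElasticNet` uses a different parameterization (`alpha` and `l1_ratio`) and scaling, so its values are not directly comparable.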
2. Comparing L1 & L2 with Elastic Net
Consider the plots of the abs and square functions.
When minimizing a loss function with a regularization term, each of the entries in the parameter vector theta is “pulled” down towards zero. Think of each entry in theta as lying on one of the above curves and being subjected to “gravity” proportional to the regularization hyperparameter k. In the context of L1-regularization, the entries of theta are pulled towards zero proportionally to their absolute values (they lie on the red curve).
In the context of L2-regularization, the entries are pulled towards zero proportionally to their squares (the blue curve).
At first, L2 seems more severe, but the caveat is that, approaching zero, a different picture emerges.
The result is that L2 regularization drives many of your parameters down, but will not necessarily eliminate them, since the penalty all but disappears near zero. In contrast, L1 regularization forces non-essential entries of theta all the way to zero.
Adding ElasticNet (with 0.5 of each L1 and L2) to the picture, we can see it functions as a compromise between the two. One can imagine bending the yellow curve towards either red or blue by tuning the hyperparameter j.
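The three curves can be compared numerically. The sketch below evaluates the absolute-value penalty, the squared penalty and an equal-weight mix of the two at a few parameter values; the function names and the `l1_share` argument (standing in for the mixing hyperparameter) are hypothetical.

```python
def l1_penalty(theta):
    return abs(theta)


def l2_penalty(theta):
    return theta ** 2


def elastic_net_penalty(theta, l1_share=0.5):
    """Weighted mix of the L1 and L2 penalties."""
    return l1_share * abs(theta) + (1.0 - l1_share) * theta ** 2


thetas = [2.0, 1.0, 0.5, 0.1]
rows = [(th, l1_penalty(th), l2_penalty(th), elastic_net_penalty(th)) for th in thetas]

for th, p1, p2, pen in rows:
    print(f"theta={th:<4} L1={p1:.3f}  L2={p2:.3f}  ElasticNet={pen:.3f}")
```

For values above 1 the squared penalty is the larger one, but below 1 it falls under the absolute-value penalty and nearly vanishes close to zero, while the mixed penalty always lies between the two.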