# A trade-off between fitting the training data well and keeping parameters small

Ridge Regression allows you to regularize the coefficients. This means that the estimated coefficients are pushed towards 0, which makes them generalize better to new datasets ("optimized for prediction"). This lets you use complex models and avoid over-fitting at the same time.

For Ridge Regression you have to set an α ("alpha") - a so-called "meta-parameter" (or "regularization parameter") that defines how aggressively regularization is performed. Alpha simply sets the regularization strength and is usually chosen by cross-validation.
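The cross-validation step can be sketched with scikit-learn's `RidgeCV`, which scores each candidate alpha on held-out folds. The data and the alpha grid below are purely illustrative assumptions.

```python
# Illustrative sketch: choosing alpha for Ridge by cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic data, only for demonstration.
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

# Search a logarithmic grid of candidate alphas; RidgeCV keeps the one
# with the best cross-validated score.
alphas = np.logspace(-3, 3, 13)
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print(model.alpha_)  # the alpha selected by 5-fold cross-validation
```

The log-spaced grid is a common choice because the useful range of alpha typically spans several orders of magnitude.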

If α is too large, the coefficients are shrunk too strongly towards zero, and thus the model becomes too simple, causing underfitting.
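The shrinkage effect can be seen directly by fitting Ridge with increasingly large alphas and watching the coefficient magnitudes collapse. The data and alpha values here are illustrative assumptions, not prescriptions.

```python
# Sketch: as alpha grows, Ridge coefficients shrink towards zero,
# eventually leaving a model too simple for the data (underfitting).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data, for illustration only.
X, y = make_regression(n_samples=50, n_features=5, noise=5.0, random_state=0)

for alpha in [0.01, 1.0, 100.0, 10000.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    # Mean coefficient magnitude drops as alpha rises.
    print(alpha, np.abs(coef).mean())
```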

# Ridge regression is an extension of linear regression

Ridge regression is basically a regularized linear regression model. The α parameter is a scalar that should be learned as well, using a method called cross-validation.

An extremely important fact we need to notice about ridge regression is that it enforces the β coefficients to be lower, but it does not enforce them to be zero. That is, it will not get rid of irrelevant features but rather minimize their impact on the trained model.
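This contrast with the Lasso can be checked empirically: on data with a few informative features, Ridge keeps every coefficient nonzero while Lasso sets many exactly to zero. The dataset and alpha values below are illustrative assumptions.

```python
# Sketch: Ridge shrinks coefficients but does not zero them out;
# Lasso drives irrelevant coefficients exactly to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 3 of which are informative.
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=1.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print((ridge.coef_ == 0).sum())  # Ridge: no coefficients are exactly zero
print((lasso.coef_ == 0).sum())  # Lasso: many irrelevant coefficients are zero
```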

# Elastic Net is a method that combines both the Lasso and Ridge penalties.

The LASSO method has some limitations:

• In a small-n-large-p dataset (high-dimensional data with few examples), the LASSO selects at most n variables before it saturates.

• If there is a group of highly correlated variables, the LASSO tends to select one variable from the group and ignore the others.

To overcome these limitations, the elastic net adds a quadratic part to the L1 penalty; the quadratic penalty used alone is ridge regression (known also as Tikhonov regularization, or L2). The estimates from the (naive) elastic net method are defined by

β̂ = argmin_β ( ‖y − Xβ‖² + λ₂‖β‖² + λ₁‖β‖₁ )
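A minimal sketch of the grouping behaviour, using scikit-learn's `ElasticNet`: on a group of nearly identical (highly correlated) features, the L2 part of the penalty tends to spread weight across the group rather than picking a single member. The data, alpha, and l1_ratio here are illustrative assumptions.

```python
# Sketch: Elastic Net on a group of highly correlated features.
# l1_ratio mixes the penalties (1.0 = pure Lasso, 0.0 = pure Ridge).
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Three nearly identical copies of one underlying signal.
X = np.hstack([base + 0.01 * rng.normal(size=(200, 1)) for _ in range(3)])
y = X.sum(axis=1) + 0.1 * rng.normal(size=200)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)  # weight tends to be shared across the correlated group
```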

# Consider the plots of the absolute-value and square functions.

When minimizing a loss function with a regularization term, each entry in the parameter vector theta is “pulled” down towards zero. Think of each entry in theta lying on one of the above curves and being subjected to “gravity” proportional to the regularization strength. In the context of L1-regularization, the entries of theta are pulled towards zero proportionally to their absolute values — they lie on the red curve.

In the context of L2-regularization, the entries are pulled towards zero proportionally to their squares — the blue curve.
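The "pull" each penalty exerts is just its gradient, which can be checked with a few numbers. This plain-Python sketch is an assumption-free restatement of the derivatives: d/dθ |θ| = sign(θ) has constant magnitude, while d/dθ θ² = 2θ vanishes near zero.

```python
# Numeric sketch of each penalty's "pull" (its gradient) on a parameter theta.
for theta in [2.0, 0.5, 0.01]:
    l1_pull = 1.0 if theta > 0 else -1.0  # d/dtheta |theta| = sign(theta)
    l2_pull = 2 * theta                   # d/dtheta theta**2 = 2*theta
    print(theta, l1_pull, l2_pull)
```

For theta near zero the L2 pull is tiny while the L1 pull is still at full strength, which is exactly the asymmetry the next paragraphs describe.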

At first glance, L2 seems more severe, but the caveat is that a different picture emerges as the parameters approach zero.

The result is that L2 regularization drives many of your parameters down but will not necessarily eradicate them, since the penalty all but disappears near zero. By contrast, L1 regularization forces non-essential entries of theta all the way to zero.

Adding Elastic Net (with an equal mix of L1 and L2) to the picture, we can see it functions as a compromise between the two. One can imagine bending the yellow curve towards either the red or the blue one by tuning the mixing hyperparameter (the L1 ratio).
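The blended curve itself is easy to compute: the elastic-net penalty on a single parameter is just a convex combination of the absolute value and the square. The helper name `enet_penalty` below is a hypothetical illustration, not a library function.

```python
# Sketch: the elastic-net penalty on one parameter as a blend of
# |theta| (L1) and theta**2 (L2), controlled by the mixing ratio.
def enet_penalty(theta, l1_ratio=0.5):
    """Convex combination of the L1 and L2 penalties for one parameter."""
    return l1_ratio * abs(theta) + (1 - l1_ratio) * theta ** 2

for theta in [0.1, 1.0, 2.0]:
    print(theta, abs(theta), theta ** 2, enet_penalty(theta))
```

With `l1_ratio=0.5` the blended value always lies between the pure L1 and pure L2 penalties, which is the "bending" of the yellow curve described above.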