Extensions To Linear Models

Interactions

In statistics, an interaction is a property of three or more variables in which two (or more) of them affect a third variable in a non-additive way. In other words, the combined effect of the interacting variables on the outcome is more (or less) than the sum of their individual effects.
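
As a minimal sketch (the column names and numbers below are made up purely for illustration), an interaction term can be added by hand as the product of two predictors before fitting a linear model:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: two predictors and a target (illustrative values only)
df = pd.DataFrame({
    "tv_spend":    [230, 44, 17, 151, 180],
    "radio_spend": [38, 39, 45, 41, 11],
    "sales":       [22, 10, 9, 18, 13],
})

# Add an interaction term by multiplying the two predictors
df["tv_x_radio"] = df["tv_spend"] * df["radio_spend"]

X = df[["tv_spend", "radio_spend", "tv_x_radio"]]
y = df["sales"]

model = LinearRegression().fit(X, y)
print(model.coef_)  # the third coefficient captures the interaction effect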

Polynomial regression

from sklearn.preprocessing import PolynomialFeatures

# Expand X with polynomial terms (and their interactions) up to degree 6
poly = PolynomialFeatures(degree=6)
X_fin = poly.fit_transform(X)
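
A short follow-up sketch (assuming X_fin from above and a target vector y are already defined) fits an ordinary linear regression on the expanded features:

from sklearn.linear_model import LinearRegression

# Fit a linear regression on the polynomial feature matrix
reg = LinearRegression()
reg.fit(X_fin, y)
print(reg.score(X_fin, y))  # R^2 on the training data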

Bias/Variance Trade-Off

Underfitting and Overfitting

Let's formalize this:

Underfitting happens when a model cannot capture the structure of the training data, and as a result it also cannot generalize to new data.

Our simple linear regression model fitted earlier was an underfitted model.

Overfitting happens when a model fits the training data too well, to the point where it captures noise in that data and no longer generalizes to new data.
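
To make this concrete, here is a minimal sketch (the data is generated just for illustration) that compares training and test R^2 for polynomial fits of degree 1, 2, and 15: the degree-1 model underfits, while the degree-15 model overfits.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy data: a noisy quadratic relationship
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(60, 1)), axis=0)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=1.0, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

for degree in [1, 2, 15]:
    poly = PolynomialFeatures(degree)
    X_tr = poly.fit_transform(X_train)
    X_te = poly.transform(X_test)
    reg = LinearRegression().fit(X_tr, y_train)
    # A large gap between train and test R^2 signals overfitting;
    # low scores on both signal underfitting
    print(degree, reg.score(X_tr, y_train), reg.score(X_te, y_test))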

Ridge and Lasso Regression

Lasso and ridge regression are two commonly used regularization techniques. Regularization is the general term for techniques that combat overfitting, typically by penalizing large model coefficients.

\text{cost\_function\_ridge} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p m_j^2 = \sum_{i=1}^n \Big(y_i - \big(\sum_{j=1}^k m_j x_{ij} + b\big)\Big)^2 + \lambda \sum_{j=1}^p m_j^2

Ridge regression is often also referred to as L2 Norm Regularization

\text{cost\_function\_lasso} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p \mid m_j \mid = \sum_{i=1}^n \Big(y_i - \big(\sum_{j=1}^k m_j x_{ij} + b\big)\Big)^2 + \lambda \sum_{j=1}^p \mid m_j \mid

Lasso regression is often also referred to as L1 Norm Regularization

from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.model_selection import train_test_split

...

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)

# Ridge and Lasso regression models.
# Note that in scikit-learn the regularization parameter is
# called alpha (not lambda)
ridge = Ridge(alpha=0.5)
ridge.fit(X_train, y_train)

lasso = Lasso(alpha=0.5)
lasso.fit(X_train, y_train)
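
As a follow-up sketch (reusing the objects fitted above), comparing coefficients illustrates what regularization does: ridge shrinks coefficients toward zero, and lasso can set some of them exactly to zero, effectively performing feature selection.

# Compare coefficients of the unregularized and regularized models
lin = LinearRegression().fit(X_train, y_train)
print("linear:", lin.coef_)
print("ridge: ", ridge.coef_)
print("lasso: ", lasso.coef_)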

AIC and BIC

AIC ("Akaike's Information Criterion")

AIC(model) = -2 * log-likelihood(model) + 2 * (length of the parameter space)

BIC (Bayesian Information Criterion)

BIC(model) = -2 * log-likelihood(model) + log(number of observations) * (length of the parameter space)
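
As a minimal sketch (the data below is generated purely for illustration), statsmodels exposes the log-likelihood, AIC, and BIC of a fitted OLS model directly:

import numpy as np
import statsmodels.api as sm

# Toy data for illustration only
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=100)

# statsmodels computes AIC and BIC from the fitted model's log-likelihood
results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.llf)   # log-likelihood
print(results.aic)   # -2 * log-likelihood + 2 * (number of parameters)
print(results.bic)   # -2 * log-likelihood + log(n) * (number of parameters)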

Uses of the AIC and BIC

  • Performing feature selection: compare models with fewer and with more variables, compute the AIC/BIC for each, and keep the set of features that produces the lowest AIC or BIC (a sketch of this is shown after the list below)

  • Similarly, selecting or not selecting interactions/polynomial features depending on whether or not the AIC/BIC decreases when adding them in

  • Computing the AIC and BIC for several values of the regularization parameter in Ridge/Lasso models and selecting the best regularization parameter.

  • Many more!
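
For example, here is a minimal sketch of the feature-selection use case (toy data generated just for illustration), comparing the AIC of a model with and without an interaction term:

import numpy as np
import statsmodels.api as sm

# Toy data for illustration only
rng = np.random.RandomState(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 2 * x1 + 0.5 * x1 * x2 + rng.normal(size=200)

# Candidate 1: main effects only
X_small = sm.add_constant(np.column_stack([x1, x2]))
# Candidate 2: main effects plus the interaction term
X_big = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))

aic_small = sm.OLS(y, X_small).fit().aic
aic_big = sm.OLS(y, X_big).fit().aic

# Prefer the model with the lower AIC
print(aic_small, aic_big)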
