Linear Regression
A simple supervised learning model for predicting a target that has a linear relationship with the features.
- Predictive task: regression (real-valued output)
- Parametric
- No hyperparameters (in its basic form)
- Assumes features have a linear relationship to the target variable
You predict a quantitative response Y on the basis of a single predictor variable X:
Y is approximately modeled as B[0] + B[1]X.
The B values (coefficients or parameters) are two unknown constants that represent the
intercept and slope. B[0] represents the baseline value of Y when X = 0.
By estimating the B values you can find an intercept and slope such that the resulting line is close to the observed data. The most common approach is least squares: minimize the residual sum of squares, or equivalently the mean squared error (MSE).
Other measures (a sketch computing them follows this list):
- mean absolute error
- mean absolute percentage error
- root mean squared error
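A minimal sketch of these measures with NumPy; the arrays here are hypothetical and only illustrate the formulas.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute the error measures listed above for a set of predictions."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)                     # mean squared error
    mae = np.mean(np.abs(err))                  # mean absolute error
    mape = np.mean(np.abs(err / y_true)) * 100  # mean absolute percentage error (y_true must be nonzero)
    rmse = np.sqrt(mse)                         # root mean squared error
    return {"mse": mse, "mae": mae, "mape": mape, "rmse": rmse}

# Hypothetical values just to show usage
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.3])
print(regression_metrics(y_true, y_pred))
```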
Types of variables
- real valued
- categorical
- ordinal (grade/age)
- non-ordinal (gender/race)
To use non-ordinal variables you can encode them as binary indicator (dummy) variables; go with the simplest encoding that works.
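One common way to do this is dummy (one-hot) encoding. A small sketch with pandas; the data frame and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one nominal (non-ordinal) column.
df = pd.DataFrame({
    "hours_studied": [2, 5, 1, 4],
    "major": ["math", "biology", "math", "history"],  # nominal categories
})

# One-hot (dummy) encode the nominal column; drop_first=True keeps the
# encoding as simple as possible and avoids a redundant, collinear column.
encoded = pd.get_dummies(df, columns=["major"], drop_first=True)
print(encoded)
```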
Single variable
Let
y_hat[i] = B_hat[0] + B_hat[1]x[i]
be the prediction for Y based on the ith value of X. Then e[i] = y[i] - y_hat[i]
represents the ith residual: the difference between the ith observed response value and
the ith predicted response value.
B_hat[1] = Cov(X, Y) / Var(X)
B_hat[0] = E[Y] - B_hat[1] E[X]
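A small sketch that applies these two formulas directly with NumPy; the data is simulated just to check that the estimates recover the intercept and slope.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Estimate B_hat[0] and B_hat[1] from the covariance/variance formulas above."""
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # Cov(X, Y) / Var(X)
    b0 = np.mean(y) - b1 * np.mean(x)                    # E[Y] - B_hat[1] E[X]
    return b0, b1

# Simulated data roughly following y = 1 + 2x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=50)
b0, b1 = fit_simple_ols(x, y)
print(b0, b1)  # should be close to 1 and 2
```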
This resulting line is the least squares line, which is different from the true population line you would get if you could observe all the data. For any particular sample the least squares line drifts away from the population line, just as a sample mean drifts from the population mean in statistics.
Multivariate case
RSS(B) = ||y - XB||^2   (the MSE is this divided by n)
The gradient is X^T(XB - y); setting it to zero gives the normal equations X^T X B_hat = X^T y.
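A sketch of the multivariate fit with NumPy: np.linalg.lstsq solves the same least squares problem as the normal equations, and the design matrix and coefficients below are made up for illustration.

```python
import numpy as np

# Hypothetical design matrix: 100 samples, an intercept column plus 3 predictors.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
true_B = np.array([0.5, 2.0, -1.0, 3.0])
y = X @ true_B + rng.normal(scale=0.5, size=100)

# Solving the normal equations (X^T X) B_hat = X^T y; lstsq does this
# in a numerically stable way.
B_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(B_hat)  # should be close to true_B
```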
Model fitness
R^2 is a measure of how well your model fits:
R^2 = 1 - RSS/TSS
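As a small sketch, R^2 can be computed directly from this definition (hypothetical helper, assuming NumPy arrays y and y_hat).

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS: the fraction of variance in y explained by the model."""
    rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return 1.0 - rss / tss
```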
Summary
- How to determine coefficients
Least squares: minimize the residual sum of squares (RSS), equivalently the mean squared error (MSE).
- How well does the model fit
A (high) R-squared, computed from RSS and TSS.
Also the F-statistic, F = [(TSS - RSS)/p] / [RSS/(n - p - 1)], to test whether at least one
predictor in the model is significant.
- How significant are the coefficients
Standard error of the coefficients. T-score, (low) p-value, and hypothesis testing to decide whether they are significant. Confidence intervals. (A sketch that reports these diagnostics follows the list.)
- How well does the model predict on unseen data
Hold out a test set (or cross-validate) and compare training error with test error.
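One way to get most of the in-sample diagnostics above at once is an OLS summary from statsmodels (assuming that library is available); the data below is simulated purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: two predictors and a response.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

X_design = sm.add_constant(X)  # adds the intercept column
model = sm.OLS(y, X_design).fit()

# The summary reports coefficients, standard errors, t-scores, p-values,
# confidence intervals, R-squared, and the overall F-statistic.
print(model.summary())
```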
Multiple Linear Regression
Can also add higher order terms: polynomial regression
y = a[0] + a[1]x[1] + a[2]x[2] + ...
y = a[0] + a[1]x[1] + a[2]x[1]^2 + ...
The x terms can themselves be functions of the original predictors (basis functions).
Monitor training and test error as you add terms and increase the complexity of terms in the model.
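A sketch of that workflow with NumPy: fit polynomials of increasing degree to simulated data and compare training and test MSE (the true function and split sizes are made up).

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=120)
y = 1.0 + 0.5 * x - 0.8 * x ** 2 + rng.normal(scale=0.5, size=120)

# Simple train/test split (hypothetical sizes).
x_train, x_test = x[:80], x[80:]
y_train, y_test = y[:80], y[80:]

def poly_design(x, degree):
    """Columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, N=degree + 1, increasing=True)

# Increase the polynomial degree and watch training vs. test error.
for degree in range(1, 6):
    Xtr, Xte = poly_design(x_train, degree), poly_design(x_test, degree)
    B_hat, *_ = np.linalg.lstsq(Xtr, y_train, rcond=None)
    train_mse = np.mean((y_train - Xtr @ B_hat) ** 2)
    test_mse = np.mean((y_test - Xte @ B_hat) ** 2)
    print(degree, round(train_mse, 3), round(test_mse, 3))
```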
Assumptions
Linear regression works off some fundamental assumptions that are often broken in real world scenarios.
- All predictors X[1], ..., X[p] have a linear relationship with Y.
- In general, predictors might be correlated.
- There may be interactions between predictors.
Predictor correlation comes from:
- redundant information
- underlying effect (confounding/causality)
- correlated in nature
Why it matters
- Pairwise correlation above roughly 0.7 is usually considered problematic.
- Collinearity among features is problematic: it inflates the variance of the coefficient estimates.
The variance inflation factor (VIF) detects multicollinearity:
VIF(B_hat[i]) = 1 / (1 - R^2[X[i]|X[-i]])
where R^2[X[i]|X[-i]] is the R^2 from regressing X[i] on all the other predictors.
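A sketch of computing VIF by hand with NumPy, regressing each predictor on the others; the data is simulated so that one column is nearly a copy of another.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for i in range(p):
        xi = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n), others])  # regress X[i] on the rest
        coef, *_ = np.linalg.lstsq(design, xi, rcond=None)
        resid = xi - design @ coef
        r2 = 1.0 - resid @ resid / np.sum((xi - xi.mean()) ** 2)
        out[i] = 1.0 / (1.0 - r2)
    return out

# Simulated predictors where the third column nearly duplicates the first.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
X = np.column_stack([X, X[:, 0] + rng.normal(scale=0.1, size=200)])
print(vif(X))  # the collinear columns should show large VIFs
```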
Bias-Variance Trade Off
Bias measures how far the average prediction is from the true value; variance measures how much the predictions vary across different training sets.
As your model complexity increases, bias generally decreases while variance generally increases.
Expected test MSE = Var(f_hat(x)) + Bias(f_hat(x))^2 + Var(epsilon)
where Var(epsilon) is the irreducible error.
You want to optimize for the minimum MSE value.
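A small simulation sketch of this decomposition with NumPy: refit polynomial models of increasing degree on many simulated training sets and estimate the bias and variance of the prediction at a single point (the true function, noise level, and sizes are all made up).

```python
import numpy as np

rng = np.random.default_rng(5)
f = np.sin                            # hypothetical true regression function
x0, sigma, n, reps = 1.5, 0.3, 30, 500

for degree in (1, 3, 9):              # increasing model complexity
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, np.pi, size=n)
        y = f(x) + rng.normal(scale=sigma, size=n)
        X = np.vander(x, N=degree + 1, increasing=True)
        B_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        X0 = np.vander(np.array([x0]), N=degree + 1, increasing=True)
        preds[r] = (X0 @ B_hat)[0]    # prediction at x0 from this training set
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    # bias^2 + variance + irreducible error approximates the expected test MSE at x0
    print(degree, round(bias2, 4), round(var, 4), round(bias2 + var + sigma ** 2, 4))
```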
Feature selection
Forward selection
Add features one at a time, at each step choosing the feature that most increases the model's R^2 (a sketch follows at the end of this section).
Backward selection
Start with the model containing all features and remove the one with the largest p-value. Repeat until every remaining feature's p-value is below a chosen threshold.
Mixed selection
Add features to maximize R^2 (as in forward selection), but remove any feature whose p-value rises above a chosen threshold as new features enter.
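A sketch of forward selection by R^2 with NumPy; the helper names and simulated data are made up for illustration.

```python
import numpy as np

def r2(X, y):
    """R^2 of an OLS fit of y on X (X already includes an intercept column)."""
    B_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ B_hat
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def forward_selection(X, y, k):
    """Greedily add the feature that most increases R^2, up to k features."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(k):
        scores = []
        for j in remaining:
            design = np.column_stack([np.ones(n), X[:, selected + [j]]])
            scores.append((r2(design, y), j))
        best_r2, best_j = max(scores)
        selected.append(best_j)
        remaining.remove(best_j)
        print(f"added feature {best_j}, R^2 = {best_r2:.3f}")
    return selected

# Simulated data: only features 0 and 2 actually matter.
rng = np.random.default_rng(6)
X = rng.normal(size=(150, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(size=150)
print(forward_selection(X, y, k=3))
```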