Multiple Linear Regression
Let there be \( p \) predictors. We model the response, \( Y \), as
\begin{equation} Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon \tag{1} \end{equation}In an ideal scenario the predictors are uncorrelated. Each coefficient can then be interpreted in isolation, and we can make causal statements like: a one-unit change in \(X_j\) results in a change of \(\beta_j\) in \(Y\), with the other predictors held fixed.
But in practice the predictors are usually correlated. This causes two problems:
- The variance of all the coefficient estimates increases, sometimes dramatically.
- Interpretation becomes hazardous: when \(X_j\) changes, everything else changes with it.
Hence, claims of causality should be avoided for observational data.
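The inflation of coefficient variance under correlated predictors can be quantified with the variance inflation factor (VIF), \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing \(X_j\) on the other predictors. A minimal sketch with hypothetical simulated data (the variable names and the 0.9 correlation strength are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: x2 is strongly correlated with x1.
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)

def vif(target, others):
    """Variance inflation factor: 1 / (1 - R^2) from regressing
    one predictor on the remaining predictors (with intercept)."""
    X = np.column_stack([np.ones(len(target)), others])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    r2 = 1 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

print(vif(x1, x2))  # far above 1: the coefficient variance is badly inflated
```

A VIF near 1 indicates an uncorrelated predictor; values well above that signal the variance blow-up described above.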
1. Is at least one predictor useful?
We can determine this using the F-statistic.
\begin{equation} F = \frac{\frac{TSS - RSS}{p}} {\frac{RSS}{n - p - 1}} \end{equation}The numerator, \(\frac{TSS - RSS}{p}\), captures the drop in training error per predictor relative to the null model (\(\hat{y} = \bar{y}\)).
The denominator, \(\frac{RSS}{n - p - 1}\), captures the residual sum of squares per residual degree of freedom.
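The F-statistic can be computed directly from the two sums of squares. A sketch on hypothetical simulated data, where only the first of three predictors actually matters:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3

# Hypothetical data: y depends on the first predictor only.
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)

# Fit the full model with an intercept via least squares.
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

rss = np.sum((y - Xd @ beta) ** 2)   # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares (null model)

f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
print(f_stat)  # far above 1: at least one predictor is useful
```

Under the null hypothesis (no predictor is useful) F is expected to be near 1, so a large value like this is evidence that at least one coefficient is nonzero.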
2. What are important predictors?
For a small number of predictors, \(p\), it is easy to fit and compare models using every possible subset of predictors. But as \(p\) grows, the number of possible models grows exponentially, as \(2^p\).
More practical alternatives are forward selection and backward selection.
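Forward selection can be sketched in a few lines: start with no predictors and greedily add the one that reduces the residual sum of squares the most. This toy version (hypothetical simulated data; a real implementation would use a validation criterion rather than raw training RSS) illustrates the idea:

```python
import numpy as np

def rss(X, y):
    """RSS of an ordinary least-squares fit with intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.sum((y - Xd @ beta) ** 2)

def forward_select(X, y, k):
    """Greedily add the predictor that lowers RSS the most,
    until k predictors have been chosen."""
    chosen, remaining = [], list(range(X.shape[1]))
    while len(chosen) < k:
        best = min(remaining, key=lambda j: rss(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 4] + 1.5 * X[:, 1] + rng.normal(size=300)
print(forward_select(X, y, 2))  # should recover columns 4 and 1
```

Each step fits only \(p\) candidate models instead of \(2^p\), which is what makes the greedy approach tractable.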
3. How to deal with categorical variables?
Dummy variables are used to handle categorical variables. If a categorical variable has \(K\) levels, we introduce \(K-1\) dummy variables. The category for which no dummy variable is included is referred to as the baseline (reference) category.
Using \(K-1\) rather than \(K\) dummies is required to avoid perfect multicollinearity: with an intercept in the model, the \(K\)-th dummy would be an exact linear combination of the intercept and the other dummies.
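A minimal encoding sketch, using a hypothetical categorical variable with \(K = 3\) levels (the variable name and levels are made up for illustration):

```python
import numpy as np

# Hypothetical categorical predictor with K = 3 levels.
region = np.array(["east", "west", "north", "east", "west", "north"])

levels = np.unique(region)   # sorted: ['east', 'north', 'west']
baseline = levels[0]         # 'east' acts as the reference category

# K - 1 dummy columns: one indicator per non-baseline level.
dummies = np.column_stack([(region == lvl).astype(float) for lvl in levels[1:]])
print(dummies.shape)  # (6, 2): K - 1 = 2 dummy variables
```

Rows belonging to the baseline category are all zeros, so the baseline's effect is absorbed into the intercept and each dummy coefficient measures a difference from the baseline.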
4. Interaction effect
In multiple linear regression, it is assumed that the relationship between the response variable and the predictor variables is additive. In reality that is often not the case: the effect of one predictor can depend on the level of another. More details can be found at interaction effect.
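An interaction can be modeled by adding the product of two predictors as an extra column. A sketch on hypothetical simulated data with a genuine \(X_1 X_2\) interaction:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x1, x2 = rng.normal(size=n), rng.normal(size=n)

# Hypothetical response with a true interaction coefficient of 1.5.
y = 1.0 + 2.0 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + 0.1 * rng.normal(size=n)

# Add the product x1 * x2 as a column alongside the main effects.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta.round(2))  # interaction coefficient estimated near 1.5
```

With the interaction term included, the effect of \(x_1\) on \(y\) is \(\beta_1 + \beta_3 x_2\): it depends on the level of \(x_2\), which is exactly the non-additive behavior described above.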
5. Non-linear effects of predictors
Standard linear regression assumes a straight-line relationship between each predictor and the outcome, but real-world data often curve. For example, a car's fuel efficiency in relation to its horsepower.
To capture this, we can extend standard linear regression by incorporating \( d \)-th order polynomial terms of the predictors. Care is needed when selecting the value of \( d \): a large \( d \) may lead to overfitting of the training data.
For example (one predictor)
\begin{equation} Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_1^3 + \ldots + \beta_d X_1^d + \epsilon \end{equation}
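Polynomial regression is still linear regression: the model is linear in the coefficients, so ordinary least squares applies once the powers of \(X_1\) are added as columns. A sketch with \(d = 2\) on hypothetical quadratic data:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=300)

# Hypothetical curved relationship: y = 1 - 2x + 0.5 x^2 + noise.
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * rng.normal(size=300)

# Design matrix with columns 1, x, x^2, then an ordinary OLS fit.
d = 2
X = np.column_stack([x**k for k in range(d + 1)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta.round(2))  # estimates close to [1.0, -2.0, 0.5]
```

Raising \(d\) adds flexibility but, as noted above, a large \(d\) will chase the noise in the training data rather than the underlying curve.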