Practical Statistics for Data Scientists, Chapter Four: Regression and Prediction

Sisi (Rachel) Chen
8 min read · Dec 9, 2019


4.1 Simple Linear Regression

Simple linear regression models the relationship between the magnitude of one variable and that of a second. The difference between regression and correlation is that while correlation measures the strength of an association between two variables, regression quantifies the nature of that relationship.

With regression, we are trying to predict the Y variable from X using a linear relationship.

X: Independent variable, feature

Y: Dependent variable, target

The regression equation is Y = b0 + b1*X + e, where b0 is known as the intercept and b1 as the slope for X. In general, the data doesn’t fall exactly on a line, so the regression equation includes an explicit error term e.

The fitted values, also referred to as the predicted values, are typically denoted by Ŷ (Y-hat). The notation b0-hat and b1-hat indicates that the coefficients are estimated rather than known.

Least Squares

Least squares minimizes the sum of the squared residual values, known as the residual sum of squares, or RSS.

The method of minimizing the sum of the squared residuals is termed least squares regression or ordinary least squares.
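As a minimal sketch of ordinary least squares in Python (the data here is synthetic, made up purely for illustration; it is not the book’s example):

```python
# Minimal sketch: ordinary least squares with scikit-learn on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))              # single predictor
y = 2.5 + 1.3 * X[:, 0] + rng.normal(0, 1, 100)    # Y = b0 + b1*X + error

model = LinearRegression()
model.fit(X, y)

print("Intercept (b0-hat):", model.intercept_)
print("Slope (b1-hat):", model.coef_[0])

fitted = model.predict(X)          # Y-hat, the fitted values
residuals = y - fitted             # OLS minimizes sum(residuals**2), the RSS
print("RSS:", np.sum(residuals ** 2))
```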

REGRESSION TERMINOLOGY
In its formal statistical sense, regression also includes nonlinear models that yield a functional relationship between predictors and outcome variables. In the machine learning community, the term is also occasionally used loosely to refer to the use of any predictive model that produces a predicted numeric outcome.

4.2 Multiple Linear Regression

How to assess the multiple linear regression model?

From a data science perspective, the most important performance metric is the root mean squared error (RMSE). Other quantities reported with a regression include the residual standard error (RSE), R-squared, adjusted R-squared, the standard errors of the coefficients, and t-statistics.

RMSE: the square root of the average squared error of the predictions, RMSE = sqrt( Σ (ŷ_i − y_i)² / n ).

RSE: the same as the root mean squared error but adjusted for degrees of freedom, RSE = sqrt( Σ (ŷ_i − y_i)² / (n − p − 1) ), where p is the number of predictors.

R-squared: the proportion of variance explained by the model, ranging from 0 to 1; also called the coefficient of determination or goodness of fit.

Adjusted R-squared: a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it’s usually not. It is always lower than the R-squared.
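A rough sketch of how these metrics can be computed by hand, continuing the synthetic data and fitted model from the least squares sketch above (n is the number of records, p the number of predictors):

```python
# Sketch: RMSE, RSE, R-squared, and adjusted R-squared from the earlier fit.
# Assumes `X`, `y`, and `fitted` from the previous sketch.
import numpy as np

n, p = X.shape
rss = np.sum((y - fitted) ** 2)       # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares

rmse = np.sqrt(rss / n)                         # root mean squared error
rse = np.sqrt(rss / (n - p - 1))                # adjusted for degrees of freedom
r2 = 1 - rss / tss                              # proportion of variance explained
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalizes extra predictors

print(rmse, rse, r2, adj_r2)
```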

t-statistic and p-value:

The t-statistic for a predictor is its coefficient divided by the standard error of the coefficient, giving a metric by which to compare the importance of variables in the model. The standard error is an estimate of the standard deviation of the coefficient, the amount it varies across cases. It can be thought of as a measure of the precision with which the regression coefficient is measured. If a coefficient is large compared to its standard error, then it is probably different from 0.

How large is large? Your regression software compares the t statistic on your variable with values in the Student’s t distribution to determine the P-value, which is the number that you really need to be looking at. The Student’s t distribution describes how the mean of a sample with a certain number of observations (your n) is expected to behave.

If 95% of the t distribution is closer to the mean than the t-value on the coefficient you are looking at, then you have a P-value of 5%. This is also referred to as a significance level of 5%. The P-value is the probability of seeing a result as extreme as the one you are getting (a t value as large as yours) in a collection of random data in which the variable had no effect. A P-value of 5% (or .05) or less is the generally accepted point at which to reject the null hypothesis: a result this extreme would come up only about 5% of the time if the variable truly had no effect, so you can be reasonably confident that the variable is having some effect, assuming your model is specified correctly.

The 95% confidence interval for your coefficients shown by many regression packages gives you the same information. You can be 95% confident that the real, underlying value of the coefficient that you are estimating falls somewhere in that 95% confidence interval, so if the interval does not contain 0, your P-value will be .05 or less.

Note that the size of the P-value for a coefficient says nothing about the size of the effect that variable is having on your dependent variable — it is possible to have a highly significant result (very small P-value) for a minuscule effect.
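As an illustration, a statsmodels fit reports the standard error, t-statistic, p-value, and confidence interval for every coefficient in one table; this sketch reuses the synthetic X and y from the earlier sketches:

```python
# Sketch: coefficient table with std err, t, P>|t|, and 95% CI via statsmodels.
import statsmodels.api as sm

X_const = sm.add_constant(X)            # add the intercept term explicitly
results = sm.OLS(y, X_const).fit()

print(results.summary())                # coef, std err, t, P>|t|, [0.025, 0.975]
print(results.conf_int(alpha=0.05))     # 95% confidence intervals for b0, b1
```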

Cross-Validation
This idea of “out-of-sample” validation is not new, but it did not really take hold until larger data sets became more prevalent; with a small data set, analysts typically want to use all the data and fit the best possible model.

The algorithm for basic k-fold cross-validation is as follows:
1. Set aside 1/k of the data as a holdout sample.
2. Train the model on the remaining data.
3. Apply (score) the model to the 1/k holdout, and record the needed model assessment metrics.
4. Restore the first 1/k of the data, and set aside the next 1/k (excluding any records that got picked the first time).
5. Repeat steps 2 and 3.
6. Repeat until each record has been used in the holdout portion.
7. Average or otherwise combine the model assessment metrics.
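A minimal sketch of k-fold cross-validation with scikit-learn, again reusing the synthetic X and y (k = 5 is an arbitrary choice here):

```python
# Sketch: 5-fold cross-validation of a linear regression, scored by RMSE.
# scikit-learn reports negative RMSE, so the sign is flipped for display.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_root_mean_squared_error")
print("RMSE per fold:", -scores)
print("Mean RMSE:", -scores.mean())
```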

Including additional variables always reduces RMSE and increases R-squared on the training data, so these metrics are not appropriate to help guide the model choice. AIC penalizes adding terms to a model. In the case of regression, AIC has the form:

AIC = 2P + n log(RSS/n), where P is the number of variables in the model and n is the number of records.

The goal is to find the model that minimizes AIC; a model with k extra variables is penalized by 2k. How do we find that model? One approach is to search through all possible models, called all subset regression. This is computationally expensive and is not feasible for problems with large data and many variables. An attractive alternative is to use stepwise regression, which successively adds and drops predictors to find a model that lowers AIC.
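A rough sketch of comparing two candidate models by AIC with statsmodels; the housing DataFrame and its column names below are invented for illustration and are not the book’s data:

```python
# Sketch: comparing nested regression models by AIC on made-up housing data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
house = pd.DataFrame({
    "sqft": rng.uniform(500, 4000, 200),
    "bedrooms": rng.integers(1, 6, 200),
    "year_built": rng.integers(1950, 2020, 200),
})
house["price"] = 50_000 + 200 * house["sqft"] + rng.normal(0, 50_000, 200)

full = smf.ols("price ~ sqft + bedrooms + year_built", data=house).fit()
small = smf.ols("price ~ sqft", data=house).fit()

print("Full model AIC: ", full.aic)     # prefer the model with the lower AIC
print("Small model AIC:", small.aic)
```

Stepwise procedures automate exactly this kind of comparison, adding or dropping one predictor at a time and keeping the change only if AIC goes down.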

Weighted Regression
Weighted regression is used by statisticians for a variety of purposes; in
particular, it is important for the analysis of complex surveys. Data scientists may find weighted regression useful in two cases:
1. Inverse-variance weighting when different observations have been measured with different precision.
2. Analysis of data in an aggregated form such that the weight variable encodes how many original observations each row in the aggregated data represents.

For example, with the housing data, older sales are less reliable than more recent sales.
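A sketch of weighted least squares, reusing the invented house DataFrame from the AIC sketch and adding a made-up sale_year column so that more recent sales get larger weights:

```python
# Sketch: weighted least squares, weighting recent sales more heavily.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
house["sale_year"] = rng.integers(2005, 2020, len(house))
weights = house["sale_year"] - house["sale_year"].min() + 1   # newer = larger weight

X_w = sm.add_constant(house[["sqft", "bedrooms"]])
wls = sm.WLS(house["price"], X_w, weights=weights).fit()
print(wls.params)
```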

4.3 Prediction Using Regression

Much of statistics involves understanding and measuring variability (uncertainty). The t-statistics and p-values reported in regression output deal with this in a formal way, which is sometimes useful for variable selection. More useful metrics are confidence intervals, which are uncertainty intervals placed around regression coefficients and predictions. An easy way to understand this is via the bootstrap. Here is a bootstrap algorithm for generating confidence intervals for regression parameters (coefficients) for a data set with P predictors and n records (rows):

Bootstrap algorithm:

  1. Consider each row as a single ticket and place them all in a box.
  2. Draw a ticket at random, record the values, and replace it in the box.
  3. Repeat step 2 n times; you now have one bootstrap resample.
  4. Fit a regression to the bootstrap resample, and record the estimated coefficients.
  5. Repeat steps 2 through 4, say, 1,000 times.
  6. You now have 1,000 bootstrap values for each coefficient; find the appropriate percentiles for each one.
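A rough sketch of this coefficient bootstrap, reusing the synthetic X and y from the earlier sketches:

```python
# Sketch: bootstrap confidence intervals for regression coefficients.
# Resample rows with replacement, refit, and take percentiles of the estimates.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = len(y)
boot_coefs = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)       # draw n "tickets" with replacement
    m = LinearRegression().fit(X[idx], y[idx])
    boot_coefs.append(np.r_[m.intercept_, m.coef_])

boot_coefs = np.array(boot_coefs)
lo, hi = np.percentile(boot_coefs, [2.5, 97.5], axis=0)   # 95% intervals
print("Lower bounds:", lo)
print("Upper bounds:", hi)
```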

The bootstrap algorithm for modeling both the regression model error and the individual data point error would look as follows:
1. Take a bootstrap sample from the data (spelled out in greater detail earlier).
2. Fit the regression, and predict the new value.
3. Take a single residual at random from the original regression fit, add it to the predicted value, and record the result.
4. Repeat steps 1 through 3, say, 1,000 times.
5. Find the 2.5th and the 97.5th percentiles of the results.
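A compact sketch of this prediction-interval bootstrap, again on the synthetic data from the earlier sketches; x_new is an arbitrary new record:

```python
# Sketch: bootstrap prediction interval = refit on a resample, predict x_new,
# and add a randomly drawn residual from the original fit each time.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
residuals = y - fitted                 # residuals from the original fit
x_new = np.array([[5.0]])              # a new record to predict
n = len(y)

sims = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)                          # bootstrap sample
    pred = LinearRegression().fit(X[idx], y[idx]).predict(x_new)[0]
    sims.append(pred + rng.choice(residuals))                 # add a random residual

print(np.percentile(sims, [2.5, 97.5]))    # approximate 95% prediction interval
```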

4.4 Factor Variables in Regression

Factor variables need to be converted into numeric variables for use in regression. The most common method to encode a factor variable with P distinct values is to represent them using P-1 dummy variables.
A factor variable with many levels, even in very big data sets, may need to be consolidated into a variable with fewer levels. Some factors have levels that are ordered and can be represented as a single numeric variable.
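A minimal sketch of dummy coding with pandas; the factor levels here are made up:

```python
# Sketch: encode a factor variable with P levels as P-1 dummy variables.
import pandas as pd

df = pd.DataFrame({"property_type": ["House", "Condo", "Townhouse", "House"]})
dummies = pd.get_dummies(df["property_type"], drop_first=True)  # drops one reference level
print(dummies)
```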

4.5 Interpreting the Regression Equation

Correlated variables are only one issue with interpreting regression coefficients. An extreme case of correlated variables produces multicollinearity — a condition in which there is redundancy among the predictor variables. Perfect multicollinearity occurs when one predictor variable can be expressed as a linear combination of others. Multicollinearity occurs when:

  1. A variable is included multiple times by error.
  2. P dummies, instead of P − 1 dummies, are created from a factor variable.
  3. Two variables are nearly perfectly correlated with one another.

Multicollinearity in regression must be addressed: variables should be removed until the multicollinearity is gone.
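One common way to screen for multicollinearity is the variance inflation factor (VIF); this sketch reuses the invented house DataFrame from earlier, and a rough rule of thumb flags predictors whose VIF is well above about 5:

```python
# Sketch: variance inflation factors for each predictor.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_m = sm.add_constant(house[["sqft", "bedrooms", "year_built"]])
vifs = {col: variance_inflation_factor(X_m.values, i)
        for i, col in enumerate(X_m.columns)}
print(vifs)   # the entry for the constant column can be ignored
```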

Confounding Variables
With correlated variables, the problem is one of commission: including different variables that have a similar predictive relationship with the response. With confounding variables, the problem is one of omission: an important variable is not included in the regression equation. Naive interpretation of the equation coefficients can lead to invalid conclusions.
