R makes it easy to fit a linear model to your data. The hard part is knowing whether the model you've built is worth keeping and, if so, figuring out what to do next.
This is a post about linear models in R, how to interpret lm results, and common rules of thumb to help side-step the most common mistakes.
Building a linear model in R
R makes building linear models really easy. Things like dummy variables, categorical features, interactions, and multiple regression all come very naturally. The centerpiece for linear regression in R is the lm function. lm comes with base R, so you don't have to install any packages or import anything special. The documentation for lm is very extensive, so if you have any questions about using it, just type ?lm into the R console.
For our example linear model, I'm going to use data from the original, or at least one of the earliest, linear regression models. The dataset consists of heights of children and their parents. The term "regression" stems from a 19th-century statistician's observation that children's heights tended to "regress" towards the population mean relative to their parents' heights.
Fit the model to the data by creating a formula and passing it to the lm function. In our case we want to use the parent's height to predict the child's height, so we make the formula (child ~ parent). In other words, we're representing the relationship between parents' heights (X) and children's heights (y).
We then set the data argument so that lm knows which data frame "child" and "parent" belong to.
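Here is a minimal sketch of that fit. The data frame below is a simulated stand-in, not the original heights data: the name heights, the sample size, and the numbers used to generate it are all assumptions made for illustration. Only the column names child and parent come from the formula above.

```r
# Simulated stand-in for the heights data (NOT the original dataset).
# The data frame name `heights` and all generating numbers are assumptions.
set.seed(42)
heights <- data.frame(parent = rnorm(500, mean = 68, sd = 1.8))
heights$child <- 24 + 0.65 * heights$parent + rnorm(500, sd = 2.2)

# The formula says "predict child from parent"; `data` tells lm where
# to look up those two column names.
fit <- lm(child ~ parent, data = heights)

# Intercept and slope of the fitted line.
coef(fit)
```

Printing fit on its own shows the call and the coefficients, which is a quick way to confirm the formula was read the way you intended.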
NOTE: Formulas in R take the form (y ~ x). To add more predictor variables, just use the + sign, i.e. (y ~ x + z).
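To make the formula syntax concrete, here is a small sketch with made-up data. The data frame d and its columns x, z, and y are placeholders that mirror the names in the note, not anything from the heights example.

```r
# Made-up data: x, z, and y are placeholder names matching the note above.
set.seed(1)
d <- data.frame(x = rnorm(100), z = rnorm(100))
d$y <- 1 + 2 * d$x - 3 * d$z + rnorm(100, sd = 0.5)

one_predictor  <- lm(y ~ x, data = d)      # y explained by x alone
two_predictors <- lm(y ~ x + z, data = d)  # y explained by x and z

names(coef(one_predictor))   # "(Intercept)" "x"
names(coef(two_predictors))  # "(Intercept)" "x" "z"
```

Each term added with + gets its own coefficient, and the intercept is included automatically.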
We fit a model to our data. That's great! But the important question is, is it any good?
There are lots of ways to evaluate model fit. R consolidates some of the most popular ones into the summary function. You can invoke summary on any model you've fit with lm and get some metrics indicating the quality of the fit.
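As a rough sketch, this is what calling summary looks like, along with a few of the pieces you can pull out of the result. The data is again a simulated stand-in (the name heights and the generating numbers are assumptions), so the figures it prints will not match the original heights data.

```r
# Simulated stand-in for the heights data (NOT the original dataset).
set.seed(42)
heights <- data.frame(parent = rnorm(500, mean = 68, sd = 1.8))
heights$child <- 24 + 0.65 * heights$parent + rnorm(500, sd = 2.2)
fit <- lm(child ~ parent, data = heights)

s <- summary(fit)
print(s)          # the full report: residuals, coefficients, R-squared, F-test

# The same numbers are available programmatically:
s$coefficients    # matrix with Estimate, Std. Error, t value, Pr(>|t|)
s$sigma           # residual standard error
s$r.squared       # R-squared
s$fstatistic      # F value plus its two degrees of freedom
```

Pulling values out of the summary object is handy when you want to compare several models without reading each printout by eye.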
So if you're like I was at first, your reaction was probably something like "Whoa this is cool...what does it mean?"
Interpreting the output
|1||Residuals||The residuals are the difference between the actual values of the variable you're predicting and the predicted values from your regression. Think of it like a dartboard. A good model is going to hit the bullseye some of the time (but not every time). When it doesn't hit the bullseye, it misses evenly across all of the other segments (i.e. not just in the 16 segment), and it misses closer to the bullseye rather than out on the edges of the dartboard.|
|2||Significance Stars||The stars are shorthand for significance levels, with the number of asterisks displayed according to the computed p-value. More asterisks mean a smaller p-value.|
|3||Estimated Coefficient||The estimated coefficient is the value of the slope calculated by the regression. It might seem a little confusing that the Intercept also has a value, but just think of it as a slope that is always multiplied by 1. This number will obviously vary based on the magnitude of the variable you're inputting into the regression, but it's always good to spot check this number to make sure it seems reasonable.|
|4||Standard Error of the Coefficient Estimate||Measure of the variability in the estimate for the coefficient. Lower means better, but this number is relative to the value of the coefficient. As a rule of thumb, you'd like this value to be at least an order of magnitude less than the coefficient estimate. In our example, the std error of the parent variable is 0.04, which is about 16x smaller than the estimate of the coefficient (roughly 1.2 orders of magnitude smaller).|
|5||t-value of the Coefficient Estimate||Score that measures whether or not the coefficient for this variable is meaningful for the model. It is the coefficient estimate divided by its standard error. You probably won't use this value itself, but know that it is used to calculate the p-value and the significance levels.|
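|6||Variable p-value||Probability of seeing a t-value at least this extreme if the variable's true coefficient were zero, i.e. if the variable were NOT relevant. You want this number to be as small as possible. If the number is really small, R will display it in scientific notation.|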
|7||Significance Legend||The more punctuation there is next to your variables, the smaller the p-value and the better. Blank=bad, Dots=pretty good, Stars=good, More Stars=very good|
|8||Residual Std Error / Degrees of Freedom||The Residual Std Error is essentially the standard deviation of your residuals. You'd like this number to be proportional to the quartiles of the residuals in #1. For a normal distribution, the 1st and 3rd quartiles should fall at about +/- 0.67 times the std error. The Degrees of Freedom is the difference between the number of observations included in your training sample and the number of variables used in your model (intercept counts as a variable).|
|9||R-squared||Metric for evaluating the goodness of fit of your model. Higher is better, with 1 being the best. It corresponds to the share of the variability in what you're predicting that is explained by the model. In this instance, ~21% of the variation in children's heights is explained by the height of their parents. WARNING: While a high R-squared indicates good correlation, correlation does not imply causation.|
|10||F-statistic & resulting p-value||Performs an F-test on the model. This takes the parameters of our model (in our case we only have 1) and compares it to a model that has fewer parameters, namely one with only an intercept. In theory the model with more parameters should fit better. If the model with more parameters (your model) doesn't perform better than the model with fewer parameters, the F-test will have a high p-value (the improvement is probably NOT significant). If the model with more parameters is better than the model with fewer parameters, you will have a lower p-value. The F-statistic is reported with two DF, or degrees of freedom, values. The first pertains to how many variables are in the model; in our case there is one variable, so it is one. The second is the residual degrees of freedom from #8.|
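If you want to convince yourself where these numbers come from, the sketch below recomputes several rows of the table by hand and checks them against summary. It runs on simulated stand-in data (the name heights and the generating numbers are assumptions), so treat it as an illustration of the arithmetic rather than a reproduction of the original output.

```r
# Simulated stand-in for the heights data (NOT the original dataset).
set.seed(42)
heights <- data.frame(parent = rnorm(500, mean = 68, sd = 1.8))
heights$child <- 24 + 0.65 * heights$parent + rnorm(500, sd = 2.2)
fit <- lm(child ~ parent, data = heights)
s <- summary(fit)

# Row 1: residuals are actual values minus predicted values.
res <- heights$child - fitted(fit)

# Row 8: degrees of freedom = observations - coefficients (intercept counts),
# and the residual std error is sqrt(sum of squared residuals / df).
df_resid <- nrow(heights) - length(coef(fit))
rse <- sqrt(sum(res^2) / df_resid)

# For a normal distribution the quartiles sit at about +/- 0.67 sd.
qnorm(c(0.25, 0.75))

# Rows 5 and 6: t-value = estimate / std error, then a two-sided p-value.
t_val <- s$coefficients[, "Estimate"] / s$coefficients[, "Std. Error"]
p_val <- 2 * pt(abs(t_val), df = df_resid, lower.tail = FALSE)

# Row 9: R-squared = 1 - (residual sum of squares / total sum of squares).
r2 <- 1 - sum(res^2) / sum((heights$child - mean(heights$child))^2)

# Row 10: F-test of our model against an intercept-only model.
null_fit <- lm(child ~ 1, data = heights)
f_test <- anova(null_fit, fit)

# Each hand-computed value should agree with what summary reports.
all.equal(rse, s$sigma)
all.equal(r2, s$r.squared)
all.equal(p_val, s$coefficients[, "Pr(>|t|)"])
all.equal(f_test$F[2], s$fstatistic[["value"]])
```

With a single predictor the F-statistic is just the square of that predictor's t-value, which is why the two p-values match in a model like this one.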
People often wonder how they can include categorical variables in their regression models. With R this is extremely easy. Just include the categorical variable in your regression formula and R will take care of the rest. R calls categorical variables factors. Each factor has a set of levels, or possible values. These levels will show up as variables in the model summary.
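A minimal sketch of a regression with a factor is below. Everything in it is hypothetical: the data frame countries, the columns income and life_exp, the list of continents, and the simulated values are invented for illustration and are not the dataset behind the model discussed here.

```r
# Hypothetical data: the names and numbers below are made up for illustration.
set.seed(7)
continents <- c("Africa", "Americas", "Asia", "Europe")
countries <- data.frame(
  continent = factor(sample(continents, 200, replace = TRUE), levels = continents),
  income    = runif(200, min = 1, max = 50)
)
shift <- c(Africa = 0, Americas = 8, Asia = 6, Europe = 12)
countries$life_exp <- 55 + 0.3 * countries$income +
  unname(shift[as.character(countries$continent)]) + rnorm(200, sd = 2)

# A factor stores a fixed set of levels.
levels(countries$continent)

# Put the factor in the formula like any other variable.
fit_cat <- lm(life_exp ~ income + continent, data = countries)

# Each level (other than the first) gets its own coefficient.
names(coef(fit_cat))
# "(Intercept)" "income" "continentAmericas" "continentAsia" "continentEurope"
```

If a categorical column is stored as plain text, lm converts it to a factor for you, though creating the factor yourself gives you control over the order of the levels.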
Dummy Variable Trap
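One very important thing to note is that one of your levels will not appear in the output. This is because when fitting a regression with a categorical variable, one level must be left out: a dummy variable for every level would add up to exactly the intercept column, so the coefficients could not be estimated separately. Including all of them is often referred to as the dummy variable trap. In our model, Africa is left out of the summary but it is still accounted for in the model as the baseline level that the intercept represents, and the other levels' coefficients are measured relative to it.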
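The sketch below shows the dropped level in action, again on hypothetical data (the data frame d, the outcome y, and the continent list are made up for illustration). It also shows relevel, which is one way to choose a different baseline.

```r
# Hypothetical data: names and numbers are made up for illustration.
set.seed(7)
continents <- c("Africa", "Americas", "Asia", "Europe")
d <- data.frame(continent = factor(rep(continents, each = 25), levels = continents))
base <- c(Africa = 50, Americas = 60, Asia = 58, Europe = 65)
d$y <- unname(base[as.character(d$continent)]) + rnorm(100)

fit <- lm(y ~ continent, data = d)

# The first level (Africa) has no column of its own in the design matrix.
colnames(model.matrix(fit))
# "(Intercept)" "continentAmericas" "continentAsia" "continentEurope"

# It lives in the intercept: here the intercept is the mean of the Africa rows.
all.equal(coef(fit)[["(Intercept)"]], mean(d$y[d$continent == "Africa"]))

# The trap: an intercept plus a dummy for every level is linearly dependent,
# so this 5-column matrix only has rank 4.
X_full <- cbind(1, sapply(continents, function(lv) as.numeric(d$continent == lv)))
qr(X_full)$rank  # 4

# Pick a different baseline level with relevel.
d$continent <- relevel(d$continent, ref = "Europe")
names(coef(lm(y ~ continent, data = d)))
# "(Intercept)" "continentAfrica" "continentAmericas" "continentAsia"
```

Changing the baseline does not change the model's predictions, only which level the other coefficients are compared against.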
It's often trickier to spot a bad model than to pick out a good one. Be sure to rigorously evaluate models--don't just take the easy way out and spot check the R-squared value!
R provides you with tons of different ways to check your models. For more information check out these resources: