2 min read

Most important asssumption in linear models ... and the second most

Load packages

library(tidyverse)
library(mosaic)

We are following here the advise of Gelman and Hill (2007).

Validity

Quite obviously, the right predictors must be included in the model in order to learn something from the model. The “right” predictors means: avoiding the wrong ones, and including the correct ones. Easier said than done, particularly with a look to the causal inference aspects. Let’s turn to the next most important assumption.

Additivity and linearity as the second most important assumptions in linear models

We assume that \(y\) is a linear function of the predictors. If y is not a linear function of the predictors, we cannot expect the model to deliver correct insights (predictions, causal coefficients). Let’s check an example.

Simulate an exponentially distributed assocation

We will only use positive values, let’s add some small value to the y’s in order they are all positive.

len <- 42  # 42 x values
set.seed(42)
x <- rep(runif(len), 30)  # each x value repeated 30 times
y <- dexp(x, rate = 5) + rnorm(length(x), mean = 0, sd = .1)  # add some noise

y <- y + 1
gf_point(y ~ x)

Linear model

Compute the Model as linear model (not transformed)

lm1 <- lm(y ~ x)
coef(lm1)
#> (Intercept)           x 
#>    3.797730   -3.458985

Plot it

gf_point(y ~ x) %>% 
  gf_abline(slope = coef(lm1)[2], intercept = coef(lm1)[1], color = "red")

Doesn’t fit well.

Check the linearity assumption

Actually the plot above vividly shows us that the linear model is not doing a good job. But anyhow, let’s put that in a form more typically used for checking model assumptions.

gf_point(lm1$residuals ~ fitted(lm1)) %>% 
  gf_smooth() %>% 
  gf_lims(y = c(-3, 3))

If we had a linear trend, we should see no strong pattern whatsoever. The mean residual should jump around zero at each value of the x-axis, with the variance remaining (more or less) constant. This is not the case here.

Second model: log-tranformed in y

gf_point(log(y) ~ x)

lm2 <- lm(log(y) ~ x)
coef(lm2)
#> (Intercept)           x 
#>    1.331698   -1.556101
gf_point(log(y) ~ x) %>% 
  gf_abline(slope = coef(lm2)[2], intercept = coef(lm2)[1], color = "red")

Quite ok. Note that the discrepancies stem completely from the fact that we added some Gaussian noise.

gf_point(resid(lm2) ~ fitted(lm2)) %>% 
  gf_smooth() %>% 
  gf_lims(y = c(-3, 3))

Much better than the first model.