Distribution of residuals is of interest for linear models, not the distribution of y

Load packages

library(tidyverse)
library(e1071)

My \(y\) is not distributed according to my wishes!

Let \(Y\) be a variable that we would like to model, for instance, Covid-19 cases.

Now, there’s a widely hold belief that my \(Y\) must be distributed normally, or, in some cases, following some other assumed distribution (maybe some long-tailed distribution).

However, this belief is not (strictly) true. What a linear model assumes is that the residuals are distributed normally, not the \(Y\) distribution. While this is often left implicit, it is helpful to spell out a model explicitly.

Here’s a simple example from Gelmann and Hill, 2007, p. 46.

Consider a regression with the predictor \(X\), which takes on the values 0, 1, and 2, equally distributed:

\[ X \sim \mathcal{U}_D(a=1, b=3) \]

where \(\mathcal{U}_D\) referes to a discrete uniform distribution with minimum \(a\) and maximum \(b\).

Now, let the true association of \(X\) and \(Y\) amount to:

\[ y_i = 0.2 + 0.5x_i \]

Having specified this, we have completely specified this data generating process.

For modelling purposes, let’s add some noise:

\[ \hat{y_i} = 0.2 + 0.5x_i + e_i \\ e_i \sim \mathcal{N}(0, \, 0.1) \]

Simulate some data

Let’s draw a sample of \(n=1000\).

n <- 1000

set.seed(42)
X <- sample(x = c(1,2,3), size = n, replace = T)
e <- rnorm(n, mean = 0, sd = 0.1)

Compute \(Y\) and \(\hat{Y}\):

Y <- 0.2 + 0.5*X

Y_hat <- 0.02 + 0.5*X + e

And put it all in a data frame:

X <- tibble(
  X = X,
  Y = Y,
  Y_hat = Y_hat,
  e = e,
)

Plot the data

X %>%
  ggplot(aes(x = X)) +
  geom_histogram() +
  scale_x_continuous(limits = c(0,3))

X %>%
  ggplot(aes(x = Y)) +
  geom_histogram() +
  scale_x_continuous(limits = c(0,3))

X %>%
  ggplot(aes(x = Y_hat)) +
  geom_histogram() +
  scale_x_continuous(limits = c(0,3))

X %>%
  ggplot(aes(x = e)) +
  geom_histogram() +
  scale_x_continuous(limits = c(-3,3))

And interstingly, \(e_i|X_i\):

X %>%
  ggplot(aes(x = e, group = X, 
             color = factor(X))) +
  geom_density()

Compare that to \(e\), the unconditional (marginal) error distribution:

X %>%
  ggplot(aes(x = e)) +
  geom_density()

Quite similar, but: do we have thicker tails in \(e\) compared to \(e|X\)? We have effectively mixed three distributions,so we might expect fatter tails.

e1071::kurtosis(X$X)
#> [1] -1.439171

Recall that the kurtosis of the normal distribution is 3.

As a side note (same source as above):

It is also common practice to use an adjusted version of Pearson’s kurtosis, the excess kurtosis, which is the kurtosis minus 3, to provide the comparison to the normal distribution.

Some authors use “kurtosis” by itself to refer to the excess kurtosis. For clarity and generality, however, this article follows the non-excess convention and explicitly indicates where excess kurtosis is meant.

The function kurtosis provides the standard form of kurtosis, where normal distributed variables have a value of 3.

Compare that to the conditional kurtosis \(Y|X\):

X %>% 
  group_by(X) %>% 
  summarise(kurtY = kurtosis(e))
#> # A tibble: 3 x 2
#>       X   kurtY
#>   <dbl>   <dbl>
#> 1     1 -0.615 
#> 2     2 -0.0636
#> 3     3 -0.319

There does not appear to be any difference in the kurtosis between \(e\) and \(e|X\).

Conclusion 1

The distribution of \(Y\) is not of importance when it comes to the assumptions a linear models. However, the distribution of residuals (“errors”) is of some importance.

Conclusion 2: Fuhgetaboutit, says Gelman

However, Gelman (same source as above) tells us that

assumption of normality is barely important at all.

(p. 46)

As a consequence

we do not recommend diagnostics of the normality of regression residuals.

(italics in original)

Debrief

The distribution of \(Y\) does not need to be similar to the distribution of the modelling residuals. There’s no need to check the \(Y\) distribution, at least not for modelling purposes (for regression type models).