Why "n-1" in empirical variance? A simulation.

It is well-known that the empirical variance underestimates the population variance. Specifically, the empirical variance is defined as: \(var_{emp} = \frac{\sum_i (x_i - \bar{x})^2}{n-1}\). But why \(n-1\), why not just \(n\), as intuition (of some) dictates? Put shortly, as the variance of a sample tends to underestimate the population variance we have to inflate it artificially, to enlarge it, that’s why we do put a smaller number (the “n-1”) in the denominator, resulting in a larger value of the whole fraction. This larger value is called the empirical variance, it estimates the “real” population variance well.

In this post, we will not prove these ideas, but we will test it empirically. That is, we will let R draw samples and check whether the variance of the samples is slightly smaller than the variance in the population.

Load some packages:

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4     
## ✔ tibble  1.4.2          ✔ dplyr   0.7.4     
## ✔ tidyr   0.8.0          ✔ stringr 1.3.0     
## ✔ readr   1.1.1          ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract

Then we’ll define a function that computes the empirical variance var_emp from a sample taken of a standard normal distribution.

var_emp <- function(n = 10) {
  
  rnorm(n = n, mean = 0, sd = 1) %>% 
    var()
}

# test it:
set.seed(42)
var_emp()
## [1] 0.6979747

By the way, to get the variance (or sd) of a sample as a description of the sample, we can use this computation:

n <- 10  # sample size
n_max <- 200  # max sample size to consider later on


var_sample <- function(n = 10) {
  
  rnorm(n = n, mean = 0, sd = 1) %>% 
      var(.) %>% 
     `*`((n-1)/n)
}

set.seed(42)
var_sample()
## [1] 0.6281773

The “correction factor” of \((n-1)/n\) amounts to a factor of .9 in the case of \(n=10\), ie., \((10-1)/10\).

But one sample is not sufficient to gauge an overall picture. Rather let’s draw many samples, say \(m=1000\), and compute the empirical variance and the sample variance each time.

m <- 1000
df_variance <- data_frame(id = 1:m)

df_variance %<>% 
  mutate(var_emp_vec = replicate(n = m, var_emp()),
         var_sample_vec = replicate(n = m, 
                                    var_sample())) %>% 
  gather(key = variance_type, value = variance, -id)

See here the distributions:

df_variance %>% 
  ggplot() +
  aes(x = variance, fill = variance_type) +
  geom_density(alpha = .5)

The sample variance distribution leans towards slightly smaller values, it appears.

What’s the central tendency in each case?

df_variance %>% 
  group_by(variance_type) %>% 
  summarise(mean_variance = mean(variance))
## # A tibble: 2 x 2
##   variance_type  mean_variance
##   <chr>                  <dbl>
## 1 var_emp_vec            1.01 
## 2 var_sample_vec         0.916

What about the quantiles of this sampling distribution?

df_variance %>% 
  split(.$variance_type) %>% 
  map(~quantile(.$variance, probs = c(.055, .5, .945)))
## $var_emp_vec
##      5.5%       50%     94.5% 
## 0.3597041 0.9429192 1.8472943 
## 
## $var_sample_vec
##      5.5%       50%     94.5% 
## 0.3542062 0.8542238 1.6649809

Our results show that the variance of the sample is smaller than the empirical variance; however even the empirical variance too is a little too small compared with the population variance (which is 1). Note that sample size was \(n=10\) in each draw of the simulation. With sample size increasing, both should get closer to the “real” (population) sample size (although the bias is negligible for the empirical variance). Let’s check that.

First let’s define a function which does the simulation for a given sample size \(n\) and a given number of samples, \(m\).

simu_mean <- function(n = 10, m = 1000, correction = 1){
  replicate(n = m, var_emp(n = n) * correction) %>% 
  mean()
}

This function draws a sample of size \(n\), computes the (empirical) variance, and multiply the result with a correction factor if needed (to estimate the sample variance). This procedure is repeated for \(m\) times. Finally, all \(m\) variance values are averaged:

set.seed(42)
simu_mean()
## [1] 1.006739

Now we let this function run for different samples sizes:

sizes <- 2:n_max


sizes %>% 
  map_dbl(~simu_mean(n = .)) %>% 
  as_tibble -> simu_df

simu_df %<>%
  mutate(sample_size = sizes)

names(simu_df)[1] <- "variance_empirical"

Finally, let’s plot the resulting curve:

simu_df %>% 
  ggplot +
  aes(x = sample_size, y = variance_empirical) +
  geom_line() +
  scale_y_continuous(limits = c(0.5, 1.5))

Even with small samples, the variance is close to the real variance in the population (1).

Compare this to the sample size curve when the variance is computing with \(n\) in the denominator (instead of \(n-1\)).

sizes <-2:n_max

sizes %>% 
  map_dbl(~simu_mean(n = ., correction = (. -1) / .)) %>% 
  as_tibble -> simu_df_sample

simu_df_sample %<>% 
  mutate(sample_size = sizes) 

names(simu_df_sample)[1] <-"variance_sample"
simu_df %>% 
  left_join(simu_df_sample, by = "sample_size") %>% 
  gather(key = variance_type, value = variance_value, -sample_size) %>% 
  ggplot() +
  aes(x = sample_size, y = variance_value, color = variance_type) +
  geom_line()

The sample variance underestimates the true value of the variance - at least for small samples. This bias diminishes when sample size increases.

Of interest, what is the variability of the mean variance for a given sample size? Let’s add these estimates of uncertainty.

This function computes the sd of the variances in the \(m\) samples (additionally to the mean values of the variances):

simu_mean_sd <- function(n = 10, m = 1000, correction = 1){
  simu_var_vec <- replicate(n = m, var_emp(n = n) * correction) 
  simu_var_mean <- mean(simu_var_vec)
  simu_var_sd <- sd(simu_var_vec)
  
  return(list(simu_var_mean = simu_var_mean,
              simu_var_sd = simu_var_sd))
}

Again, we run the function for different samples sizes:

sizes <- 2:n_max


df_variance_sds <- data_frame(sample_size = sizes)

df_variance_sds %<>% 
  rowwise() %>% 
  mutate(simu_mean_sd = simu_mean_sd(n = sample_size)[[2]])


df_variance_sds %<>%
  left_join(simu_df)
## Joining, by = "sample_size"

Again, let’s plot the resulting curve:

df_variance_sds %>% 
  ggplot +
  aes(x = sample_size, y = variance_empirical) +
  geom_ribbon(aes(ymin = variance_empirical - simu_mean_sd,
                  ymax = variance_empirical + simu_mean_sd),
              fill = "grey80") +
  geom_line() +
  scale_y_continuous(limits = c(0.5, 1.5))

Hey, the sd is still relatively huge. Would be interesting too know at which sample size the sd gets “small”, smaller than a desired width, \(w\). Say, we would like to know the sample size for \(w = 0.2\) which amounts to an sd of 0.1.

Let’s run our simulation a longer distance:

n_max <- 1000
sizes <- 2:n_max


df_variance_sds <- data_frame(sample_size = sizes)

df_variance_sds %<>% 
  rowwise() %>% 
  mutate(simu_mean_sd = simu_mean_sd(n = sample_size)[[2]],
         simu_mean_avg = simu_mean_sd(n = sample_size)[[1]])

Pooh! Needs some computation time.

How let’s glimpse at the course of the sd interval:

df_variance_sds %>% 
  ggplot +
  aes(x = sample_size, y = simu_mean_avg) +
  geom_ribbon(aes(ymin = simu_mean_avg - simu_mean_sd,
                  ymax = simu_mean_avg + simu_mean_sd),
              fill = "grey80") +
  geom_line() +
  scale_y_continuous(limits = c(0.5, 1.5))

What about the actual numbers. What are the smallest widths \(w\)?

df_variance_sds %<>% 
  mutate(width = 2 * simu_mean_sd) %>% 
  arrange(width) 

df_variance_sds
## # A tibble: 999 x 4
##    sample_size simu_mean_sd simu_mean_avg  width
##          <int>        <dbl>         <dbl>  <dbl>
##  1         972       0.0432         1.000 0.0864
##  2         968       0.0434         1.00  0.0867
##  3         992       0.0439         1.00  0.0877
##  4         981       0.0441         0.998 0.0882
##  5         996       0.0441         0.999 0.0882
##  6         938       0.0442         0.998 0.0885
##  7         995       0.0442         0.998 0.0885
##  8         982       0.0442         1.00  0.0885
##  9         984       0.0443         1.00  0.0885
## 10         976       0.0443         0.998 0.0886
## # ... with 989 more rows

Straightforward max and min values:

max(df_variance_sds$width)
## [1] 2.59451
min(df_variance_sds$width)
## [1] 0.08639865

And, the answer to the actual question asked, at what sample size falls \(w\) below 0.2?

quantile(df_variance_sds$width, probs = c(0.1, 0.5, 0.9)) 
##        10%        50%        90% 
## 0.09420279 0.12617887 0.28190605
df_variance_sds %>% 
  filter(width < 0.201 & width > 0.199) %>% 
  arrange(sample_size)
## # A tibble: 4 x 4
##   sample_size simu_mean_sd simu_mean_avg width
##         <int>        <dbl>         <dbl> <dbl>
## 1         190       0.100          0.997 0.200
## 2         191       0.100          0.996 0.200
## 3         204       0.0999         1.00  0.200
## 4         207       0.0995         0.997 0.199

So, at about \(n=200\), our sample size falls within the \(w=0.2\) corridor.

Same question for sd:

quantile(df_variance_sds$simu_mean_sd, probs = c(0.01, 0.1, 0.5, 0.9, .99)) 
##         1%        10%        50%        90%        99% 
## 0.04430252 0.04710139 0.06308943 0.14095303 0.43023127

Debrief

We have not really discussed “why” the sample variance is smaller than the “real” variance, that is why it is somewhat biased in small samples. But we have gone through some evidence that it is the case. One quite nice intuitive explanation “why” this happens can be found in the book of Chris Bishop.