Compute effect sizes with R. A primer.

A typical “cookbook recipe” for doing data analysis in an applied stats course is:

  1. report descriptive statistics
  2. plot some nice diagrams
  3. test hypotheses
  4. report effect sizes

Let’s take a quick look at these steps, using the flights dataset from the package nycflights13.

data(flights, package = "nycflights13")

This post will be tidyverse-driven.

library(tidyverse)
library(skimr)
library(mosaic)

Let’s compute some summaries:

flights %>% 
  select(arr_delay) %>% 
  skim
Data summary
Name Piped data
Number of rows 336776
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
arr_delay 9430 0.97 6.9 44.63 -86 -17 -5 14 1272 ▇▁▁▁▁

Alternatively, using mosaic:

mosaic::favstats(~arr_delay, data = flights)
 min  Q1 median  Q3      max mean   sd      n missing
 -86 -17     -5  14 1.27e+03  6.9 44.6 327346    9430

Subgroup statistics

Differentiating between origin levels:

flights %>% 
  select(arr_delay, origin) %>%
  group_by(origin) %>% 
  skim
Data summary
Name Piped data
Number of rows 336776
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables origin

Variable type: numeric

skim_variable origin n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
arr_delay EWR 3708 0.97 9.11 45.53 -86 -16 -4 16 1109 ▇▁▁▁▁
arr_delay JFK 2200 0.98 5.55 44.28 -79 -18 -6 13 1272 ▇▁▁▁▁
arr_delay LGA 3522 0.97 5.78 43.86 -68 -17 -5 12 915 ▇▁▁▁▁

Alternatively, using mosaic:

favstats(arr_delay~origin, data = flights)
origin  min  Q1 median  Q3      max mean   sd      n missing
EWR     -86 -16     -4  16 1.11e+03 9.11 45.5 117127    3708
JFK     -79 -18     -6  13 1.27e+03 5.55 44.3 109079    2200
LGA     -68 -17     -5  12      915 5.78 43.9 101140    3522

Effect sizes

Cohen’s d

library(effsize)

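Cohen’s d compares two groups, not three, so we drop one origin (JFK) and draw a random sample:

flights2 <-
  filter(flights, origin != "JFK") %>%  # keep EWR and LGA only
  sample_n(1000) %>%                    # random sample, so results vary between runs
  na.omit()                             # drop rows with missing values (may leave fewer than 1000 rows)
cohen.d(d = flights2$arr_delay,
        f = factor(flights2$origin))    # grouping variable as a two-level factor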
#> 
#> Cohen's d
#> 
#> d estimate: 0.223211 (small)
#> 95 percent confidence interval:
#>     lower     upper 
#> 0.0961037 0.3503182
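
To make the estimate above less of a black box, here is a minimal sketch of Cohen’s d computed by hand in base R, using the pooled standard deviation of the two groups. As far as I know this matches the default definition in effsize, but treat that as an assumption and check the package documentation. The helper name cohens_d and the two toy vectors are made up for illustration and are not taken from flights.

```r
# Hand-computed Cohen's d with a pooled standard deviation (base R only).
# `cohens_d` is a hypothetical helper; the vectors below are toy values.
cohens_d <- function(x, y) {
  x <- x[!is.na(x)]  # drop missing values in each group
  y <- y[!is.na(y)]
  n_x <- length(x)
  n_y <- length(y)
  # Pooled variance: the two sample variances weighted by their degrees of freedom
  pooled_var <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2)
  # Standardized mean difference: first group minus second group
  (mean(x) - mean(y)) / sqrt(pooled_var)
}

group_a <- c(2, 4, 6, 8, 10)  # mean 6, variance 10
group_b <- c(1, 3, 5, 7, 9)   # mean 5, variance 10
d_manual <- cohens_d(group_a, group_b)
d_manual  # 1 / sqrt(10), about 0.316
```

The same function could be applied to the arrival delays of the two origins, for example by splitting flights2$arr_delay by flights2$origin.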

Plot mean difference

ggplot(flights2) +
  aes(x = origin, y = arr_delay) +
  geom_point(color = "grey80", position = "jitter") +
  stat_summary(fun = mean, geom = "point", color = "red", size = 5)

Other effect sizes

Other effect sizes can be computed conveniently with the package compute.es.
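
As a rough sketch of how that might look: compute.es offers mes(), which takes means, standard deviations, and group sizes and reports several effect sizes at once (d, g, r, and others). The argument names below are written from memory of the package documentation, so check ?mes before relying on them. The inputs are the full-data subgroup summaries for EWR and LGA reported earlier, so the result need not match the estimate from the 1000-row sample.

```r
# Sketch only: assumes compute.es is installed and that mes() accepts
# these argument names (see ?compute.es::mes).
library(compute.es)

es <- mes(m.1 = 9.11, m.2 = 5.78,      # mean arrival delay: EWR, LGA
          sd.1 = 45.53, sd.2 = 43.86,  # standard deviations: EWR, LGA
          n.1 = 117127, n.2 = 101140)  # non-missing observations: EWR, LGA

# mes() prints a summary; the returned object should hold the same
# estimates in columns such as es$d (assumed column name)
es$d
```

Working from summary statistics like this is handy when only a published table of means and standard deviations is available rather than the raw data.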