Why standard regression is not (so) adequate for regressing proportions

Intro

Professor Sweet is conducting research to investigate the risk factors and drivers of student exam success. In a recent analysis, he considers the variable “exam successfully passed” (vs. not passed) as the criterion (outcome) and the amount of time spent on preparation (aka study time) as the predictor.

Setup

Please make sure that all packages are installed before proceeding. All packages except pradadata are on CRAN. [Here’s](https://github.com/sebastiansauer/pradadata) the installation guide for pradadata.

# devtools::install_github("sebastiansauer/pradadata")  # install this package if not yet installed
data(stats_test, package = "pradadata")

library(tidyverse)
library(ggpubr)
library(glue)
library(ggExtra)
library(broom)

Let’s change some output formatting of ggplot2 (see this post for details):

options(ggplot2.continuous.colour = "viridis")
options(ggplot2.continuous.fill = "viridis")
scale_colour_discrete <- scale_colour_viridis_d
scale_fill_discrete <- scale_fill_viridis_d

theme_set(theme_pubr())

Glimpse

First, Professor Sweet checks the univariate distribution of each variable of interest:

stats_test <- stats_test %>% 
  mutate_if(is.integer, as.numeric)

glimpse(stats_test)
#> Observations: 1,646
#> Variables: 7
#> $ date_time  <chr> "08.12.2015 07:34:29", "08.12.2015 12:46:49", "08.1...
#> $ study_time <dbl> 3, 4, 3, 4, 2, 2, 4, 1, 3, 3, 2, 4, 3, 3, 3, 4, 4, ...
#> $ self_eval  <dbl> 7, 8, 4, 7, 6, 6, 6, 7, 8, 7, 8, 8, 7, 7, 4, 6, 8, ...
#> $ interest   <dbl> 4, 5, 1, 5, 3, 4, 3, 4, 5, 5, 3, 5, 4, 4, 1, 4, 5, ...
#> $ score      <dbl> 0.7500000, 0.8250000, 0.7894737, 0.8250000, 0.80000...
#> $ row_number <dbl> 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 33, 34,...
#> $ bestanden  <chr> "ja", "ja", "ja", "ja", "ja", "ja", "ja", "ja", "ja...

Add some noise

Let’s add some noise to study time in order to mimic measurement error (and to smooth out the discrete scale):

stats_test <- stats_test %>% 
  mutate(study_time_noise = study_time + rnorm(n = nrow(.), 0, .5))

# one histogram per variable:
p1 <- stats_test %>% 
  select(study_time_noise, score) %>% 
  map2(., names(.), ~ggplot(data = data_frame(.x), aes(x = .x)) + 
         labs(x = .y,
              title = glue("Variable: {.y}")) + 
         geom_histogram())

print(p1)
#> $study_time_noise
#> 
#> $score

Bivariate association

p2 <- ggplot(stats_test) +
  aes(x = study_time_noise, y = score) +
  geom_jitter(alpha = .5) +
  geom_smooth() +
  labs(title = "Bivariate association of study time \nand score")

p3 <- ggMarginal(p2)
p3

Regression

Now, Professor Sweet computes a simple regression. Curious, he sits and watches what the software spits out. First, a visual sketch:

stats_test %>% 
  ggplot() +
  aes(y = score, x = study_time_noise) +
  geom_jitter(alpha = .5) +
  geom_smooth(method = "lm")

Then he computes the actual regression:

lm1 <- lm(score ~ study_time_noise, data = stats_test)
  
tidy(lm1)  
#> # A tibble: 2 x 5
#>   term             estimate std.error statistic  p.value
#>   <chr>               <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)        0.628    0.00827      75.9 0.      
#> 2 study_time_noise   0.0493   0.00267      18.5 1.48e-69
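From these coefficients, a predicted score can be computed by hand as intercept plus slope times study time. A quick sketch of that arithmetic (coefficient values as in the output above; the exact numbers vary with the random noise):

# predicted score at study_time_noise = 10:
# roughly 0.628 + 0.0493 * 10 = 1.12, above the maximum possible score of 1
coef(lm1)[1] + coef(lm1)[2] * 10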

Compute the predicted values:

stats_test <- stats_test %>% 
  mutate(pred_lm1 = predict(lm1, newdata = stats_test))

Visual sketch:

p4 <- stats_test %>% 
  ggplot() +
  aes(x = study_time_noise) +
  geom_jitter(aes(y = score), alpha = .5) +
  geom_line(aes(y = pred_lm1), color = "red", size = 5) +
  labs(title = "Regression of exam score \nby study time")

p4

But what would happen if study time were, say, 10?

data_new <- data_frame(
  study_time_noise = 0:10,
  score = predict(lm1, newdata = data.frame(study_time_noise = 0:10))
)
p5 <- ggplot(stats_test) +
  aes(x = study_time_noise, y = score) +
  geom_jitter(alpha = .5) +
  geom_line(data = data_new, color = "red", size = 3) +
  scale_x_continuous(limits = c(0,10)) +
  scale_y_continuous(breaks = seq(0,1,.25)) +
  labs(title = "Regression with high values \nof study time")


p5

Pause

Now the exhausted professor sits and pauses in an attempt to take in what has been achieved.

ggarrange(p3, p4, p5, labels = c("A", "B", "C"))

This approach may yield implausible score values, i.e., values above 1 (or below 0). Clearly, a different approach is needed.
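With data_new from above, it is easy to see where the linear predictions leave the plausible range (with the coefficients reported above, study times of roughly 8 and higher already push the prediction past 1):

data_new %>% 
  filter(score > 1)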

Sigmoid curve

A simple alternative would be to replace the straight line with a sigmoid curve, as in the following.

In essence, that’s the formula: \[p(X=1) = \frac{e^x}{1+e^x}\]
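Solving for x shows the flip side of this model: on the log-odds (logit) scale, the model is linear. This is the default link that glm() will use below: \[x = \log\left(\frac{p}{1-p}\right)\]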

This is what the sigmoid curve looks like:


logist <- function(x){
  exp(x) / (1 + exp(x))
}

ggplot(data_frame(x = c(-5, 5))) +
  aes(x) +
  stat_function(fun = logist)
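A quick numeric check (illustrative input values, not from the original analysis): the function squeezes any real number into the interval (0, 1).

logist(c(-5, 0, 5))
# approximately 0.007 0.500 0.993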

The glm() function fits such a sigmoid curve to the data points:

glm1 <- glm(score ~ study_time_noise, data = stats_test, family = "binomial")
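Note that score is a proportion, not a 0/1 outcome, so glm() with family = "binomial" will warn about non-integer successes. The fitted curve is unaffected; if the warning bothers you, family = "quasibinomial" yields the same coefficient estimates without it (a sketch, not part of the original analysis):

glm1a <- glm(score ~ study_time_noise, data = stats_test, 
             family = "quasibinomial")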

Add the corresponding predicted values:

stats_test <-  stats_test %>% 
  mutate(pred_glm1 = predict(glm1, newdata = stats_test, type = "response"))
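As a quick check (not part of the original analysis): the sigmoid predictions are guaranteed to stay within the unit interval.

range(stats_test$pred_glm1)
# both values lie strictly between 0 and 1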

Plot the predicted values:

df_glm <- data_frame(
  study_time_noise = 0:10,
  pred_glm1 = predict(glm1, 
                      newdata = data.frame(study_time_noise = 0:10),
                      type = "response")
)
p6 <- ggplot(stats_test) +
  aes(x = study_time_noise, y = score) +
  geom_jitter(alpha = .5) +
  labs(title = "Sigmoid regression line") + 
  geom_line(data = df_glm, aes(y = pred_glm1), 
               color = "red", size = 3)

p6
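Unlike the linear model, the sigmoid curve cannot cross 1, even at a study time of 10 (a quick check, using df_glm as defined above):

df_glm %>% 
  filter(study_time_noise == 10)
# the predicted value approaches, but never reaches, 1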

Pause again

ggarrange(p3, p4, p5, p6, labels = c("A", "B", "C", "D"))