How to standardize variables in R

1 Motivation

Running a regression in R yields unstandardized coefficients, not standardized ones. However, as spelled out by, e.g., Gelman and Hill (2007), standardizing values is advantageous in many situations. This post shows how to run a regression in R using standardized values as inputs (“standardized regression” for short, as some dub it).

The main advantage of standardizing input variables is that their importance becomes easier to compare. It can be seen as undesirable that the scaling (SD) of an input variable determines (in part) its regression coefficient. For instance, measuring the “power” of a car in horsepower or in kilowatts will strongly influence the value of the regression coefficient. Similarly, measuring the distance walked in kilometers or in millimeters will have a profound effect on the respective regression coefficient for, say, the amount of fat burned (in grams or in kilograms…).
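The unit dependence can be illustrated with a minimal sketch. It assumes that the built-in dataset datasets::mtcars holds the same values as the CSV file used later in this post, and it uses an approximate conversion factor of 0.7457 kW per (mechanical) horsepower; the names cars, hp_to_kw, fit_hp and fit_kw are made up for this illustration.

```r
# Built-in copy of the data (assumed to match the CSV used later in this post)
cars <- datasets::mtcars

# Same predictor in two units: 1 hp is roughly 0.7457 kW (approximate factor)
hp_to_kw <- 0.7457
cars$kw <- cars$hp * hp_to_kw

fit_hp <- lm(mpg ~ hp, data = cars)
fit_kw <- lm(mpg ~ kw, data = cars)

# The slope depends on the unit of measurement ...
b_hp <- unname(coef(fit_hp)["hp"])
b_kw <- unname(coef(fit_kw)["kw"])
c(hp = b_hp, kw = b_kw)

# ... but only by the conversion factor; the model fit is unchanged
all.equal(b_kw * hp_to_kw, b_hp)
all.equal(summary(fit_hp)$r.squared, summary(fit_kw)$r.squared)
```

Both calls to all.equal() should return TRUE: the two models describe the same relationship, and only the size of the coefficient differs.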

Hence, putting all input variables on the same scale makes it easier to compare the “importance” of each variable.

The most common way to standardize a variable \(X\) is to use the \(z\) transformation:

\[z_i = \frac{x_i - \mu}{sd_X}\]

where \(\mu\) is the mean and \(sd_X\) the standard deviation of \(X\).

2 Load packages

library(tidyverse)  # data wrangling
library(broom)  # tidy regression output
library(mosaic)  # standardizing variables

3 Some data

mtcars <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv")

4 Research question

Say we are interested in the association between horsepower (hp) and fuel consumption (mpg; miles per gallon): How much does fuel consumption differ between cars that differ in their horsepower?

5 Regression with unstandardized input variables

lm1 <- lm(mpg ~ hp, data = mtcars)

tidy(lm1)
term           estimate  std.error  statistic  p.value
(Intercept)  30.0988605  1.6339210  18.421246    0e+00
hp           -0.0682283  0.0101193  -6.742388    2e-07

As can be seen in the output, our model lm1 estimates that cars which differ by 1 hp differ by -0.07 miles per gallon, on average (and given that the model is true). That is, a car with 1 hp more than another car travels about 0.07 miles less per gallon.

6 Standardize input variables

mtcars_standardized <-
  mtcars %>%
  mutate(hp_s = scale(hp))  # z-transform hp; scale() returns a one-column matrix

As we see, scale() does the trick, that is, it performs the z transformation. For example:

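x <- c(0, 10, 20)
scale(x)
#>      [,1]
#> [1,]   -1
#> [2,]    0
#> [3,]    1
#> attr(,"scaled:center")
#> [1] 10
#> attr(,"scaled:scale")
#> [1] 10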

Let’s double check:

x_mean <- mean(x)
x_sd <- sd(x)

z <- (x - x_mean) / x_sd
z
#> [1] -1  0  1

It’s not so nice that scale() takes a vector as input, but hands back a matrix.

A similar function, zscore(), is provided by the package {mosaic}; this function gives back a vector, which is more convenient:

zscore(x)
#> [1] -1  0  1
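If loading {mosaic} just for this is not desired, a plain vector can also be obtained in base R. The following is only a sketch: z_std() is a hypothetical helper written for this post, not a function from base R or {mosaic}.

```r
# Hypothetical helper: z-transform a numeric vector and return a plain
# numeric vector instead of the one-column matrix that scale() returns
z_std <- function(x, na.rm = FALSE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

x <- c(0, 10, 20)
z_std(x)

# Dropping the matrix structure of scale() gives the same values
all.equal(z_std(x), as.numeric(scale(x)))

is.matrix(scale(x))  # TRUE
is.matrix(z_std(x))  # FALSE
```

Inside mutate(), either z_std(hp) or as.numeric(scale(hp)) would then store an ordinary numeric column.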

7 Regression with standardized input variables

lm2 <- lm(mpg ~ hp_s, data = mtcars_standardized)

tidy(lm2)
term          estimate  std.error  statistic  p.value
(Intercept)  20.090625  0.6828817  29.420359    0e+00
hp_s         -4.677926  0.6938085  -6.742388    2e-07
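The two sets of coefficients are tied together in a simple way: the slope for the standardized predictor is the unstandardized slope multiplied by the SD of hp, and the intercept becomes the mean of mpg, because the standardized predictor has a mean of zero. The sketch below checks this; it assumes that the built-in datasets::mtcars matches the CSV data, and the object names (cars, fit_raw, fit_std) are chosen for illustration only.

```r
# Built-in copy of the data (assumed to match the CSV used in this post)
cars <- datasets::mtcars
cars$hp_s <- as.numeric(scale(cars$hp))

fit_raw <- lm(mpg ~ hp,   data = cars)  # corresponds to lm1
fit_std <- lm(mpg ~ hp_s, data = cars)  # corresponds to lm2

b_raw <- unname(coef(fit_raw)["hp"])
b_std <- unname(coef(fit_std)["hp_s"])

# Standardized slope = unstandardized slope times the SD of hp
all.equal(b_std, b_raw * sd(cars$hp))

# hp_s has mean 0, so the intercept equals the mean of mpg
all.equal(unname(coef(fit_std)["(Intercept)"]), mean(cars$mpg))
```

Both comparisons should return TRUE.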

8 The models (lm1 and lm2) are identical

Have a look at the p-values and the model fit values of both models (lm1 and lm2) to reassure yourself that both models are indeed equivalent, as they should be:

glance(lm1)
r.squared  adj.r.squared  sigma     statistic  p.value  df  logLik     AIC       BIC       deviance  df.residual  nobs
0.6024373  0.5891853      3.862962  45.4598    2e-07    1   -87.61931  181.2386  185.6358  447.6743  30           32

glance(lm2)
r.squared  adj.r.squared  sigma     statistic  p.value  df  logLik     AIC       BIC       deviance  df.residual  nobs
0.6024373  0.5891853      3.862962  45.4598    2e-07    1   -87.61931  181.2386  185.6358  447.6743  30           32
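Instead of comparing the printed values by eye, the equivalence can also be checked programmatically. This is a sketch in base R under the assumption that datasets::mtcars matches the CSV data; fit_raw and fit_std are illustrative stand-ins for lm1 and lm2.

```r
# Built-in copy of the data (assumed to match the CSV used in this post)
cars <- datasets::mtcars
cars$hp_s <- as.numeric(scale(cars$hp))

fit_raw <- lm(mpg ~ hp,   data = cars)  # corresponds to lm1
fit_std <- lm(mpg ~ hp_s, data = cars)  # corresponds to lm2

# Same predictions and same residuals for every car
all.equal(fitted(fit_raw), fitted(fit_std))
all.equal(resid(fit_raw), resid(fit_std))

# Same R squared
all.equal(summary(fit_raw)$r.squared, summary(fit_std)$r.squared)

# Same t statistic for the slope, hence the same p-value
t_raw <- summary(fit_raw)$coefficients["hp",   "t value"]
t_std <- summary(fit_std)$coefficients["hp_s", "t value"]
all.equal(t_raw, t_std)
```

All four comparisons should return TRUE: standardizing the predictor is a linear rescaling, so it changes the coefficients but not the fit.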

9 Interpretation of a standardized regression coefficient

“According to our model, lm2, cars differ in their fuel consumption (measured in miles per gallon) such that cars with a 1 SD higher horsepower value travel, on average, approx. 4.7 miles less per gallon.”
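To make this interpretation concrete, one can compare the predicted values for a car with average horsepower and a car that lies 1 SD above the average. The following sketch assumes that datasets::mtcars matches the CSV data; cars, fit_std and new_cars are names made up for this example.

```r
# Built-in copy of the data (assumed to match the CSV used in this post)
cars <- datasets::mtcars
cars$hp_s <- as.numeric(scale(cars$hp))

fit_std <- lm(mpg ~ hp_s, data = cars)  # corresponds to lm2

# Two hypothetical cars: average hp (hp_s = 0) and 1 SD above average (hp_s = 1)
new_cars <- data.frame(hp_s = c(0, 1))
pred <- predict(fit_std, newdata = new_cars)
pred

# The difference between the two predictions is the standardized slope
all.equal(unname(diff(pred)), unname(coef(fit_std)["hp_s"]))

# What 1 SD means in the original unit (hp)
sd(cars$hp)
```

In this dataset, 1 SD corresponds to roughly 69 hp, so the standardized coefficient describes the expected difference in mpg between cars that are about 69 hp apart.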

10 Reproducibility

#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.2 (2020-06-22)
#>  os       macOS  10.16                
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Europe/Berlin               
#>  date     2021-02-26                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package     * version     date       lib source                             
#>  assertthat    0.2.1       2019-03-21 [1] CRAN (R 4.0.0)                     
#>  backports     1.2.1       2020-12-09 [1] CRAN (R 4.0.2)                     
#>  blogdown      1.1         2021-01-19 [1] CRAN (R 4.0.2)                     
#>  bookdown      0.21.6      2021-02-02 [1] Github (rstudio/bookdown@6c7346a)  
#>  broom       * 0.7.5       2021-02-19 [1] CRAN (R 4.0.2)                     
#>  bslib         0.2.4.9000  2021-02-02 [1] Github (rstudio/bslib@b3cd7a9)     
#>  cachem        1.0.4       2021-02-13 [1] CRAN (R 4.0.2)                     
#>  callr         3.5.1       2020-10-13 [1] CRAN (R 4.0.2)                     
#>  cellranger    1.1.0       2016-07-27 [1] CRAN (R 4.0.0)                     
#>  cli           2.3.1       2021-02-23 [1] CRAN (R 4.0.2)                     
#>  codetools     0.2-16      2018-12-24 [2] CRAN (R 4.0.2)                     
#>  colorspace    2.0-0       2020-11-11 [1] CRAN (R 4.0.2)                     
#>  crayon        1.4.1       2021-02-08 [1] CRAN (R 4.0.2)                     
#>  curl          4.3         2019-12-02 [1] CRAN (R 4.0.0)                     
#>  DBI           1.1.1       2021-01-15 [1] CRAN (R 4.0.2)                     
#>  dbplyr        2.1.0       2021-02-03 [1] CRAN (R 4.0.2)                     
#>  debugme       1.1.0       2017-10-22 [1] CRAN (R 4.0.0)                     
#>  desc          1.2.0       2018-05-01 [1] CRAN (R 4.0.0)                     
#>  devtools      2.3.2       2020-09-18 [1] CRAN (R 4.0.2)                     
#>  digest        0.6.27      2020-10-24 [1] CRAN (R 4.0.2)                     
#>  dplyr       * 1.0.4       2021-02-02 [1] CRAN (R 4.0.2)                     
#>  ellipsis      0.3.1       2020-05-15 [1] CRAN (R 4.0.0)                     
#>  evaluate      0.14        2019-05-28 [1] CRAN (R 4.0.0)                     
#>  fansi         0.4.2       2021-01-15 [1] CRAN (R 4.0.2)                     
#>  fastmap       1.1.0       2021-01-25 [1] CRAN (R 4.0.2)                     
#>  forcats     * 0.5.1       2021-01-27 [1] CRAN (R 4.0.2)                     
#>  fs            1.5.0       2020-07-31 [1] CRAN (R 4.0.2)                     
#>  generics      0.1.0       2020-10-31 [1] CRAN (R 4.0.2)                     
#>  ggplot2     * 3.3.3       2020-12-30 [1] CRAN (R 4.0.2)                     
#>  glue          1.4.2       2020-08-27 [1] CRAN (R 4.0.2)                     
#>  gtable        0.3.0       2019-03-25 [1] CRAN (R 4.0.0)                     
#>  haven         2.3.1       2020-06-01 [1] CRAN (R 4.0.0)                     
#>  hms           1.0.0       2021-01-13 [1] CRAN (R 4.0.2)                     
#>  htmltools     0.5.1.1     2021-01-22 [1] CRAN (R 4.0.2)                     
#>  httr          1.4.2       2020-07-20 [1] CRAN (R 4.0.2)                     
#>  jquerylib     0.1.3       2020-12-17 [1] CRAN (R 4.0.2)                     
#>  jsonlite      1.7.2       2020-12-09 [1] CRAN (R 4.0.2)                     
#>  knitr         1.31        2021-01-27 [1] CRAN (R 4.0.2)                     
#>  lifecycle     1.0.0       2021-02-15 [1] CRAN (R 4.0.2)                     
#>  lubridate     1.7.9.2     2020-11-13 [1] CRAN (R 4.0.2)                     
#>  magrittr      2.0.1       2020-11-17 [1] CRAN (R 4.0.2)                     
#>  memoise       2.0.0       2021-01-26 [1] CRAN (R 4.0.2)                     
#>  modelr        0.1.8       2020-05-19 [1] CRAN (R 4.0.0)                     
#>  munsell       0.5.0       2018-06-12 [1] CRAN (R 4.0.0)                     
#>  pillar        1.5.0       2021-02-22 [1] CRAN (R 4.0.2)                     
#>  pkgbuild      1.2.0       2020-12-15 [1] CRAN (R 4.0.2)                     
#>  pkgconfig     2.0.3       2019-09-22 [1] CRAN (R 4.0.0)                     
#>  pkgload       1.2.0       2021-02-23 [1] CRAN (R 4.0.2)                     
#>  prettyunits   1.1.1       2020-01-24 [1] CRAN (R 4.0.0)                     
#>  processx      3.4.5       2020-11-30 [1] CRAN (R 4.0.2)                     
#>  ps            1.5.0       2020-12-05 [1] CRAN (R 4.0.2)                     
#>  purrr       * 0.3.4       2020-04-17 [1] CRAN (R 4.0.0)                     
#>  R6            2.5.0       2020-10-28 [1] CRAN (R 4.0.2)                     
#>  Rcpp          1.0.6       2021-01-15 [1] CRAN (R 4.0.2)                     
#>  readr       * 1.4.0       2020-10-05 [1] CRAN (R 4.0.2)                     
#>  readxl        1.3.1       2019-03-13 [1] CRAN (R 4.0.0)                     
#>  remotes       2.2.0       2020-07-21 [1] CRAN (R 4.0.2)                     
#>  reprex        1.0.0       2021-01-27 [1] CRAN (R 4.0.2)                     
#>  rlang         0.4.10      2020-12-30 [1] CRAN (R 4.0.2)                     
#>  rmarkdown     2.7         2021-02-19 [1] CRAN (R 4.0.2)                     
#>  rprojroot     2.0.2       2020-11-15 [1] CRAN (R 4.0.2)                     
#>  rstudioapi    0.13.0-9000 2021-02-11 [1] Github (rstudio/rstudioapi@9d21f50)
#>  rvest         0.3.6       2020-07-25 [1] CRAN (R 4.0.2)                     
#>  sass          0.3.1       2021-01-24 [1] CRAN (R 4.0.2)                     
#>  scales        1.1.1       2020-05-11 [1] CRAN (R 4.0.0)                     
#>  sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 4.0.0)                     
#>  stringi       1.5.3       2020-09-09 [1] CRAN (R 4.0.2)                     
#>  stringr     * 1.4.0       2019-02-10 [1] CRAN (R 4.0.0)                     
#>  testthat      3.0.2       2021-02-14 [1] CRAN (R 4.0.2)                     
#>  tibble      * 3.0.6       2021-01-29 [1] CRAN (R 4.0.2)                     
#>  tidyr       * 1.1.2       2020-08-27 [1] CRAN (R 4.0.2)                     
#>  tidyselect    1.1.0       2020-05-11 [1] CRAN (R 4.0.0)                     
#>  tidyverse   * 1.3.0       2019-11-21 [1] CRAN (R 4.0.0)                     
#>  usethis       2.0.1       2021-02-10 [1] CRAN (R 4.0.2)                     
#>  utf8          1.1.4       2018-05-24 [1] CRAN (R 4.0.0)                     
#>  vctrs         0.3.6       2020-12-17 [1] CRAN (R 4.0.2)                     
#>  withr         2.4.1       2021-01-26 [1] CRAN (R 4.0.2)                     
#>  xfun          0.21        2021-02-10 [1] CRAN (R 4.0.2)                     
#>  xml2          1.3.2       2020-04-23 [1] CRAN (R 4.0.0)                     
#>  yaml          2.2.1       2020-02-01 [1] CRAN (R 4.0.0)                     
#> 
#> [1] /Users/sebastiansaueruser/Rlibs
#> [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library