Binning and recoding with R - some recommendations

Recoding means changing the levels of a variable, for instance changing “1” to “woman” and “2” to “man”. Binning means aggregating several variable levels into one, for instance aggregating all values from “1.00 meter” to “1.60 meter” into “small_size”.
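
To make the two definitions concrete, here is a minimal base R sketch. The vectors `sex_code` and `height_m` and the label "other" are made up for illustration; the packages reviewed in this post are not needed here.

```r
# Recoding: every code is mapped to exactly one label.
sex_code <- c(1, 2, 2, 1)                      # hypothetical data
sex <- factor(sex_code, levels = c(1, 2), labels = c("woman", "man"))
as.character(sex)
#> [1] "woman" "man"   "man"   "woman"

# Binning: a whole range of values collapses into one level.
height_m <- c(1.52, 1.75, 1.60, 1.88)          # hypothetical data
size <- ifelse(height_m >= 1.00 & height_m <= 1.60, "small_size", "other")
size
#> [1] "small_size" "other"      "small_size" "other"
```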

Both operations are frequently necessary in practical data analysis. In this post, we review some methods to accomplish these two tasks.

Let’s load some example data:

data(tips, package = "reshape2")

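Load the packages used below:

library(mosaic)  # provides xchisq.test()
library(dplyr)   # provides %>%, mutate(), case_when(), count(), glimpse()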

One nice way is using the function case_when() from dplyr, which is part of the tidyverse. Consider this example:

tips$tip_gruppe <- case_when(
  tips$tip < 2 ~ "scrooge",
  tips$tip < 4 ~ "ok",
  tips$tip < 8 ~ "generous",
  TRUE ~ "in love"
)

case_when() is also pipe-friendly and works inside mutate():

tips <- tips %>% 
  mutate(tip_gruppe = case_when(
    tip < 2 ~ "scrooge",
    tip < 4 ~ "ok",
    tip < 8 ~ "generous",
    TRUE ~ "in love"
  ))

One subsequent step could be to use the new variable in a \(\chi^2\) test:

xchisq.test(tip_gruppe ~ sex, data = tips)
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  tally(x, data = data)
#> X-squared = 1.7171, df = 3, p-value = 0.6331
#> 
#>    16       35   
#> (18.18)  (32.82) 
#> [0.262]  [0.145] 
#> <-0.51>  < 0.38> 
#>    
#>     0        2   
#> ( 0.71)  ( 1.29) 
#> [0.713]  [0.395] 
#> <-0.84>  < 0.63> 
#>    
#>    54       92   
#> (52.06)  (93.94) 
#> [0.072]  [0.040] 
#> < 0.27>  <-0.20> 
#>    
#>    17       28   
#> (16.05)  (28.95) 
#> [0.057]  [0.031] 
#> < 0.24>  <-0.18> 
#>    
#> key:
#>  observed
#>  (expected)
#>  [contribution to X-squared]
#>  <Pearson residual>
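
One detail that may matter for tables like the one above: case_when() returns a character vector here, so tabulations sort the groups alphabetically ("generous", "in love", "ok", "scrooge") rather than from small to large tips. A possible remedy, sketched below on a made-up data frame `toy` (not the tips data), is to convert the result to a factor with explicit levels.

```r
library(dplyr)

# Toy data, invented for illustration only.
toy <- data.frame(tip = c(1.5, 3.0, 5.0, 9.0, 2.5))

toy <- toy %>%
  mutate(
    tip_gruppe = case_when(
      tip < 2 ~ "scrooge",
      tip < 4 ~ "ok",
      tip < 8 ~ "generous",
      TRUE ~ "in love"
    ),
    # Set the level order explicitly instead of relying on alphabetical sorting.
    tip_gruppe = factor(tip_gruppe,
                        levels = c("scrooge", "ok", "generous", "in love"))
  )

# Counts now appear in the order scrooge, ok, generous, in love.
table(toy$tip_gruppe)
```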

Similarly, use case_when for nominal variables:

tips <- tips %>% 
  mutate(weekend = case_when(
    day == "Fri" ~ "weekend",
    day == "Sat" ~ "weekend",
    TRUE ~ "keep on working"
  ))

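Note that TRUE acts as the “else” branch: case_when() checks the conditions from top to bottom and uses the first one that holds, so TRUE catches every row not matched before. In this case, read “else ‘weekend’ is ‘keep on working’”.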
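
The role of TRUE and the order of the conditions can be sketched with a few made-up tip values. The object names (`in_order`, `wrong_order`, `no_fallback`) are only for illustration.

```r
library(dplyr)

tip <- c(1.5, 3.0, 9.0)  # made-up values

# Conditions are checked top to bottom; the first one that is TRUE decides.
in_order <- case_when(
  tip < 2 ~ "scrooge",
  tip < 4 ~ "ok",
  TRUE ~ "in love"
)
in_order
#> [1] "scrooge" "ok"      "in love"

# A broader condition placed first swallows the narrower one below it.
wrong_order <- case_when(
  tip < 4 ~ "ok",
  tip < 2 ~ "scrooge",  # never reached, since tip < 2 implies tip < 4
  TRUE ~ "in love"
)
wrong_order
#> [1] "ok"      "ok"      "in love"

# Without the TRUE fallback, values matching no condition become NA.
no_fallback <- case_when(
  tip < 2 ~ "scrooge",
  tip < 4 ~ "ok"
)
no_fallback
#> [1] "scrooge" "ok"      NA
```

This is why the tip example above lists its thresholds in increasing order.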

A convenient way to bin several values (such as “Fri”, “Sat”) into one (such as “weekend”) is the %in% operator:

tips <- tips %>% 
  mutate(weekend = case_when(
    day %in% c("Fri", "Sat") ~ "weekend",
    TRUE ~ "keep on working"
  ))
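
As a quick check of what %in% does, the following base R sketch compares it with the chained == comparisons used earlier. The vector `day` is a small stand-in, not the column from the tips data.

```r
day <- c("Thur", "Fri", "Sat", "Sun")  # stand-in vector for illustration

via_in <- day %in% c("Fri", "Sat")      # one test against a set of values
via_or <- day == "Fri" | day == "Sat"   # the same test, spelled out

via_in
#> [1] FALSE  TRUE  TRUE FALSE
identical(via_in, via_or)
#> [1] TRUE
```

With more than two values to bin, the %in% version stays one line, whereas the == version needs one more comparison per value.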

Another convenient way is using rec() from the R package sjmisc:

library(sjmisc)
tips <- rec(tips, day,
            rec = "Fri=Weekend; Sat=Weekend; else = keep_working")

count(tips, day_r)
#> # A tibble: 2 x 2
#>   day_r            n
#>   <fct>        <int>
#> 1 keep_working   138
#> 2 Weekend        106

Note that rec() appends a new, recoded variable with the suffix _r (here day_r) and keeps the original variable. See:

glimpse(tips)
#> Observations: 244
#> Variables: 10
#> $ total_bill <dbl> 16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26....
#> $ tip        <dbl> 1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.9...
#> $ sex        <fct> Female, Male, Male, Male, Female, Male, Male, Male,...
#> $ smoker     <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,...
#> $ day        <fct> Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, S...
#> $ time       <fct> Dinner, Dinner, Dinner, Dinner, Dinner, Dinner, Din...
#> $ size       <int> 2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 3, ...
#> $ tip_gruppe <chr> "scrooge", "scrooge", "ok", "ok", "ok", "generous",...
#> $ weekend    <chr> "keep on working", "keep on working", "keep on work...
#> $ day_r      <fct> keep_working, keep_working, keep_working, keep_work...

rec() works with the pipe, too:

tips <- tips %>% 
  rec(day,
      rec = "Fri=Weekend; Sat=Weekend; else = keep_working")

rec() is convenient as one does not need to wrap it in mutate().

See ?rec for more information.

The good thing about both approaches (case_when() and rec()) is that each function can be used for recoding as well as for some binning tasks.