Reading text files and Umlaute hassle

Data is often stored as plain text file. That’s good because it is a simple format. However, simplicity comes with a cost: Not all questions may have definite answers. The most common hassle when reading/importing text files is that the encoding scheme is unknown, aka wrong. This problem mostly occurs when, say, a Mac user stores a text file, where per default UTF8 text encoding is applied. In contrast, on a Windows machine, Windows-encoding (often dubbed “latin1”,“Windows 1252” or “ISO-8859-1”) is the default. What happens then is similar to what would happen if type a letter any you read thinking that it is in French – confusion guaranted.

With a little more details,

  • “Windows 1252” is a widely used 8-bit encoding scheme, including mostly English letters.
  • “ISO-8859-1” is also 8-bit, and covers the latin character set, thereby suitably depicting most European Languages. For example, the letter “A” is encoded as “0032” (hex).
  • “UTF-8” (“Unicode Transformation Format”) is the predominant encoding in the WWW; it covers more than 100 scripts (languages) from all over the world, including emojis, and special characters. Encoding a letter via UTF8 may produce a byte sequence betweeen 1 and 4 bytes. More frequently used letters are stored in the beginning of the code page, so that only 1 byte is needed for encoding.

According to Wikipedia, UTF8 is the defacto standard in the internet:

A little more in depth, consider that the text binary representation for some text (string). It will be pinned down to some bits and bytes. The question is what “translation code” from text to bits (or vice versa) is used? That’s exactly what’s defined in a text encoding scheme such aus UTF8. By the way, it seems that UTF8 is a reasonable standard, dear Windows user, consider following this standard. Your favorite text editor, such as Notepad++, or Text Mate, will offer you an option to “save with encoding” or similar. By the way, RStudio will do that for you, too (see File > Save with Encoding…).

So much for theory. Now, let’s test it. Assume we have a text file, and we do not know the encoding. So what we do is, we guess it:

Guessing encoding

library(readr)
library(readxl)

Here’s the file (ignore for the moment that the file name suggests an encoding):

fileurl <- "https://data-se.netlify.com/download/text-latin1.csv"
guess_encoding(fileurl)
## # A tibble: 1 x 2
##   encoding   confidence
##   <chr>           <dbl>
## 1 ISO-8859-1       0.36

Now that we have an educated guess, let’s try to parse it. The parameter locale can be used to adapt to local language settings such as used for text encoding.

Read textfile with specified encoding

text <- read_csv(file = fileurl, 
                 locale = locale(encoding = "ISO-8859-1"))
## Parsed with column specification:
## cols(
##   col1 = col_character(),
##   col2 = col_double(),
##   col3 = col_double()
## )

Let’s see whether it worked out:

text
## # A tibble: 3 x 3
##   col1        col2  col3
##   <chr>      <dbl> <dbl>
## 1 Nürnberg     1.1   1.1
## 2 München      2     2  
## 3 Düsseldorf   3     3

Save text file with specified encoding

Note that write_csv() will only write to UTF8.

write_csv(x = text, 
          path = "path/to/file.csv")

However, write.csv() and friends can be used to define the encoding:

write.csv(x = text, 
          file = "test2.csv", 
          fileEncoding = "utf8")

Hazzle free with Excel?

Maybe easier to use an excel data file where the encoding is defined within the data (xls(x)) file? Well, yes, but some times no.

Let’s try it:

textxlsxurl <- "https://data-se.netlify.com/download/text.xlsx"

destfilename <- "text.xlsx"

textxlsx <- download.file("https://data-se.netlify.com/download/text.xlsx",
                          destfile = destfilename)

file.exists(destfilename)
## [1] TRUE
text_xlsx <- read_xlsx(path = destfilename)
text_xlsx
## # A tibble: 3 x 3
##   col1        col2  col3
##   <chr>      <dbl> <dbl>
## 1 Nürnberg       1     1
## 2 München        2     2
## 3 Düsseldorf     3     3
if (file.exists(destfilename)) 
  file.remove(destfilename)  # remove file
## [1] TRUE

That seems to work easily. Maybe that’s the easiest way at the end of the day (?).

One problem that may arise - besides building on proprietary code that may change without notice - is that Excel may have problems reading a UTF8 csv, as explained here.

Get encoding of string

Assume there is some text (string), already loaded in your R session. How to determine and change the encoding then?

Here’s the string:

text$col1
## [1] "Nürnberg"   "München"    "Düsseldorf"

What’s the encoding?

Encoding(text$col1)
## [1] "UTF-8" "UTF-8" "UTF-8"

Now let’s change the encoding:

Encoding(text$col1) <- "latin1"
Encoding(text$col1)
## [1] "latin1" "latin1" "latin1"

To get an overview on supported encodings, use iconvlist(). Be aware, the list is long:

length(iconvlist())
## [1] 419

To get the default encoding of your system, use this function:

Sys.getlocale()
## [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

If R says the encoding is unknown it often means that it’s the system’s default.

For more details, check out this post.