'Some French accented characters are encoded as utf-8 but still not rendering properly

Hi there: I'm importing a Stata file that has a lot of French accented characters. The on import, I set the Encoding to utf-8. However, some of the accented characters are not rendering properly. See a sample of rows from my data-set below. How do I handle this?

test<-tibble::tribble(
  ~municipality,
  "Sainte-Anne-de-Beaupré",
  "Sainte-Anne-de-Beaupré",
  "Sainte-Anne-de-Beaupré",
  "BeauprÃ©",
  "BeauprÃ©",
  "BeauprÃ©",
  "BeauprÃ©",
  "BeauprÃ©",
  "BeauprÃ©"
)
Encoding(test$municipality)
Encoding(test$municipality)<-'utf-8'
test$municipality

r utf-8 stringr

Solution 1:^[1]

As Giacomo mentions this seems to be an example of files where part of it (as you show also correct encoded é's where UTF-8, read it as it was Latin1 and encoded again as UTF-8, this means your encoding is correct as Ã© itself are utf-8 characters as well and displayed as such. What you can do is fix such errors in the past.

Knowing how it happened means we know how to fix it!

So I wrote a function in the past (it supports some major screw ups of tripple wrong encodings) so we simulate how each character becomes after wrong encoding and then saved as utf8 again.

FixEncoding <- function() {
  # create the unicode ranges from https://www.i18nqa.com/debug/utf8-debug.html
  range <- c(sprintf("%x", seq(strtoi("0xa0"), strtoi("0xff"))))
  unicode <- vapply(range, FUN.VALUE = character(1), function(x) { parse(text = paste0("'\\u00", x, "'"))[[1]] })
  # add the ones that are missing (red ones in https://www.i18nqa.com/debug/utf8-debug.html)
  unicode <- c(c("\u0168", "\u0152", "\u017d", "\u0153", "\u017e", "\u0178", "\u2019", "\u20ac", "\u201a", "\u0192", "\u201e", "\u2026", "\u2020", "\u2021", "\u02c6", "\u2030", "\u0160", "\u2030"), unicode)
  once <- vapply(unicode, FUN.VALUE = character(1), function(x) { 
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_once <- unicode
  names(fix_once) <- once
  twice <- vapply(once, FUN.VALUE = character(1), function(x) { 
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_twice <- unicode
  names(fix_twice) <- twice
  triple <- vapply(twice, FUN.VALUE = character(1), function(x) { 
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_triple <- unicode
  names(fix_triple) <- triple
  fixes <- c(fix_triple, fix_twice, fix_once)
  return(fixes)
}

fixes <- FixEncoding()

Lets run it on your data

v <- c("Sainte-Anne-de-Beaupré", "BeauprÃ©")
v
# [1] "Sainte-Anne-de-Beaupré" "BeauprÃ©"

stringi::stri_replace_all_fixed(v, names(fixes), fixes, vectorize_all = F)
# [1] "Sainte-Anne-de-Beaupré" "Beaupré"

Another example:

str <- "\u006C\u00E9\u006C\u00E8\u006C\u00F6\u006C\u00E3"
str
# [1] "lélèlölã"

# how to corrupt it once
Encoding(str) <- "Windows-1252"
str <- iconv(str, to = "UTF-8")
str
# [1] "lÃ©lÃ¨lÃ¶lÃ£"

# add once, twice and tripple wrong encoded string to messy
messy <- c("lÃ©lÃ¨lÃ¶lÃ£", "lÃƒÂ©lÃƒÂ¨lÃƒÂ¶lÃƒÂ£", "lÃƒÆ’Ã‚Â©lÃƒÆ’Ã‚Â¨lÃƒÆ’Ã‚Â¶lÃƒÆ’Ã‚Â£")

# All three strings above would be fixed
stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = F)
# [1] "lélèlölã" "lélèlölã" "lélèlölã"

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1

'Some French accented characters are encoded as utf-8 but still not rendering properly

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]