Some French accented characters are encoded as utf-8 but still not rendering properly
Hi there: I'm importing a Stata file that has a lot of French accented characters. On import, I set the encoding to UTF-8. However, some of the accented characters are not rendering properly. See a sample of rows from my dataset below. How do I handle this?
test<-tibble::tribble(
~municipality,
"Sainte-Anne-de-Beaupré",
"Sainte-Anne-de-Beaupré",
"Sainte-Anne-de-Beaupré",
"Beaupré",
"Beaupré",
"Beaupré",
"Beaupré",
"Beaupré",
"Beaupré"
)
Encoding(test$municipality)
Encoding(test$municipality)<-'utf-8'
test$municipality
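As a quick aside (not part of the original post), the affected rows can be spotted by searching for the stray "Ã" that this kind of double encoding leaves behind:
# which rows carry the double-encoding pattern? a spurious "Ã" is the telltale sign
grepl("\u00c3", test$municipality)
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE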
Solution 1:[1]
As Giacomo mentions, this looks like a file where part of the text was already correct UTF-8 (you also show properly encoded é's), but at some point it was read as if it were Latin-1 and then saved as UTF-8 again. Your import encoding is therefore not the problem: the resulting "Ã©" sequences are themselves made of valid UTF-8 characters, so they are displayed exactly as stored. What you can do is repair the damage that was introduced in the past.
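To make the mechanism concrete, here is a minimal sketch (not from the original answer) that reproduces the corruption for a single character:
# "é" is stored in UTF-8 as the two bytes c3 a9; decoding those bytes as
# Latin-1 yields the two characters "Ã" and "©", i.e. the "Ã©" in your data
x <- "\u00e9"                   # a correctly encoded "é"
y <- rawToChar(charToRaw(x))    # the same two bytes, with no declared encoding
iconv(y, from = "latin1", to = "UTF-8")
# [1] "Ã©"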
Knowing how it happened means we know how to fix it!
So I wrote a function for this a while ago (it even handles some major screw-ups with triple-wrong encodings): it simulates what each character turns into after being wrongly decoded and saved as UTF-8 again, once, twice and three times, and builds a lookup table from the corrupted forms back to the originals.
FixEncoding <- function() {
  # build the Unicode range 0xA0-0xFF from https://www.i18nqa.com/debug/utf8-debug.html
  range <- sprintf("%x", seq(strtoi("0xa0", 16L), strtoi("0xff", 16L)))
  unicode <- vapply(range, FUN.VALUE = character(1), function(x) { parse(text = paste0("'\\u00", x, "'"))[[1]] })
  # add the ones that are missing (the red ones in https://www.i18nqa.com/debug/utf8-debug.html)
  unicode <- c(c("\u0168", "\u0152", "\u017d", "\u0153", "\u017e", "\u0178", "\u2019", "\u20ac", "\u201a", "\u0192", "\u201e", "\u2026", "\u2020", "\u2021", "\u02c6", "\u2030", "\u0160", "\u2030"), unicode)
  # simulate one round of corruption: treat each character's UTF-8 bytes as
  # Windows-1252/native text and re-encode them as UTF-8
  once <- vapply(unicode, FUN.VALUE = character(1), function(x) {
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_once <- unicode
  names(fix_once) <- once
  # corrupt a second time
  twice <- vapply(once, FUN.VALUE = character(1), function(x) {
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_twice <- unicode
  names(fix_twice) <- twice
  # and a third time
  triple <- vapply(twice, FUN.VALUE = character(1), function(x) {
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_triple <- unicode
  names(fix_triple) <- triple
  # apply the deepest (triple) corruptions first, then twice, then once
  fixes <- c(fix_triple, fix_twice, fix_once)
  return(fixes)
}
fixes <- FixEncoding()
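Note that the simulation relies on the session's native encoding: because Encoding<- only recognizes "latin1", "UTF-8" and "bytes", the "Windows-1252" label effectively marks the strings as native-encoded, so building the table assumes a Windows-1252 (or Latin-1) locale, as on a typical Windows setup. On such a system the resulting lookup maps each corrupted sequence back to the intended character; for example (an illustrative check, not from the original answer):
# names(fixes) are the mojibake sequences, the values are the intended characters
unname(fixes[names(fixes) == "\u00c3\u00a9"])   # look up the entry named "Ã©"
# [1] "é"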
Let's run it on your data:
v <- c("Sainte-Anne-de-Beaupré", "Beaupré")
v
# [1] "Sainte-Anne-de-Beaupré" "Beaupré"
stringi::stri_replace_all_fixed(v, names(fixes), fixes, vectorize_all = FALSE)
# [1] "Sainte-Anne-de-Beaupré" "Beaupré"
Another example:
str <- "\u006C\u00E9\u006C\u00E8\u006C\u00F6\u006C\u00E3"
str
# [1] "lélèlölã"
# how to corrupt it once
Encoding(str) <- "Windows-1252"
str <- iconv(str, to = "UTF-8")
str
# [1] "lélèlölã"
# add the once-, twice- and triple-corrupted strings to messy
messy <- c("lélèlölã", "lélèlölã", "lélèlölã")
# All three strings above would be fixed
stringi::stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = FALSE)
# [1] "lélèlölã" "lélèlölã" "lélèlölã"
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
