How to keep the encoding of an .rtf file
I'm Brazilian and this is my first time working with text in R.
I need to convert .rtf to .txt files.
Since I'm Brazilian, I work with Portuguese text and need characters such as ã, õ, í, ç, and ô.
Example:
text <- "o de garantias adequadas e suficientes do cooperado ou de seus garantidores e a observância das demais normas regulamentares oficiais e internas do Sistema, e com respeito aos princípios da boa gestão, da seletividade, da diversificação"
My original code:
files <- list.files("estatutos/rtf/", pattern = "\\.rtf$")
for (file in 1:length(files)) {
  x <- files[file]
  # read RTF into R
  y <- striprtf::read_rtf(paste("estatutos/rtf/", x, "", sep = ""))
  # strip RTF encoding
  z <- striprtf::strip_rtf(y) |>
    iconv(from = "UTF-8", to = "latin1")
  # Write each to a TXT file by its original name
  write(z, paste("estatutos/txt/", x, ".txt", sep = ""))
  # Tell about progress
  cat("Processing file", x, " - ", file, "of", length(files), "\n")
}
rm(files, x, y, z, file)
My problem:
When I read the .rtf file, I got a vector like this:
[523] "o de garantias adequadas e suficientes do cooperado ou de seus garantidores e a observ"
[524] "â"
[525] "ncia das demais normas regulamentares oficiais e internas do Sistema, e com respeito aos princ"
[526] "í"
[527] "pios da boa gest"
[528] "ã"
[529] "o, da seletividade, da diversifica"
[530] ""
[531] "o de riscos e da seguran"
[532] "ç"
[533] "a operacional."
How can I combine this into a single text?
I tried striprtf::strip_rtf() and got the full text, but the accented characters were lost:
[1] o de garantias adequadas e suficientes do cooperado ou de seus garantidores e a observ?ncia das demais normas regulamentares oficiais e internas do Sistema, e com respeito aos princ?pios da boa gest?o, da seletividade, da diversificao
"?" should be 'â', 'í', 'ã' and "çã".
P.S.: I also tried collapsing the vector into a single string with paste(y, collapse = ''), but the '?' characters still appear.
Solution 1:[1]
striprtf::read_rtf() uses readLines() to read the actual file.
readLines() has an encoding argument.
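As a minimal base-R sketch of what that argument does (the sample bytes and the temporary file are made up for illustration): `encoding` only marks the strings returned by readLines() as Latin-1 or UTF-8. It does not re-encode them, so enc2utf8() is used afterwards to get UTF-8 text.

```r
# Write the Latin-1 bytes for "gestão" to a temporary file (illustrative data).
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x67, 0x65, 0x73, 0x74, 0xe3, 0x6f, 0x0a)), path)

# Tell readLines() the bytes are Latin-1; the strings are marked, not converted.
marked <- readLines(path, encoding = "latin1")
Encoding(marked)

# Convert to UTF-8 so the text can be combined safely with other UTF-8 strings.
utf8_text <- enc2utf8(marked)
```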
Most likely, something like striprtf::read_rtf(..., encoding = "latin1") would work for you, since "latin1" works for most Western European languages, such as Portuguese.
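Applied to the loop from the question, the suggestion might look like the sketch below. It is not a tested fix, and it rests on several assumptions: the files really are Latin-1, read_rtf() already returns plain text (so strip_rtf() is not called a second time), and UTF-8 output is acceptable, which replaces the question's iconv() step. The helper names collapse_fragments, write_utf8, and convert_rtf_dir are hypothetical. Output files are named file.txt rather than file.rtf.txt.

```r
# Join the fragments returned by the reader into a single string.
# sep = "" follows the question's paste(y, collapse = ''); use "\n" to keep paragraphs apart.
collapse_fragments <- function(fragments, sep = "") {
  paste(fragments, collapse = sep)
}

# Write text as UTF-8 bytes, independent of the session locale.
write_utf8 <- function(text, path) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  writeLines(enc2utf8(text), con, useBytes = TRUE)
}

# Hypothetical wrapper around the loop from the question.
convert_rtf_dir <- function(in_dir, out_dir, encoding = "latin1") {
  files <- list.files(in_dir, pattern = "\\.rtf$", full.names = TRUE)
  for (i in seq_along(files)) {
    # `encoding` is passed on to readLines(); "latin1" is an assumption about the files.
    fragments <- striprtf::read_rtf(files[i], encoding = encoding)
    out_name <- sub("\\.rtf$", ".txt", basename(files[i]))
    write_utf8(collapse_fragments(fragments), file.path(out_dir, out_name))
    cat("Processed", basename(files[i]), "-", i, "of", length(files), "\n")
  }
  invisible(files)
}

# Usage with the directories from the question (requires the striprtf package):
# convert_rtf_dir("estatutos/rtf", "estatutos/txt")
```

Writing through a binary connection with useBytes = TRUE keeps R from translating the text back into the session's native encoding, which is one place where accented characters can turn into "?".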
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | jpiversen |
