Error in tolower(): invalid multibyte string
This is the error that I receive when I try to run tolower() on a character vector from a file that cannot be changed (at least, not manually - too large).
Error in tolower(m) : invalid multibyte string X
The problem seems to be French company names containing the É character, although I have not checked all of them (the file is too large to do so manually).
It's strange, because my thought was that encoding issues would have been identified during read.csv(), rather than during operations after the fact.
Is there a quick way to remove these multibyte strings? Or, perhaps a way to identify and convert? Or even just ignore them entirely?
Solution 1:[1]
I was getting the same error. In my case it did not appear while reading the file, but a bit later when processing it. I realised the cause was that the file had not been read with the correct encoding in the first place.
I found a much simpler solution (at least for my case) and wanted to share it. I simply added the encoding argument as below and it worked.
read.csv(<path>, encoding = "UTF-8")
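According to the read.csv() documentation, the encoding argument only marks the strings as being in the stated encoding; it does not re-encode the input. If the file is actually Latin-1 rather than UTF-8 (an assumption worth checking when French names are involved), declaring "latin1" and converting afterwards may be the better fit. A minimal sketch, in which the temporary file, its contents, and the column name are all made up for illustration:

```r
# Hypothetical input: write a tiny Latin-1 encoded CSV to a temporary file.
# The byte 0xC9 is "É" in Latin-1 and is not valid UTF-8 on its own.
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("name\n"), as.raw(0xC9), charToRaw("cole SA\n")), path)

# Declare the encoding while reading; this marks the strings, it does not convert them.
companies <- read.csv(path, encoding = "latin1", stringsAsFactors = FALSE)
Encoding(companies$name)

# Convert the marked strings to UTF-8 before lower-casing.
name_utf8 <- enc2utf8(companies$name)
lower <- tolower(name_utf8)
```

The fileEncoding argument of read.csv() is the documented alternative when the input should be re-encoded while it is being read.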
Solution 2:[2]
I know this has been answered already but thought I'd share my solution to this as I experienced the same thing.
In my case, I used the function str_trim() from package stringr to trim whitespace from start and end of string.
library(stringr) # for str_trim()
com$uppervar <- toupper(str_trim(com$var))
Solution 3:[3]
library(tidyverse)
data_clean <- data %>%
  mutate(new_lowercase_col = tolower(enc2utf8(as.character(my_old_column))))
Where new_lowercase_col is the name of the new column I'm making out of the old uppercase one, which was called my_old_column.
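enc2utf8() takes the marked encoding of each string into account, so strings that carry no mark are assumed to be in the native encoding and may pass through unchanged. In that situation, one option is to find the entries that are not valid UTF-8 and re-encode only those. The sketch below does this in base R; the example vector is made up, and Latin-1 as the source encoding is an assumption that should be checked against the real data:

```r
# Hypothetical data: one plain ASCII name and one name stored as raw Latin-1 bytes
# (0xC9 followed by "cole"), with no encoding mark.
x <- c("Acme Ltd", rawToChar(as.raw(c(0xC9, 0x63, 0x6F, 0x6C, 0x65))))

# Flag the entries that are not valid UTF-8.
bad <- !validUTF8(x)

# Re-encode only the flagged entries, ASSUMING they are Latin-1.
x[bad] <- iconv(x[bad], from = "latin1", to = "UTF-8")

# tolower() now receives valid UTF-8 throughout.
lower <- tolower(x)
```

Checking which(bad) first also answers the question of which rows are affected, without inspecting the file by hand.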
Solution 4:[4]
# to avoid datatables warning: error in tolower(x) invalid multibyte string
# assuming all columns are char
new_data <- as.data.frame(
  lapply(old_data, enc2utf8),
  stringsAsFactors = FALSE
)
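When the data frame also holds numeric or other non-character columns, the conversion can be restricted to the character columns. A minimal sketch, assuming a made-up data frame with an id column and a name column whose second entry is marked as Latin-1:

```r
# Hypothetical data: the second name is Latin-1 bytes with a declared encoding.
latin_name <- rawToChar(as.raw(c(0xC9, 0x63, 0x6F, 0x6C, 0x65)))
Encoding(latin_name) <- "latin1"
old_data <- data.frame(
  id = 1:2,
  name = c("Acme", latin_name),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to UTF-8.
is_chr <- vapply(old_data, is.character, logical(1))
new_data <- old_data
new_data[is_chr] <- lapply(old_data[is_chr], enc2utf8)
```

The non-character columns keep their original types because they are never passed to enc2utf8().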
Solution 5:[5]
My solution to this issue
library(dplyr)   # pipes, mutate(), rename()
library(stringi) # for stri_enc_isutf8()
# Read in the CSV data
old_data <- read.csv("non_utf_data.csv", encoding = "UTF-8")
# Despite specifying UTF-8, the column below is still not valid UTF-8 (this check returns FALSE):
all(stri_enc_isutf8(old_data$problem_column))
# The code below uses a regular expression to strip every character outside the allowed set.
# You may need to tinker with the last part of the pattern, which selects the punctuation to retain.
utf_eight_data <- old_data %>%
  mutate(problem_column = gsub("[^[:alnum:][:blank:]?&/\\-]", "", problem_column)) %>%
  rename(solved_problem = problem_column)
# This column is now valid UTF-8 (this check returns TRUE):
all(stri_enc_isutf8(utf_eight_data$solved_problem))
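If the goal is only to drop the offending bytes rather than to whitelist characters, base R's iconv() with its sub argument is a commonly used alternative. This is a sketch rather than a drop-in replacement for the code above: the example vector is made up, and the exact handling of invalid input can depend on the platform's iconv implementation.

```r
# Hypothetical data: the second entry starts with the byte 0xC9, which is invalid UTF-8 here.
x <- c("Acme Ltd", rawToChar(as.raw(c(0xC9, 0x63, 0x6F, 0x6C, 0x65))))

# Convert UTF-8 to UTF-8, replacing any byte that cannot be converted with "".
cleaned <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")

# Every entry should now be valid UTF-8, so tolower() can process it.
all(validUTF8(cleaned))
lower <- tolower(cleaned)
```

Unlike re-encoding from the true source encoding, this discards the accented characters, so it suits cases where losing them is acceptable.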
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | histelheim |
| Solution 3 | |
| Solution 4 | Edgar Manukyan |
| Solution 5 | M_Merciless |
