'Dealing with character variables containing semicolons in CSV files

I have a file separated by semicolons in which one of the variables of type character contains semicolon inside it. The readr::read_csv2 function splits the contents of those variables that have semicolons into more columns, messing up the formatting of the file.

For example, when using read_csv2 to open the file below, Bill's age column will show jogging, not 41.

File:

name;hobbies;age
Jon;cooking;38
Bill;karate;jogging;41
Maria;fishing;32

Considering that the original file doesn't contain quotes around the character type variables, how can I import the file so that karate and jogging belong in the hobbies column?



Solution 1:[1]

Assuming that you have the columns name and age with a single entry per observation and hobbies with possible multiple entries the following approach works:

  1. read in the file line by line instead of treating it as a table:
tmp <- readLines(con <- file("table.csv"))
close(con)
  1. Find the position of the separator in every row. The entry before the first separator is the name the entry after the last is the age:
separator_pos <- gregexpr(";", tmp)
name <- character(length(tmp) - 1)
age <- integer(length(tmp) - 1)
hobbies <- vector(length=length(tmp) - 1, "list")
  1. fill the three elements using a for loop:
# the first line are the colnames
for(line in 2:length(tmp)){
     # from the beginning of the row to the first";"
     name[line-1] <- strtrim(tmp[line], separator_pos[[line]][1] -1) 
     # between the first ";" and the last ";".
     # Every ";" is a different elemet of the list
     hobbies[line-1] <- strsplit(substr(tmp[line], separator_pos[[line]][1] +1,  
                                        separator_pos[[line]][length(separator_pos[[line]])]-1),";")
     #after the last ";", must be an integer
     age[line-1] <- as.integer(substr(tmp[line],separator_pos[[line]][length(separator_pos[[line]])]+1, 
                               nchar(tmp[line])))
} 
  1. Create a separate matrix to hold the hobbies and fill it rowwise:
hobbies_matrix <- matrix(NA_character_, nrow = length(hobbies), ncol = max(lengths(hobbies)))
for(line in 1:length(hobbies)) 
    hobbies_matrix[line,1:length(hobbies[[line]])] <-  hobbies[[line]]   

  1. Add all variable to a data.frame:
df <- data.frame(name = name, hobbies = hobbies_matrix, age = age)
> df
   name hobbies.1 hobbies.2 age
1   Jon   cooking      <NA>  38
2  Bill    karate   jogging  41
3 Maria   fishing      <NA>  32

Solution 2:[2]

You could also do:

read.csv(text=gsub('(^[^;]+);|;([^;]+$)', '\\1,\\2', readLines('file.csv')))

   name        hobbies age
1   Jon        cooking  38
2  Bill karate;jogging  41
3 Maria        fishing  32

Solution 3:[3]

Ideally you'd ask whoever generated the file to do it properly next time :) but of course this is not always possible.

Easiest way is probably to read the lines from the file into a character vector, then clean up and make a data frame by string matching.

library(readr)
library(dplyr)
library(stringr)

# skip header, add it later
dataset <- read_lines("your_file.csv", skip = 1)

dataset_df <- data.frame(name = str_match(dataset, "^(.*?);")[, 2], 
                         hobbies = str_match(dataset, ";(.*?);\\d")[, 2], 
                         age = as.numeric(str_match(dataset, ";(\\d+)$")[, 2]))

Result:

   name        hobbies age
1   Jon        cooking  38
2  Bill karate;jogging  41
3 Maria        fishing  32

Solution 4:[4]

Using the file created in the Note at the end

1) read.pattern can read this by specifying the pattern as a regular expression with the portions within parentheses representing the fields.

library(gsubfn)
read.pattern("hobbies.csv", pattern = '^(.*?);(.*);(.*)$', header = TRUE)
##    name        hobbies age
## 1   Jon        cooking  38
## 2  Bill karate;jogging  41
## 3 Maria        fishing  32

2) Base R Using base R we can read in the lines, put quotes around the middle field and then read it in normally.

L <- "hobbies.csv" |>
  readLines() |>
  sub(pattern = ';(.*);', replacement = ';"\\1";')
read.csv2(text = L)
##    name        hobbies age
## 1   Jon        cooking  38
## 2  Bill karate;jogging  41
## 3 Maria        fishing  32

Note

Lines <- "name;hobbies;age
Jon;cooking;38
Bill;karate;jogging;41
Maria;fishing;32
"
cat(Lines, file = "hobbies.csv")

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 onyambu
Solution 3 neilfws
Solution 4 G. Grothendieck