Loading a CSV with fread stops because of a too-large string

This is the command I'm using:

dallData <- fread("data.csv", showProgress = TRUE, colClasses = c(rep("NULL", 2), "character", rep("NULL", 37)))

But I get this error when trying to load it: R character strings are limited to 2^31-1 bytes

Is there any way to skip those values?



Solution 1:[1]

Here's a strategy that may work, or at least narrow down the possible sources of the error. It assumes you have enough working memory to hold the data and that your separators really are commas; if they are actually tabs, you will need to modify the code accordingly. The likely cause is mismatched quotes: an unclosed quote can make the reader treat everything after it as one enormous string, which would explain the 2^31-1 byte limit being hit. The plan is to read the file with readLines, which ignores quotes entirely, and then figure out which line or lines are at fault using count.fields, table, and which.

input <- readLines("data.csv")   # readLines ignores quotes
counts.def <- count.fields(textConnection(input),
                           sep = ",")  # default quote characters are both ' and "
table(counts.def)  # might show a variety of field counts

# Second try with just double quotes
counts.dbl <- count.fields(textConnection(input),
                           sep = ",", quote = "\"")  # double quotes only
table(counts.dbl)  # if all counts are the same, all you need to do is change the quote argument
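
If the double-quote-only counts do come out uniform, the same quote setting can be passed to the reader. Here is a minimal sketch using base R's read.table, which by default treats both ' and " as quote characters (read.csv already defaults to double quotes only). The three-line csv_text sample is made up for illustration and is not your data.

```r
# Hypothetical sample: an apostrophe that should be read as plain text.
csv_text <- "v1,v2,v3
text,O'Malley,text
3,4,5"

# quote = "\"" tells read.table that only double quotes delimit strings,
# so the apostrophe in O'Malley is kept as an ordinary character.
df <- read.table(text = csv_text, sep = ",", header = TRUE,
                 quote = "\"", stringsAsFactors = FALSE)
print(df)
```

Using quote = "" instead would disable quote handling altogether, which may be the better choice if the file contains stray double quotes as well.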

Depending on the results, you may need to edit certain lines, which can be identified with which(counts.def < 40), assuming most lines have 40 fields, which your colClasses argument suggests is the expected number per line.
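
A small sketch of that lookup, under some assumptions of mine: the input vector below is a made-up four-line stand-in for your file, the expected field count is taken to be the most common one rather than hard-coded to 40, and lines with an NA count are flagged too, since count.fields reports NA for a line that starts a quoted string running over a line break.

```r
# Hypothetical stand-in for readLines("data.csv"); line 3 is one field short.
input <- c("v1,v2,v3",
           "a,b,c",
           "d,e",
           "f,g,h")

con <- textConnection(input)
counts <- count.fields(con, sep = ",", quote = "\"")
close(con)

# Take the most common field count as the expected one (assumption).
expected <- as.integer(names(which.max(table(counts))))

# Flag lines whose count is missing or differs from the expected value.
bad <- which(is.na(counts) | counts != expected)
print(bad)
print(input[bad])  # inspect the offending lines before editing them
```

Comparing against the modal count rather than using < 40 also catches lines with too many fields.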

(If the [ram] tag means you are short on memory and are getting warnings, or are using virtual memory, which slows things down horribly, then you should restart your OS and load only R before trying again. R needs a contiguous block of memory, and Windows isn't very good at memory management.)

Here's a small test case to work with:

input <- readLines(textConnection(
"v1,v2,v3,v4,v5,v6 
text, text, text, text, text, text 
text, text, O'Malley, text,text,text 
junk,junk, more junk, \"text\", tex\"t, nothing 
3,4,5,6,7,8"))
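
One way to work with a sample like this is to count fields under each quote setting and compare. The sketch below redefines the sample as a character vector so it runs on its own; sample_lines and count_with are names I made up. How R treats the unbalanced quotes is not something I want to promise, so those calls are wrapped in tryCatch and their results are printed rather than predicted.

```r
# Same content as the test case, stored as a character vector (one element per line).
sample_lines <- c("v1,v2,v3,v4,v5,v6",
                  "text, text, text, text, text, text",
                  "text, text, O'Malley, text,text,text",
                  "junk,junk, more junk, \"text\", tex\"t, nothing",
                  "3,4,5,6,7,8")

# Hypothetical helper: count fields for a given set of quote characters,
# returning the error message instead of stopping if R objects.
count_with <- function(lines, quote) {
  con <- textConnection(lines)
  on.exit(try(close(con), silent = TRUE))
  tryCatch(count.fields(con, sep = ",", quote = quote),
           error = function(e) conditionMessage(e))
}

counts.none <- count_with(sample_lines, "")     # quoting disabled: raw comma count
counts.dbl  <- count_with(sample_lines, "\"")   # double quotes only
counts.def  <- count_with(sample_lines, "\"'")  # both quote characters (the default)

print(counts.none)
print(counts.dbl)
print(counts.def)
```

With quoting disabled every line shows 6 fields, so any other result from the quoted runs points at the lines where a quote character is getting in the way.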

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1