'Limit options for col_type when importing using read_csv
I am using readr::read_csv to import a series of files, then updating with data through and API. read_csv generally does a good job of guessing column type, but seems to default to logical if there isn't data in the first 1000 rows of the file. If I was always using the same file/ knew which columns the specific file had, I could specify the column type (i.e. using col_spec= cols( sea_level_pressure_set_1d = col_double(),...) but since there are multiple files, they don't all have all the columns.
Specifically, it seems that read_csv defaults to logical, which leads to parsing failures.
Is there a way to force read_csv to follow a hierarchy of column types; limit its options to, say just character, double and datetime; or use a .default setting with unknown exceptions? using the .default arguement, it seems that I need to specify all the exceptions, and my problems arise when a file has a different format for an exception.
I would like read_csv to only assign datettime, numeric, and character columns.
Solution 1:[1]
I think I found your solution but may need more details from you to confirm:
First step:
You first want to get a col() output for each of your files which you can do manually or use spec_col() formula to automatically extract that for you (which I show below).
#first create mock dataframe
df1 <- tibble(x=1:3,y=letters[1:3])
df2 <- tibble(a=letters[4:6],b=4:6)
write_csv(df2,file = "df2.csv")
write_csv(df1,file = "df1.csv")
#create file path to pull in files
file_path <- list.files(path=".",pattern="df")
#extract out each file's col_type, set the default, and put it into a list
list <- map(file_path,
~spec_csv(.x,
col_types = cols(.default=col_character())
)
)
This produces the below output:
> list
[[1]]
cols(
x = col_double(),
y = col_character()
)
[[2]]
cols(
a = col_character(),
b = col_double()
)
second step: assign default (optional)
If you want to assign a default value then you need to use the below for loop (I couldn't figure out how to do this with map but the below works). This basically takes advantage of the special properties of the cols class object and assigns a default based on what you assign it. If you assign it a literal character value, the default will become character. If you assign it a literal numeric value, then the default will become numerical.
#get the length of your list
x <- seq_along(list)
#loop through each element of the list and assign the <collect_guess> argument a value
for(i in x){
list[[i]][[2]] <- list[[i]][[2]] <- "character"
}
This produces the below output:
> list
[[1]]
cols(
.default = col_character(),
x = col_double(),
y = col_character()
)
[[2]]
cols(
.default = col_character(),
a = col_character(),
b = col_double()
)
Step3:
With this variable "list" which stores each's file's col() argument, we then need to make a new tibble that pairs the col() arguments with the respective files. We can do this with a simple tibble.
col_map<- tibble(x=list,y=file_path)
Step 4
Then use a the map2_df function that allows you to pass simultaneously through two vectors (in this case the col() arguments (.x) and file path(.y) to a common function read_csv
map2_df(.x=col_map$x,
.y=col_map$y,
.f = ~read_csv(file=.y,col_types=.x)
)
That should save all your files to a new tibble.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
