arrow parquet partitioning, multiple datasets in same directory structure in R
I have multiple datasets stored in partitioned parquet format, all using the same partitioning directory structure, e.g. the directory structure looks like this:
1/a/predictions.parquet
1/a/summary.parquet
2/a/predictions.parquet
2/a/summary.parquet
1/b/predictions.parquet
1/b/summary.parquet
2/b/predictions.parquet
2/b/summary.parquet
and I want to read the two datasets independently using arrow::open_dataset(). I know I can use list.files(pattern = "predictions.*parquet") to get just the files I want and then read those in with open_dataset(); however, in that case I lose the partitioning.
Here's an example of what I want to do:
library(arrow)
library(dplyr)
tf <- tempfile()
dir.create(tf)
predictions <- expand.grid(var1 = 1:2, var2 = c("a", "b")) %>%
  mutate(prediction = rnorm(nrow(.)))
summary <- expand.grid(var1 = 1:2, var2 = c("a", "b")) %>%
  mutate(var3 = runif(nrow(.)))
write_dataset(predictions, tf,
              partitioning = c("var1", "var2"),
              basename_template = "predictions-{i}.parquet",
              hive_style = FALSE)
write_dataset(summary, tf,
              partitioning = c("var1", "var2"),
              basename_template = "summary-{i}.parquet",
              hive_style = FALSE)
list.files(tf, recursive = TRUE)
# reads only the predictions files, but the partitioning columns (var1, var2) are lost
list.files(tf, recursive = TRUE, pattern = "predictions.*parquet",
           full.names = TRUE) %>%
  open_dataset() %>%
  collect()
# keeps the partitioning, but tries to read both datasets at once
open_dataset(tf, partitioning = c("var1", "var2")) %>%
  collect()
# what I want to do (hypothetical `pattern` argument)
open_dataset(tf, pattern = "predictions.*parquet",
             partitioning = c("var1", "var2")) %>%
  collect()
unlink(tf, recursive = TRUE)
Solution 1:[1]
This isn't currently a feature of the Datasets API; however, I've opened up a feature request on the project JIRA here: https://issues.apache.org/jira/browse/ARROW-15943
In the meantime, perhaps you could move the files into directories called "summary" or "predictions", so that the directory name can be used as an extra partitioning field (or rather, write a script to do that - let me know if you need help with that).
Then run something like this:
open_dataset(tf, partitioning = c("var1", "var2", "analysis_type")) %>%
  filter(analysis_type == "predictions") %>%
  select(-analysis_type) %>%
  collect()
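The file move suggested above could be scripted along these lines. This is only a rough sketch in base R: move_into_type_dirs() is a hypothetical helper, not part of arrow, and it assumes the dataset name is whatever comes before the first "-" or "." in the file name (as in "predictions-0.parquet"). The demonstration uses empty placeholder files rather than real parquet files, just to show the resulting layout.

```r
# Hypothetical helper (not part of arrow): move every parquet file under
# `root` into a subdirectory named after the part of its file name before
# the first "-" or ".", e.g. "1/a/predictions-0.parquet" becomes
# "1/a/predictions/predictions-0.parquet".
move_into_type_dirs <- function(root) {
  files <- list.files(root, pattern = "\\.parquet$",
                      recursive = TRUE, full.names = TRUE)
  for (f in files) {
    type <- sub("[-.].*$", "", basename(f))
    # Skip files that already sit in a directory named after their type,
    # so running the helper twice does not nest directories.
    if (basename(dirname(f)) == type) next
    target_dir <- file.path(dirname(f), type)
    dir.create(target_dir, showWarnings = FALSE)
    file.rename(f, file.path(target_dir, basename(f)))
  }
  # Return the new layout as paths relative to `root`.
  invisible(list.files(root, recursive = TRUE))
}

# Demonstration on empty placeholder files laid out like the example above.
demo_root <- tempfile()
for (d in c("1/a", "1/b", "2/a", "2/b")) {
  dir.create(file.path(demo_root, d), recursive = TRUE)
  file.create(file.path(demo_root, d,
                        c("predictions-0.parquet", "summary-0.parquet")))
}
moved <- move_into_type_dirs(demo_root)
print(moved)
unlink(demo_root, recursive = TRUE)
```

On the example from the question you would call move_into_type_dirs(tf) once and then open the dataset with the three partitioning fields as shown above. It is worth trying this on a copy of the data first, since the files are moved in place.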
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | thisisnic |
