Arrow parquet partitioning: multiple datasets in the same directory structure in R

I have multiple datasets stored in partitioned Parquet format that share the same partitioning directory structure, e.g.:

1/a/predictions.parquet
1/a/summary.parquet

2/a/predictions.parquet
2/a/summary.parquet

1/b/predictions.parquet
1/b/summary.parquet

2/b/predictions.parquet
2/b/summary.parquet

and I want to read the two datasets independently using arrow::open_dataset(). I know I can use list.files(pattern = "predictions.*parquet") to get just the files I want and then read those in with open_dataset(); however, in that case I lose the partitioning.

Here's an example of what I want to do:

library(arrow)
library(dplyr)

tf <- tempfile()
dir.create(tf)

predictions <- expand.grid(var1 = 1:2, var2 = c("a", "b")) %>% 
  mutate(prediction = rnorm(nrow(.)))
summary <- expand.grid(var1 = 1:2, var2 = c("a", "b")) %>% 
  mutate(var3 = runif(nrow(.)))

write_dataset(predictions, tf, 
              partitioning = c("var1", "var2"), 
              basename_template = "predictions-{i}.parquet",
              hive_style = FALSE)
write_dataset(summary, tf, 
              partitioning = c("var1", "var2"), 
              basename_template = "summary-{i}.parquet",
              hive_style = FALSE)

list.files(tf, recursive = TRUE)

# partitioning lost
list.files(tf, recursive = TRUE, pattern = "predictions.*parquet", 
           full.names = TRUE) %>% 
  open_dataset() %>% 
  collect()

# tries to read both datasets at once
open_dataset(tf, partitioning = c("var1", "var2")) %>% 
  collect()

# what I want to do
open_dataset(tf, pattern = "predictions.*parquet",
             partitioning = c("var1", "var2")) %>% 
  collect()

unlink(tf)


Solution 1:[1]

This isn't currently a feature of the Datasets API; however, I've opened a feature request on the project JIRA: https://issues.apache.org/jira/browse/ARROW-15943

In the meantime, perhaps you could move the files into directories called "summary" or "predictions" so that the directory name can be used as an extra partition field in the schema (or rather, write a script to do that - let me know if you need help with that).
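One way to do that move is a small shell script. This is a minimal sketch, assuming the non-hive layout from the question (var1/var2/ directories) and the "-{i}" filename counter from the write_dataset() calls above; the demo tree it builds and the trailing partition-directory naming are illustrative, not part of the original answer.

```shell
# Build a toy tree matching the question's layout (empty placeholder files).
root=$(mktemp -d)
for v1 in 1 2; do
  for v2 in a b; do
    mkdir -p "$root/$v1/$v2"
    touch "$root/$v1/$v2/predictions-0.parquet" "$root/$v1/$v2/summary-0.parquet"
  done
done

# Move each file into a subdirectory named after its filename stem, so the
# stem becomes a trailing partition field (read later as "analysis_type").
cd "$root"
for f in */*/*.parquet; do
  dir=$(dirname "$f")
  stem=$(basename "$f" .parquet)  # e.g. "predictions-0"
  stem=${stem%%-*}                # strip the "-{i}" counter -> "predictions"
  mkdir -p "$dir/$stem"
  mv "$f" "$dir/$stem/"
done

ls 1/a   # now shows the "predictions" and "summary" directories
```

After this, each file lives at var1/var2/analysis_type/..., which is exactly the three-level directory partitioning the open_dataset() call below expects.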

Then run something like this:

open_dataset(tf, partitioning = c("var1", "var2", "analysis_type")) %>% 
  filter(analysis_type == "predictions") %>%
  select(-analysis_type) %>%
  collect()

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 thisisnic