How to connect to parquet files in Azure Blob Storage with arrow::open_dataset?

I am open to other ways of doing this. Here are my constraints:

  • I have parquet files in a container in Azure Blob Storage
  • These parquet files will be partitioned by product ID as well as by date (year/month/day)
  • I am doing this in R and want to be able to connect interactively (not just set up a notebook in Databricks, though that is something I will probably want to figure out later)

Here's what I am able to do:

  • I understand how to use arrow::open_dataset() to connect to a local parquet directory: ds <- arrow::open_dataset(filepath, partitioning = "product")
  • I can connect to, view, and download from my blob container with the AzureStor package. I can download a single parquet file this way and turn it into a data frame:
blob <- AzureStor::storage_endpoint("{URL}", key="{KEY}")
cont <- AzureStor::storage_container(blob, "{CONTAINER-NAME}")
parq <- AzureStor::storage_download(cont, src = "{FILE-PATH}", dest = NULL)
df <- arrow::read_parquet(parq)

What I haven't been able to figure out is how to use arrow::open_dataset() to reference the parent directory of {FILE-PATH}, where I have all the parquet files, using the connection to the container that I'm creating with AzureStor. arrow::open_dataset() only accepts a character vector as the "sources" parameter. If I just give it the URL with the path, I'm not passing any kind of credential to access the container.



Solution 1:[1]

Unfortunately, you probably are not going to be able to do this today purely from R.

Arrow-R is based on Arrow-C++, and Arrow-C++ does not yet have a filesystem implementation for Azure. There are JIRA tickets (ARROW-9611, ARROW-2034) for creating one, but these tickets are not in progress at the moment.

In Python it is possible to create a filesystem purely in Python using the fsspec adapter. Since there is a Python SDK for Azure Blob Storage, it should be possible to do what you want in Python today.

Presumably something similar could be created for R, but you would still need to build the R equivalent of the fsspec adapter, and that would involve some C++ code.

Solution 2:[2]

If you use Azure Synapse, then you can connect to your data over ODBC as if it were a SQL Server database, and it has support for partitioning and other file types as well. The pricing, from what I recall, is something like $5/month fixed plus $5/TB queried.

Querying data would look something like this...

library(odbc)
syncon <- dbConnect(odbc(),
                    Driver = "SQL Server Native Client 11.0",
                    Server = "yourname-ondemand.sql.azuresynapse.net",
                    Database = "dbname",
                    UID = "sqladminuser",
                    PWD = rstudioapi::askForPassword("Database password"),
                    Port = 1433)

somedata <- dbGetQuery(syncon, r"---{
    SELECT TOP 100
           result.filepath(1) AS year,
           result.filepath(2) AS month,
           *
    FROM OPENROWSET(
           BULK 'blobcontainer/directory/*/*/*.parquet',
           DATA_SOURCE = 'blobname',
           FORMAT = 'parquet'
    ) AS [result]
    ORDER BY node, pricedate, hour}---")

The filepath(n) function returns the directory name matched by the nth wildcard in the BULK path, so here filepath(1) is the year directory and filepath(2) the month directory.

Here's the Microsoft documentation: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-specific-files

You can also create views so that people who like SQL but not parquet files can query them without having to know anything about the underlying data structure; to them it will just look like a SQL Server database.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Pace
Solution 2 Dean MacGregor