Reading partitioned parquet files in DuckDB

Background:

  • DuckDB allows direct querying of Parquet files, e.g. con.execute("SELECT * FROM 'Hierarchy.parquet'")

  • Parquet allows files to be partitioned by column values. When a Parquet file is partitioned, a top-level FOLDER is created with the name of the Parquet file, containing subfolders named after the column values; these subfolders then contain the actual Parquet data files. e.g. Hierarchy.parquet (folder) --> date=20220401 (subfolder) --> part1.parquet
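
The layout above can be sketched in a few lines of stdlib Python: the partition key/value pairs live entirely in the directory names. (The helper below is illustrative only, not a DuckDB or pyarrow API.)

```python
# Minimal sketch (stdlib only): hive-style layouts encode column values
# in directory components such as "date=20220401".
from pathlib import PurePosixPath

def hive_partition_keys(path):
    """Extract column=value pairs from the directory components of a path."""
    keys = {}
    for part in PurePosixPath(path).parts:
        if "=" in part:
            column, _, value = part.partition("=")
            keys[column] = value
    return keys

print(hive_partition_keys("Hierarchy.parquet/date=20220401/part1.parquet"))
# → {'date': '20220401'}
```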

Expected behavior

DuckDB to query partitioned AND unpartitioned parquet files.

Observed behavior

DuckDB fails when querying partitioned parquet files and works with unpartitioned parquet files.

con.execute("Select * from 'Hierarchy.parquet'") fails with RuntimeError: IO Error: No files found that match the pattern "Hierarchy.parquet" when Hierarchy.parquet is partitioned.

Querying the underlying individual data files works fine: con.execute("Select * from 'Hierarchy.parquet/date=20220401/part1.parquet'")

Is there a way to query partitioned parquet files with DuckDB? Or is this a limitation/bug?



Solution 1:[1]

This is discussed as a DuckDB issue/feature request in

"Support hive-style partitioning of parquet archives" https://github.com/duckdb/duckdb/issues/2186.

Suggested workaround is:

  • Open the parquet file as an arrow/pyarrow dataset with a partitioning parameter
  • Register the dataset as a view in DuckDB
  • Query the view
import pyarrow.dataset as ds
import duckdb

con = duckdb.connect()

dataset = ds.dataset("Hierarchy.parquet", format="parquet", partitioning="hive")
con.register("Hierarchy", dataset)
con.execute("Select * from Hierarchy").df()

Direct support by DuckDB would be better... but this works fine and is easy.
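
For reference, the issue linked above has since been addressed: recent DuckDB versions can read hive-partitioned folders directly via a glob pattern together with the hive_partitioning flag (check the documentation for your installed version before relying on this):

```sql
-- Assumes a recent DuckDB: the glob matches the data files inside the
-- partition subfolders, and hive_partitioning exposes `date` as a column.
SELECT * FROM read_parquet('Hierarchy.parquet/*/*.parquet', hive_partitioning=true);
```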

Solution 2:[2]

Our objective here is twofold:

  1. Take advantage of Parquet filters to load part of a dataset corresponding to a partition key.
  2. Use DuckDB to write queries on that filtered dataset.

I created a toy Parquet dataset of city data partitioned on state. The source CSV file looked like this (there are twenty-five rows in total):

rank,city,state,population,2000_2013_growth
1,New York,New York,8405837,4.8%
2,Los Angeles,California,3884307,4.8%
3,Chicago,Illinois,2718782,-6.1%

Import the packages I need:

import duckdb
import pandas as pd
import pyarrow.parquet as pq

My first objective is to get a subset of the partitioned data into something DuckDB can query, in this example a Pandas dataframe:

parquet_file = f'{parquet_folder}/city_data.parquet'  # parquet_folder holds the partitioned dataset
states = ['California', 'Michigan', 'New York']
ds = pq.ParquetDataset(parquet_file, filters=[('state', 'in', states)])
df = ds.read().to_pandas()

Next I'll build a query, a simple example would be:

query = """SELECT * FROM df WHERE state = 'California'"""
results_df = duckdb.query(query).to_df()

I just started working with Parquet and DuckDB, so I claim no expertise and this may not be the best approach. Happy to see others add their suggestion for improving it.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: tomanizer
Solution 2: