'How to filter the parquet dataset on date range

I have dataset of parquet files partitioned based on the year month and then day. The sample dataset is like

source_id loaded_at participant_id partition_day partition_month partition_year
b 2021-01-02 12:00:00 B 2 1 2021
c 2021-01-03 12:30:00 A 3 1 2021
d 2021-01-04 12:45:00 C 4 1 2021
a 2021-01-02 13:00:00 E 2 1 2021
b 021-01-03 13:30:00 A 3 1 2021
c 2021-01-04 13:45:00 B 4 1 2021

What is the best way to filter the data based on data range ?



Solution 1:[1]

Assuming yout loaded_at is a date filed that you used to partition this .

convert the timestamp to unix timstamp ,

    from pyspark.sql.functions import *

you can convert the date string to timestamp unix_timestamp(df["loaded_at"])

and then apply filters

df=df.filter(df["loaded_at"]>='2021-01-01')
 

or

df=df.filter ( (unix_timestamp(df["loaded_at"])>=unix_timestamp('2021-01-01')) && (unix_timestamp(df["loaded_at"])<unix_timestamp('2021-01-05'))

you can also filter based on SQL type

df.filter ( "loaded_at BETWEEN '2021-01-01' AND '2021-01-03' " ) 

Solution 2:[2]

Though i tried multiple approaches, I eventually went with using the below approach

import datetime
from pyspark.sql.functions import col, concat_ws
from pyspark.sql.types import DateType

start_date = datetime.date(2022, 1, 1)
end_date = datetime.date(2022, 3, 31)
num_of_days = (end_date - start_date).days + 1

df = spark.read.parquet("")\
.where(f"make_date(partition_year, partition_month, partition_day) >= \
                '{start_date}'\
            AND make_date(partition_year, partition_month, partition_day) <= \
                '{end_date}'")

This gives a result with no addition overhead.

The details of all approaches is available @ https://sharathmysore.hashnode.dev/date-range-query-on-partitioned-date-columns-on-pyspark

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Vineesh Vijayan
Solution 2 Sharath Chandra