How to filter a Parquet dataset on a date range
I have a dataset of Parquet files partitioned by year, then month, then day. A sample of the dataset looks like this:
| source_id | loaded_at | participant_id | partition_day | partition_month | partition_year |
|---|---|---|---|---|---|
| b | 2021-01-02 12:00:00 | B | 2 | 1 | 2021 |
| c | 2021-01-03 12:30:00 | A | 3 | 1 | 2021 |
| d | 2021-01-04 12:45:00 | C | 4 | 1 | 2021 |
| a | 2021-01-02 13:00:00 | E | 2 | 1 | 2021 |
| b | 2021-01-03 13:30:00 | A | 3 | 1 | 2021 |
| c | 2021-01-04 13:45:00 | B | 4 | 1 | 2021 |
What is the best way to filter the data based on a date range?
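For context, a layout partitioned like the sample above is typically produced with DataFrameWriter.partitionBy. The sketch below is illustrative only; the source DataFrame name and the output path are placeholders, not part of the question:

# Illustrative only: `events_df` and the output path are placeholders.
(events_df
    .write
    .partitionBy("partition_year", "partition_month", "partition_day")
    .mode("overwrite")
    .parquet("/path/to/dataset"))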
Solution 1:[1]
Assuming your loaded_at is a date field that you used to partition this, you can convert the timestamp to a Unix timestamp.
from pyspark.sql.functions import unix_timestamp, lit
You can convert the date string to a timestamp with unix_timestamp(df["loaded_at"]) and then apply filters:
df = df.filter(df["loaded_at"] >= '2021-01-01')
or
df = df.filter(
    (unix_timestamp(df["loaded_at"]) >= unix_timestamp(lit('2021-01-01'), 'yyyy-MM-dd'))
    & (unix_timestamp(df["loaded_at"]) < unix_timestamp(lit('2021-01-05'), 'yyyy-MM-dd'))
)
You can also filter with a SQL-style expression:
df.filter("loaded_at BETWEEN '2021-01-01' AND '2021-01-03'")
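For reference, here is a minimal, self-contained sketch of this approach, assuming the dataset lives at a placeholder path and that loaded_at is a string or timestamp column; the path and the loaded_at_ts name are illustrative, not from the answer:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp

spark = SparkSession.builder.getOrCreate()

# Placeholder path -- replace with the real location of the partitioned dataset.
df = spark.read.parquet("/path/to/dataset")

# Parse loaded_at once into a proper timestamp, then compare against date literals;
# Spark casts the string literals to timestamps for the comparison.
filtered = (
    df.withColumn("loaded_at_ts", to_timestamp(col("loaded_at")))
      .filter(
          (col("loaded_at_ts") >= "2021-01-01")
          & (col("loaded_at_ts") < "2021-01-05")
      )
)

filtered.show()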
Solution 2:[2]
Though I tried multiple approaches, I eventually went with the approach below:
import datetime

start_date = datetime.date(2022, 1, 1)
end_date = datetime.date(2022, 3, 31)
# Number of days in the range (kept from the original; not needed for the filter itself).
num_of_days = (end_date - start_date).days + 1

# make_date() builds a DATE from the three partition columns, so the predicate
# references only partition columns and Spark can use it to prune directories.
df = (
    spark.read.parquet("")
    .where(
        f"make_date(partition_year, partition_month, partition_day) >= '{start_date}' "
        f"AND make_date(partition_year, partition_month, partition_day) <= '{end_date}'"
    )
)
This gives the result with no additional overhead.
The details of all the approaches are available at https://sharathmysore.hashnode.dev/date-range-query-on-partitioned-date-columns-on-pyspark
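One way to sanity-check that such a predicate actually prunes directories rather than scanning the whole dataset is to inspect the physical plan. This is a sketch under the same assumptions as above; the path is a placeholder:

# Sketch: the path is a placeholder.
pruned = (
    spark.read.parquet("/path/to/dataset")
    .where(
        "make_date(partition_year, partition_month, partition_day) "
        "BETWEEN '2022-01-01' AND '2022-03-31'"
    )
)

# The FileScan node of the printed physical plan lists PartitionFilters;
# if the make_date() predicate appears there, only the matching
# year/month/day directories are read.
pruned.explain(True)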
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Vineesh Vijayan |
| Solution 2 | Sharath Chandra |
