Apply a specific filter condition for each group in Spark
I have a data frame as below:
| id | original_date | date1 | date2 | name |
|---|---|---|---|---|
| 1 | 03-30-2022 | 03-29-2022 | 04-02-2022 | John |
| 1 | 03-27-2022 | 03-29-2022 | 04-02-2022 | Mary |
| 2 | 04-01-2022 | 03-29-2022 | 04-02-2022 | Joe |
| 2 | 03-30-2022 | 04-02-2022 | 04-08-2022 | Susan |
| 3 | 04-03-2022 | 04-02-2022 | 04-08-2022 | Mallory |
For each id group, I want to keep only the rows that satisfy date1 < original_date <= date2, giving the following resultant dataframe:
| id | original_date | date1 | date2 | name |
|---|---|---|---|---|
| 1 | 03-30-2022 | 03-29-2022 | 04-02-2022 | John |
| 2 | 04-01-2022 | 03-29-2022 | 04-02-2022 | Joe |
| 3 | 04-03-2022 | 04-02-2022 | 04-08-2022 | Mallory |
How can I do this?
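For reference, the sample frame can be reconstructed like this (a minimal sketch, assuming id is an integer and the three date columns are plain MM-dd-yyyy strings as shown above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; date columns are assumed to be strings here
df = spark.createDataFrame(
    [
        (1, '03-30-2022', '03-29-2022', '04-02-2022', 'John'),
        (1, '03-27-2022', '03-29-2022', '04-02-2022', 'Mary'),
        (2, '04-01-2022', '03-29-2022', '04-02-2022', 'Joe'),
        (2, '03-30-2022', '04-02-2022', '04-08-2022', 'Susan'),
        (3, '04-03-2022', '04-02-2022', '04-08-2022', 'Mallory'),
    ],
    ['id', 'original_date', 'date1', 'date2', 'name'],
)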
Solution 1:[1]
Isn't that just a straightforward filter condition, or am I missing something?
from pyspark.sql import functions as F
(df
    .where(
        (F.col('date1') < F.col('original_date'))
        & (F.col('original_date') <= F.col('date2'))
    )
    .show()
)
+---+-------------+----------+----------+-------+
| id|original_date| date1| date2| name|
+---+-------------+----------+----------+-------+
| 1| 03-30-2022|03-29-2022|04-02-2022| John|
| 2| 04-01-2022|03-29-2022|04-02-2022| Joe|
| 3| 04-03-2022|04-02-2022|04-08-2022|Mallory|
+---+-------------+----------+----------+-------+
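One caveat: if the date columns really are MM-dd-yyyy strings rather than DateType columns, the comparison above is lexicographic and only happens to give the right answer for this sample; it would break across month or year boundaries (for example, '12-31-2021' > '01-01-2022' as strings). A sketch that parses the strings with to_date before filtering, assuming the column names from the question and an MM-dd-yyyy format:
from pyspark.sql import functions as F

# Parse the MM-dd-yyyy strings into real dates, then apply the same filter
parsed = df.select(
    'id', 'name',
    F.to_date(F.col('original_date'), 'MM-dd-yyyy').alias('original_date'),
    F.to_date(F.col('date1'), 'MM-dd-yyyy').alias('date1'),
    F.to_date(F.col('date2'), 'MM-dd-yyyy').alias('date2'),
)

(parsed
    .where(
        (F.col('date1') < F.col('original_date'))
        & (F.col('original_date') <= F.col('date2'))
    )
    .show()
)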
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | pltc |
