How to give multiple conditions in pyspark dataframe filter?
I have to apply a filter with multiple conditions combined with OR on a PySpark dataframe.
I am trying to create a separate dataframe. The Date value must be less than max_date, or Date must be None.
How can I do this?
I tried the three options below, but they all failed:
df.filter(df['Date'] < max_date or df['Date'] == None).createOrReplaceTempView("Final_dataset")
final_df = df.filter(df['Date'] != max_date | df['Date'] is None)
final_df = df.filter(df['Date'] != max_date or df['Date'] is None)
Solution 1:[1]
final_df = df.filter((df.Date < max_date) | (df.Date.isNull()))
Regular Python logical operators (and, or, not) don't work in PySpark conditions; you need to use the bitwise operators (&, |, ~) instead. They can also be a bit tricky: because they bind more tightly than comparisons, you usually need extra parentheses around each condition to disambiguate the expression.
Have a look here: Boolean operators vs Bitwise operators
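For reference, here is a minimal, self-contained sketch of this approach. The sample data, dates, and SparkSession setup are assumed for illustration; only max_date, the Date column, and the temp view name come from the question.

from pyspark.sql import SparkSession
import datetime

spark = SparkSession.builder.getOrCreate()

# Assumed cutoff value and sample rows, including one with a null Date
max_date = datetime.date(2023, 6, 1)
df = spark.createDataFrame(
    [(1, datetime.date(2023, 1, 15)),
     (2, datetime.date(2023, 9, 30)),
     (3, None)],
    ["id", "Date"],
)

# Keep rows whose Date is before max_date OR whose Date is null.
# Each comparison is wrapped in its own parentheses because | binds
# more tightly than < in Python.
final_df = df.filter((df["Date"] < max_date) | (df["Date"].isNull()))
final_df.show()

# Same result registered as a temp view, as in the question
final_df.createOrReplaceTempView("Final_dataset")

Without the inner parentheses, e.g. df["Date"] < max_date | df["Date"].isNull(), Python evaluates the | first, which produces an unintended expression or an error instead of the filter you meant.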
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Gustavo Puma |
