'Filtering a spark dataframe based on date
I have a dataframe of
date, string, string
I want to select dates before a certain period. I have tried the following with no luck
data.filter(data("date") < new java.sql.Date(format.parse("2015-03-14").getTime))
I'm getting an error stating the following
org.apache.spark.sql.AnalysisException: resolved attribute(s) date#75 missing from date#72,uid#73,iid#74 in operator !Filter (date#75 < 16508);
As far as I can guess the query is incorrect. Can anyone show me what way the query should be formatted?
I checked that all enteries in the dataframe have values - they do.
Solution 1:[1]
Don't use this as suggested in other answers
.filter(f.col("dateColumn") < f.lit('2017-11-01'))
But use this instead
.filter(f.col("dateColumn") < f.unix_timestamp(f.lit('2017-11-01 00:00:00')).cast('timestamp'))
This will use the TimestampType instead of the StringType, which will be more performant in some cases. For example Parquet predicate pushdown will only work with the latter.
Solution 2:[2]
I find the most readable way to express this is using a sql expression:
df.filter("my_date < date'2015-01-01'")
we can verify this works correctly by looking at the physical plan from .explain()
+- *(1) Filter (isnotnull(my_date#22) && (my_date#22 < 16436))
Solution 3:[3]
In PySpark(python) one of the option is to have the column in unix_timestamp format.We can convert string to unix_timestamp and specify the format as shown below. Note we need to import unix_timestamp and lit function
from pyspark.sql.functions import unix_timestamp, lit
df.withColumn("tx_date", to_date(unix_timestamp(df_cast["date"], "MM/dd/yyyy").cast("timestamp")))
Now we can apply the filters
df_cast.filter(df_cast["tx_date"] >= lit('2017-01-01')) \
.filter(df_cast["tx_date"] <= lit('2017-01-31')).show()
Solution 4:[4]
df=df.filter(df["columnname"]>='2020-01-13')
Solution 5:[5]
We can also use SQL kind of expression inside filter :
Note -> Here I am showing two conditions and a date range for future reference :
ordersDf.filter("order_status = 'PENDING_PAYMENT' AND order_date BETWEEN '2013-07-01' AND '2013-07-31' ")
Solution 6:[6]
imho it should be like this:
import java.util.Date
import java.util.Calendar
import java.sql.Timestamp
import java.sql.Date
val jDate = Calendar.getInstance().getTime()
val sqlDateTime = new java.sql.Timestamp(jDate.getTime())
val sqlDate = new java.sql.Date(jDate.getTime())
data.filter(data("date").gt(sqlDate))
data.filter(data("date").gt(sqlDateTime))
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ruurtjan Pul |
| Solution 2 | RobinL |
| Solution 3 | |
| Solution 4 | keikai |
| Solution 5 | Abhishek Sengupta |
| Solution 6 | iamcj |
