Filter a spark dataframe with a greater than and a less than of list of dates
I have a dataframe with the fields from_date and to_date:
(2017-01-10 2017-01-14)
(2017-01-03 2017-01-13)
and a List of dates
2017-01-05,
2017-01-12,
2017-01-13,
2017-01-15
The idea is to retrieve from the table all the rows in which the dates in that list fall between from_date and to_date.
Expected output :
the same dataframe, but only the rows whose (from_date, to_date) fall within the range (<= or >=) of the values in the date list. So far I have tried the recommendation from Nikk:
Filter a spark dataframe with a greater than and a less than of list of dates
but I need to compare against the whole list of dates, something like this:
spark.sql("select * from dataframe_table where from_date >= (select date from date_list) AND to_date <= (select date from date_list)")
Solution 1:[1]
Your question is slightly confusing to me, so I have provided code for two scenarios.
1) If you want to filter rows whose dates fall within the range covered by the list you provided (Jan 5th 2017 to Jan 15th 2017), the code snippet below works for that case.
//Create temp views for both the date table and the date list
Seq(("2017-01-10", "2017-01-14"),("2017-01-03","2017-01-13")).toDF("from_date","to_date").withColumn("from_date",'from_date.cast("date")).withColumn("to_date",'to_date.cast("date")).createOrReplaceTempView("date_table")
List("2017-01-05","2017-01-12","2017-01-13","2017-01-15").toDF("list").createOrReplaceTempView("date_list")
spark.sql("select * from date_table where (from_date BETWEEN (select min(cast(list as date)) from date_list) and (select max(cast(list as date)) from date_list)) and (to_date between (select min(cast(list as date)) from date_list) and (select max(cast(list as date)) from date_list))").show()
+----------+----------+
|from_date| to_date|
+----------+----------+
|2017-01-10|2017-01-14|
+----------+----------+
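For reference, the same min/max range filter can also be written with the DataFrame API instead of SQL. This is only a sketch, not part of the original answer; it assumes the spark-shell implicits are in scope and reuses the temp views date_table and date_list registered above.
import org.apache.spark.sql.functions._
//Reuse the temp views created above as DataFrames (sketch only)
val dateTable = spark.table("date_table")
val dateList = spark.table("date_list").select('list.cast("date").as("list"))
//The list is small, so collecting its min/max bounds to the driver is cheap
val bounds = dateList.agg(min("list"), max("list")).head
val (minDate, maxDate) = (bounds.getDate(0), bounds.getDate(1))
//Keep rows whose whole [from_date, to_date] interval lies inside [minDate, maxDate]
dateTable.filter(('from_date >= lit(minDate)) && ('to_date <= lit(maxDate))).show()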
2) Or, if you want to filter out rows whose from_date and to_date are not both in the provided list of dates (that is, keep only rows where both values appear in the list). As per the example data you provided there is no such row, so the result is empty. For that case the code below works.
//Create temp views for both the date table and the date list
Seq(("2017-01-10", "2017-01-14"),("2017-01-03","2017-01-13")).toDF("from_date","to_date").withColumn("from_date",'from_date.cast("date")).withColumn("to_date",'to_date.cast("date")).createOrReplaceTempView("date_table")
List("2017-01-05","2017-01-12","2017-01-13","2017-01-15").toDF("list").createOrReplaceTempView("date_list")
spark.sql("select * from date_table where from_date in (select cast(list as date) from date_list) and to_date in (select cast(list as date) from date_list)").show()
+---------+-------+
|from_date|to_date|
+---------+-------+
+---------+-------+
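A rough DataFrame API equivalent of this exact-membership check (again just a sketch, not from the original answer) uses isin over the date list:
//Cast the list entries to java.sql.Date so the comparison is on dates rather than strings
val dates = List("2017-01-05", "2017-01-12", "2017-01-13", "2017-01-15").map(java.sql.Date.valueOf)
//Keep only rows whose from_date AND to_date both appear in the list (empty for this sample data)
spark.table("date_table").filter('from_date.isin(dates: _*) && 'to_date.isin(dates: _*)).show()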
Please let me know if I have missed anything.
Solution 2:[2]
Please check this:
//Creating a DataFrame with columns from_date and to_date; you can skip this step if you already have a dataframe
scala> val df = Seq(("2017-01-10", "2017-01-14"),("2017-01-03","2017-01-13")).toDF("from_date","to_date").withColumn("from_date", col("from_date").cast("date")).withColumn("to_date",col("to_date").cast("date"))
df: org.apache.spark.sql.DataFrame = [from_date: date, to_date: date]
scala> df.show()
+----------+----------+
| from_date| to_date|
+----------+----------+
|2017-01-10|2017-01-14|
|2017-01-03|2017-01-13|
+----------+----------+
//creating a temporary view for dataframe "df" so that we can use it in Spark SQL.
scala> df.createOrReplaceTempView("dataframe_table")
//Converting the List into a temp view
scala> List("2017-01-05","2017-01-12","2017-01-13","2017-01-15").toDF("list").createOrReplaceTempView("date_list")
//Query to retrieve all rows from the dataframe where from_date and to_date are within the range of the list.
scala> val output = spark.sql("select * from dataframe_table where from_date >= (select min(cast(list as date)) from date_list) AND to_date <= (select max(cast(list as date)) from date_list)")
output: org.apache.spark.sql.DataFrame = [from_date: date, to_date: date]
scala> output.show()
+----------+----------+
| from_date| to_date|
+----------+----------+
|2017-01-10|2017-01-14|
+----------+----------+
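The two scalar subqueries can also be folded into a single cross join against the aggregated bounds of the list; this is only an alternative sketch over the same temp views, not part of the original answer:
scala> val output2 = spark.sql("select t.from_date, t.to_date from dataframe_table t cross join (select min(cast(list as date)) as lo, max(cast(list as date)) as hi from date_list) b where t.from_date >= b.lo and t.to_date <= b.hi")
For the sample data, output2 holds the same single row (2017-01-10, 2017-01-14) as the query above.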
Solution 3:[3]
If you want to compare multiple rows of one table with multiple rows of another table (let's consider your list of dates as a table with a single column), you can use a join on both tables. Typically you would test two columns of the tables for equality. In this case your test is a bit more involved, since you want to compare two columns of the first table with one column of the second. You can use datediff for this:
scala> val df1 = Seq(("2017-01-10", "2017-01-14")).toDF("start_date","end_date").withColumn("start_date",'start_date.cast("date")).withColumn("end_date",'end_date.cast("date"))
df1: org.apache.spark.sql.DataFrame = [start_date: date, end_date: date]
scala> val df2 = Seq("2017-01-5", "2017-01-12","2017-01-13", "2017-01-15").toDF("from_date").withColumn("from_date",'from_date.cast("date"))
df2: org.apache.spark.sql.DataFrame = [from_date: date]
scala> df2.join(df1, (datediff('from_date, 'start_date) > 0) && (datediff('from_date, 'end_date) < 0)).show()
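Since the question asks for inclusive bounds (<= / >=), the same join can also be sketched with plain column comparisons instead of datediff; this variant is an illustration, not part of the original answer:
scala> df2.join(df1, df2("from_date") >= df1("start_date") && df2("from_date") <= df1("end_date")).show()
For the sample data this keeps 2017-01-12 and 2017-01-13, each paired with the (2017-01-10, 2017-01-14) interval.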
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Nikhil Suthar |
| Solution 2 | Nikhil Suthar |
| Solution 3 | moe |
