Find number of null records between two non-null records in a Scala DataFrame
I have a dataframe as shown below.
| ID  | date       | sig01_diff |
+-----+------------+------------+
| 123 | 2019-11-04 |      93668 |
| 123 | 2019-11-05 |      49350 |
| 123 | 2019-11-07 |       null |
| 123 | 2019-11-08 |      11069 |
| 123 | 2019-11-09 |      33203 |
| 123 | 2019-11-11 |      47927 |
| 123 | 2020-01-21 |       null |
| 123 | 2020-01-22 |       null |
| 123 | 2020-01-23 |      33908 |
| 123 | 2020-01-24 |      61603 |
| 123 | 2020-01-27 |      33613 |
| 123 | 2020-01-28 |      27514 |
| 123 | 2020-01-29 |       null |
| 123 | 2020-01-30 |       null |
| 123 | 2020-02-11 |       null |
| 123 | 2020-02-12 |       null |
| 123 | 2020-02-13 |       null |
| 123 | 2020-02-14 |       null |
| 123 | 2020-02-15 |      65625 |
| 123 | 2020-02-17 |      13354 |
| 123 | 2020-02-18 |       null |
| 123 | 2020-02-19 |      69069 |
+-----+------------+------------+
I have to get the number of null records preceding each record, as shown below.
| ID  | date       | sig01_diff | null_count |
+-----+------------+------------+------------+
| 123 | 2019-11-04 |      93668 |         00 |
| 123 | 2019-11-05 |      49350 |         00 |
| 123 | 2019-11-07 |       null |         00 |
| 123 | 2019-11-08 |      11069 |         01 |
| 123 | 2019-11-09 |      33203 |         00 |
| 123 | 2019-11-11 |      47927 |         00 |
| 123 | 2020-01-21 |       null |         00 |
| 123 | 2020-01-22 |       null |         00 |
| 123 | 2020-01-23 |      33908 |         02 |
| 123 | 2020-01-24 |      61603 |         00 |
| 123 | 2020-01-27 |      33613 |         00 |
| 123 | 2020-01-28 |      27514 |         00 |
| 123 | 2020-01-29 |       null |         00 |
| 123 | 2020-01-30 |       null |         00 |
| 123 | 2020-02-11 |       null |         00 |
| 123 | 2020-02-12 |       null |         00 |
| 123 | 2020-02-13 |       null |         00 |
| 123 | 2020-02-14 |       null |         00 |
| 123 | 2020-02-15 |      65625 |         06 |
| 123 | 2020-02-17 |      13354 |         00 |
| 123 | 2020-02-18 |       null |         00 |
| 123 | 2020-02-19 |      69069 |         01 |
+-----+------------+------------+------------+
As shown above, the new column holds the count of null records immediately preceding that record, for example for these dates:
2019-11-08
2020-02-15
Using a window function with unboundedPreceding, I am able to find a cumulative count of null records within a window. But my requirement is the count of null records between two consecutive non-null records within a window.
How could I achieve this? Any leads appreciated!
Solution 1:[1]
You should use a window function with rowsBetween, using Window.currentRow - 1 as the upper bound. Here is an example:
import org.apache.spark.sql.functions.{count, when, col}
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = List((1, Some(1)), (2, None), (3, Some(3)), (4, None), (5, Some(4)))
  .toDF("id", "nullable_value")
val w = Window
  .orderBy(col("id"))
  .rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)
// count() ignores nulls, so this counts only the preceding rows
// whose nullable_value is null
df.withColumn("null_count", count(when(col("nullable_value").isNull, 1)).over(w)).show()
+---+--------------+----------+
| id|nullable_value|null_count|
+---+--------------+----------+
|  1|             1|         0|
|  2|          null|         0|
|  3|             3|         1|
|  4|          null|         1|
|  5|             4|         2|
+---+--------------+----------+
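Note that the example above produces a running total of nulls; it does not reset at each non-null value the way the expected output does. One way to get the reset behaviour is to build a group key from a forward-looking count of non-null values, so that every run of nulls shares a key with the non-null row that ends the run. The following is a sketch of that idea, assuming Spark 2.4+ and reusing the ID/date/sig01_diff column names from the question (only the first nine sample rows are reproduced here):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit, when}
import org.apache.spark.sql.expressions.Window

val spark = SparkSession.builder.master("local[1]").appName("null-runs").getOrCreate()
import spark.implicits._

val df = List(
  (123, "2019-11-04", Some(93668)),
  (123, "2019-11-05", Some(49350)),
  (123, "2019-11-07", None),
  (123, "2019-11-08", Some(11069)),
  (123, "2019-11-09", Some(33203)),
  (123, "2019-11-11", Some(47927)),
  (123, "2020-01-21", None),
  (123, "2020-01-22", None),
  (123, "2020-01-23", Some(33908))
).toDF("ID", "date", "sig01_diff")

// Count non-null values from the current row to the end of the partition.
// Every null row shares this count with the first non-null row after it,
// so the count works as a group key for each "run of nulls + closing value".
val wFwd = Window.partitionBy("ID").orderBy("date")
  .rowsBetween(Window.currentRow, Window.unboundedFollowing)
val grouped = df.withColumn("grp",
  count(when(col("sig01_diff").isNotNull, 1)).over(wFwd))

// Within each (ID, grp) group, count the nulls and assign the total to the
// non-null row that closes the run; the null rows themselves get 0.
val wGrp = Window.partitionBy("ID", "grp")
val result = grouped.withColumn("null_count",
  when(col("sig01_diff").isNotNull,
    count(when(col("sig01_diff").isNull, 1)).over(wGrp))
    .otherwise(lit(0L)))
  .drop("grp")

// 2019-11-08 gets null_count 1 and 2020-01-23 gets 2, matching the question
result.orderBy("date").show()
```

Ordering by the string date column works here only because the dates are ISO-formatted; on real data you would cast it to DateType first.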
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Gabio |

