Find number of null records between two non-null records in a Scala DataFrame

I have a dataframe as shown below.

|  ID | date       | sig01_diff |
+-----+------------+------------+
| 123 | 2019-11-04 | 93668      |
| 123 | 2019-11-05 | 49350      |
| 123 | 2019-11-07 | null       |
| 123 | 2019-11-08 | 11069      |
| 123 | 2019-11-09 | 33203      |
| 123 | 2019-11-11 | 47927      |
| 123 | 2020-01-21 | null       |
| 123 | 2020-01-22 | null       |
| 123 | 2020-01-23 | 33908      |
| 123 | 2020-01-24 | 61603      |
| 123 | 2020-01-27 | 33613      |
| 123 | 2020-01-28 | 27514      |
| 123 | 2020-01-29 | null       |
| 123 | 2020-01-30 | null       |
| 123 | 2020-02-11 | null       |
| 123 | 2020-02-12 | null       |
| 123 | 2020-02-13 | null       |
| 123 | 2020-02-14 | null       |
| 123 | 2020-02-15 | 65625      |
| 123 | 2020-02-17 | 13354      |
| 123 | 2020-02-18 | null       |
| 123 | 2020-02-19 | 69069      |
+-----+------------+------------+

I need to get the number of null records preceding each non-null record, as shown below.

|  ID | date       | sig01_diff |null_count |
+-----+------------+------------+-----------+
| 123 | 2019-11-04 | 93668      | 00        |
| 123 | 2019-11-05 | 49350      | 00        |
| 123 | 2019-11-07 | null       | 00        |
| 123 | 2019-11-08 | 11069      | 01        |
| 123 | 2019-11-09 | 33203      | 00        |
| 123 | 2019-11-11 | 47927      | 00        |
| 123 | 2020-01-21 | null       | 00        |
| 123 | 2020-01-22 | null       | 00        |
| 123 | 2020-01-23 | 33908      | 02        |
| 123 | 2020-01-24 | 61603      | 00        |
| 123 | 2020-01-27 | 33613      | 00        |
| 123 | 2020-01-28 | 27514      | 00        |
| 123 | 2020-01-29 | null       | 00        |
| 123 | 2020-01-30 | null       | 00        |
| 123 | 2020-02-11 | null       | 00        |
| 123 | 2020-02-12 | null       | 00        |
| 123 | 2020-02-13 | null       | 00        |
| 123 | 2020-02-14 | null       | 00        |
| 123 | 2020-02-15 | 65625      | 06        |
| 123 | 2020-02-17 | 13354      | 00        |
| 123 | 2020-02-18 | null       | 00        |
| 123 | 2020-02-19 | 69069      | 01        |
+-----+------------+------------+-----------+

As shown above, the new column holds the count of null records immediately preceding that record, for example for the dates below:

2019-11-08
2020-02-15

Using a window function with unboundedPreceding, I am able to find the cumulative count of null records within a window. But my requirement is the count of null records between two non-null records within that window.


How could I achieve this? Any leads appreciated!



Solution 1:[1]

You can use a window function whose frame is defined with rowsBetween, taking Window.currentRow - 1 as the upper bound so the current row is excluded from the count. Here is an example:

import org.apache.spark.sql.functions.{count, when, col}
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val df = List((1, Some(1)), (2, None), (3, Some(3)), (4, None), (5, Some(4))).toDF("id", "nullable_value")

val w = Window
  .orderBy(col("id"))
  .rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)

// count(when(...isNull, 1)) counts only the rows where the value is null
df.withColumn("null_count", count(when(col("nullable_value").isNull, 1)).over(w)).show()

+---+--------------+----------+
| id|nullable_value|null_count|
+---+--------------+----------+
|  1|             1|         0|
|  2|          null|         0|
|  3|             3|         1|
|  4|          null|         1|
|  5|             4|         2|
+---+--------------+----------+
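Note that this yields a cumulative count of all preceding nulls, not the count since the previous non-null record that the question asks for. One possible extension (a sketch, not part of the original answer) is to first bucket each row by the number of non-null values seen strictly before it; a run of nulls and the non-null row that ends the run then land in the same bucket, and the nulls can be counted per bucket:

```scala
import org.apache.spark.sql.functions.{coalesce, col, count, lit, sum, when}
import org.apache.spark.sql.expressions.Window

import spark.implicits._

val df = List((1, Some(1)), (2, None), (3, Some(3)), (4, None), (5, Some(4)))
  .toDF("id", "nullable_value")

// grp = number of non-null values strictly before the current row.
// A null run and the non-null row that closes it share the same grp.
// (On real data, also partition these windows by ID and order by date.)
val grpWindow = Window
  .orderBy(col("id"))
  .rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)

val withGrp = df.withColumn(
  "grp",
  coalesce( // the frame is empty on the first row, so default to 0
    sum(when(col("nullable_value").isNotNull, 1).otherwise(0)).over(grpWindow),
    lit(0)))

// Count the nulls in each bucket, but report the count only on the
// non-null row that closes the bucket; null rows get 0, as in the
// expected output above.
val nullsInGrp = count(when(col("nullable_value").isNull, 1))
  .over(Window.partitionBy(col("grp")))

withGrp
  .withColumn("null_count",
    when(col("nullable_value").isNotNull, nullsInGrp).otherwise(lit(0)))
  .drop("grp")
  .orderBy("id")
  .show()
```

With this toy data the non-null rows with id 3 and 5 each get a null_count of 1 (one null since the previous non-null value), and the null rows get 0, matching the reset-per-run semantics of the question's expected output.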

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Gabio