How to get the occurrence rate of specific values with Apache Spark

I have a DataFrame of raw data like this (a snippet to rebuild this sample follows the table):

+-----------+--------------------+------+
|device     | timestamp          | value|
+-----------+--------------------+------+
|   device_A|2022-01-01 18:00:01 |   100|
|   device_A|2022-01-01 18:00:02 |    99|
|   device_A|2022-01-01 18:00:03 |   100|
|   device_A|2022-01-01 18:00:04 |   102|
|   device_A|2022-01-01 18:00:05 |   100|
|   device_A|2022-01-01 18:00:06 |    99|
|   device_A|2022-01-01 18:00:11 |    98|
|   device_A|2022-01-01 18:00:12 |   100|
|   device_A|2022-01-01 18:00:13 |   100|
|   device_A|2022-01-01 18:00:15 |   101|
|   device_A|2022-01-01 18:00:17 |   101|

I'd like to aggregate it into 10-second windows and build a result like this:

+-----------+--------------------+------------+-------+
|device     | windowtime         |      values| counts|
+-----------+--------------------+------------+-------+
|   device_A|2022-01-01 18:00:00 |[99,100,102]|[2,3,1]|
|   device_A|2022-01-01 18:00:10 |[98,100,101]|[1,2,2]|

The goal is to plot a heat map of the values later.

I have succeeded in getting the values column, but it's not clear to me how to calculate the corresponding counts:

df.withColumn("values", collect_list(col("value")).over(Window.partitionBy($"device").orderBy($"timestamp".desc)))

How can I do the weighted list aggregation in Apache Spark?
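
One possible approach (a sketch only, not necessarily the accepted answer; it assumes the raw DataFrame is named df and that timestamp is a TimestampType column) is to drop the Window function and instead count each distinct value per device and 10-second window with groupBy plus the window() function, then collect the (value, count) pairs into parallel arrays:

import org.apache.spark.sql.functions._

// Count how often each value occurs per device and 10-second window.
val perValueCounts = df
  .groupBy(col("device"), window(col("timestamp"), "10 seconds"), col("value"))
  .count()

// Collect the (value, count) pairs per window, sorted by value, then split
// them back into two parallel arrays for the heat map.
val result = perValueCounts
  .groupBy(col("device"), col("window"))
  .agg(sort_array(collect_list(struct(col("value"), col("count")))).as("pairs"))
  .select(
    col("device"),
    col("window.start").as("windowtime"),
    col("pairs.value").as("values"),
    col("pairs.count").as("counts"))
  .orderBy("device", "windowtime")

Collecting the pairs as structs first keeps values and counts aligned with each other, and sort_array orders the pairs by value, which matches the ordering shown in the expected output above.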



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
