How to get the occurrence rate of specific values with Apache Spark
I have a raw data DataFrame like this:
+-----------+--------------------+------+
|device | timestamp | value|
+-----------+--------------------+------+
| device_A|2022-01-01 18:00:01 | 100|
| device_A|2022-01-01 18:00:02 | 99|
| device_A|2022-01-01 18:00:03 | 100|
| device_A|2022-01-01 18:00:04 | 102|
| device_A|2022-01-01 18:00:05 | 100|
| device_A|2022-01-01 18:00:06 | 99|
| device_A|2022-01-01 18:00:11 | 98|
| device_A|2022-01-01 18:00:12 | 100|
| device_A|2022-01-01 18:00:13 | 100|
| device_A|2022-01-01 18:00:15 | 101|
| device_A|2022-01-01 18:00:17 | 101|
+-----------+--------------------+------+
I'd like to aggregate the rows into 10-second windows, listing the distinct values and how often each one occurs, like this:
+-----------+--------------------+------------+-------+
|device | windowtime | values| counts|
+-----------+--------------------+------------+-------+
| device_A|2022-01-01 18:00:00 |[99,100,102]|[1,3,1]|
| device_A|2022-01-01 18:00:10 |[98,100,101]|[1,2,2]|
+-----------+--------------------+------------+-------+
The goal is to plot a heat-map of the values later.
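For the windowtime column, I'm assuming Spark's built-in window function can bucket the timestamps into 10-second tumbling windows. A rough sketch of what I mean (rawDf is just a placeholder name for the DataFrame above, and timestamp needs to be a TimestampType column):

import org.apache.spark.sql.functions.window
import spark.implicits._   // assumes a SparkSession named `spark`

// Assign each row to a 10-second tumbling window and keep the window start
// as the "windowtime" column.
val bucketed = rawDf
  .withColumn("win", window($"timestamp", "10 seconds"))
  .withColumn("windowtime", $"win.start")
  .drop("win")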
I have succeeded in getting the values column, but it's not clear to me how to calculate the corresponding counts:
.withColumn("values",collect_list(col("value")).over(Window.partitionBy($"device").orderBy($"timestamp".desc)))
How can I do this kind of weighted list aggregation in Apache Spark?
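The only direction I've come up with is a two-step aggregation: first count each value per device and window, then collect the values and counts into parallel arrays. An untested sketch of the idea (rawDf, perValueCounts, win, cnt are placeholder names I made up):

import org.apache.spark.sql.functions.{collect_list, count, sort_array, struct, window}
import spark.implicits._   // assumes a SparkSession named `spark`

// Step 1: count how many times each value occurs per device and 10-second window.
val perValueCounts = rawDf
  .groupBy($"device", window($"timestamp", "10 seconds").as("win"), $"value")
  .agg(count("*").as("cnt"))

// Step 2: collect (value, cnt) pairs per window, sort them by value so the
// arrays come out in a deterministic order, then split the struct array into
// parallel values/counts columns.
val result = perValueCounts
  .groupBy($"device", $"win.start".as("windowtime"))
  .agg(sort_array(collect_list(struct($"value", $"cnt"))).as("pairs"))
  .select(
    $"device",
    $"windowtime",
    $"pairs.value".as("values"),
    $"pairs.cnt".as("counts"))

Is a two-step aggregation like this the right direction, or is there a cleaner way to get the values and counts arrays directly?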