'Better/Efficient way to filter out Spark Dataframe rows with multiple conditions
I have a dataframe look like this below
id pub_date version unique_id c_id p_id type source
lni001 20220301 1 64WP-UI-POLI 002 P02 org internet
lni001 20220301 1 64WP-UI-POLI 002 P02 org internet
lni001 20220301 1 64WP-UI-POLI 002 P02 org internet
lni001 20220301 2 64WP-UI-CFGT 012 K21 location internet
lni001 20220301 2 64WP-UI-CFGT 012 K21 location internet
lni001 20220301 3 64WP-UI-CFGT 012 K21 location internet
lni001 20220301 3 64WP-UI-POLI 002 P02 org internet
lni001 20220301 85 64WP-UI-POLI 002 P02 org internet
lni001 20220301 85 64WP-UI-POLI 002 P02 org internet
lni001 20220301 5 64WP-UI-CFGT 012 K21 location internet
lni002 20220301 1 64WP-UI-CFGT 012 K21 location internet
::
::
I want to groupby id column and only keep the highest number from version column but here is a catch, I also need to take into consideration for the type column (which only have two types, org or location). The final dataframe will look like this below
id pub_date version unique_id c_id p_id type source
lni001 20220301 85 64WP-UI-POLI 002 P02 org internet
lni001 20220301 85 64WP-UI-POLI 002 P02 org internet
lni001 20220301 5 64WP-UI-CFGT 012 K21 location internet
lni002 20220301 14 64WP-UI-CFGT 012 K21 location internet
::
::
My current approach is separate the dataframe into two different ones, first one is org under type column, the other one is location under type column.Then I am using groupby, withColumn but my dataframe is huge. And I am wondering are there more efficient ways to do this maybe in one line of code? Rather than need to separate them into two dataframe then merger them back together?
Thanks!
Solution 1:[1]
dense_rank() can be used to find out top versions based on id & type. This can be used to retain only the top record in each group.
input.withColumn("rank", dense_rank() over (Window.partitionBy($"id",$"type").orderBy($"version".desc)))
.filter($"rank" === 1)
.drop($"rank")
Output:
+------+--------+-------+------------+---+----+--------+--------+
|id |pub_date|version|unique_id |_id|p_id|type |source |
+------+--------+-------+------------+---+----+--------+--------+
|lni001|20220301|5 |64WP-UI-CFGT|012|K21 |location|internet|
|lni001|20220301|85 |64WP-UI-POLI|002|P02 |org |internet|
|lni001|20220301|85 |64WP-UI-POLI|002|P02 |org |internet|
|lni002|20220301|1 |64WP-UI-CFGT|012|K21 |location|internet|
+------+--------+-------+------------+---+----+--------+--------+
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | vdep |
