Iterate over a Spark DataFrame and create a nested JSON string
I have a DataFrame with the details below, and I need to iterate over each row to create a nested JSON string in the format shown below.
| Employee name | Salary | sick_leave_day | paid_leave_day |
|---|---|---|---|
| Karthi | 20000 | 10-10-2021 | Y |
| Karthi | 20000 | 11-12-2021 | Y |
| Karthi | 20000 | 13-12-2021 | N |
| Rajesh | 25000 | 15-12-2021 | Y |
| Rajesh | 25000 | 17-11-2021 | N |
| Rajesh | 25000 | 10-10-2021 | Y |
The output JSON should look like the example below. Name and Salary are unique per employee, while sick_leave_day and paid_leave_day change for each row.
{
  "Name": "Karthi",
  "Salary": "20000",
  "leave_details": [
    {
      "sick_leave_day": "10-10-2021",
      "paid_leave_day": "Y"
    },
    {
      "sick_leave_day": "11-12-2021",
      "paid_leave_day": "Y"
    },
    {
      "sick_leave_day": "13-12-2021",
      "paid_leave_day": "N"
    }
  ]
}
I have tried iterating over the DataFrame with the foreach method, but I can't work out the logic. If I group by name and salary, I can't emit the name and salary only once. Please help me write the logic for building this structure by iterating over each row. Thanks in advance.
Solution 1:[1]
You can use the collect_list aggregation function to group by employee and gather the list of leave days.
For instance, you can manually build a JSON-like string from the two leave columns, then aggregate with collect_list:
import org.apache.spark.sql.functions._

val result = df
  .withColumn("new_col", concat(lit("{ sick_leave_day:"), col("sick_leave_day"), lit(", paid_leave_day:"), col("paid_leave_day"), lit("}")))
  .groupBy("Employee name", "Salary")
  .agg(collect_list("new_col"))

result.show(truncate = false)
+-------------+------+------------------------------------------------------------------------------------------------------------------------------------------------+
|Employee name|Salary|collect_list(new_col) |
+-------------+------+------------------------------------------------------------------------------------------------------------------------------------------------+
|Rajesh |25000 |[{ sick_leave_day:15-12-2021, paid_leave_day:Y}, { sick_leave_day:17-11-2021, paid_leave_day:N}, { sick_leave_day:10-10-2021, paid_leave_day:Y}]|
|Karthi |20000 |[{ sick_leave_day:10-10-2021, paid_leave_day:Y}, { sick_leave_day:11-12-2021, paid_leave_day:Y}, { sick_leave_day:13-12-2021, paid_leave_day:N}]|
+-------------+------+------------------------------------------------------------------------------------------------------------------------------------------------+
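Note that the concatenated string above is JSON-like but not strictly valid JSON (the keys are unquoted). To see the grouping logic in isolation, here is a minimal driver-side sketch of the same idea using plain Scala collections, with no Spark required; the EmployeeRow case class and sample data are hypothetical stand-ins for the DataFrame rows:

```scala
// Hypothetical driver-side sketch of Solution 1's logic: build the same
// "{ sick_leave_day:..., paid_leave_day:... }" string per row, then group
// by (name, salary) and collect the strings, mirroring concat + collect_list.
case class EmployeeRow(name: String, salary: Int, sickLeaveDay: String, paidLeaveDay: String)

val rows = Seq(
  EmployeeRow("Karthi", 20000, "10-10-2021", "Y"),
  EmployeeRow("Karthi", 20000, "11-12-2021", "Y"),
  EmployeeRow("Karthi", 20000, "13-12-2021", "N"),
  EmployeeRow("Rajesh", 25000, "15-12-2021", "Y")
)

// One entry per (name, salary) key; values keep the original row order
val grouped: Map[(String, Int), Seq[String]] =
  rows.groupBy(r => (r.name, r.salary)).map { case (key, rs) =>
    key -> rs.map(r => s"{ sick_leave_day:${r.sickLeaveDay}, paid_leave_day:${r.paidLeaveDay}}")
  }

grouped.foreach { case ((name, salary), leaves) =>
  println(s"$name | $salary | ${leaves.mkString("[", ", ", "]")}")
}
```

On a real cluster you would stay inside Spark as shown above rather than collecting rows to the driver; this sketch only illustrates the group-and-collect shape of the computation.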
Solution 2:[2]
Group by Employee name, collect a list of structs for the leave_details array, take the first (or max) Salary, then save as JSON:
import org.apache.spark.sql.functions._
import spark.implicits._

val result = df.groupBy(col("Employee name").as("name")).agg(
  first($"Salary").as("Salary"),
  collect_list(struct($"sick_leave_day", $"paid_leave_day")).as("leave_details")
)

result.toJSON.collect().foreach(println)
//{"name":"Rajesh","Salary":"25000","leave_details":[{"sick_leave_day":"15-12-2021","paid_leave_day":"Y"},{"sick_leave_day":"17-11-2021","paid_leave_day":"N"},{"sick_leave_day":"10-10-2021","paid_leave_day":"Y"}]}
//{"name":"Karthi","Salary":"20000","leave_details":[{"sick_leave_day":"10-10-2021","paid_leave_day":"Y"},{"sick_leave_day":"11-12-2021","paid_leave_day":"Y"},{"sick_leave_day":"13-12-2021","paid_leave_day":"N"}]}
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Fabich |
| Solution 2 | blackbishop |
