Iterate over a Spark DataFrame and create a nested JSON string
I have a DataFrame with the details below, and I need to iterate over each row to create a nested JSON string in the format shown below.
| Employee name | Salary | sick_leave_day | paid_leave_day |
|---|---|---|---|
| Karthi | 20000 | 10-10-2021 | Y |
| Karthi | 20000 | 11-12-2021 | Y |
| Karthi | 20000 | 13-12-2021 | N |
| Rajesh | 25000 | 15-12-2021 | Y |
| Rajesh | 25000 | 17-11-2021 | N |
| Rajesh | 25000 | 10-10-2021 | Y |
The output JSON should look like the example below. Name and Salary are unique per employee, while sick_leave_day and paid_leave_day change for each row.
{
  "Name": "Karthi",
  "Salary": "20000",
  "leave_details": [
    {
      "sick_leave_day": "10-10-2021",
      "paid_leave_day": "Y"
    },
    {
      "sick_leave_day": "11-12-2021",
      "paid_leave_day": "Y"
    },
    {
      "sick_leave_day": "13-12-2021",
      "paid_leave_day": "N"
    }
  ]
}
I have tried iterating over the DataFrame with the foreach method, but I can't work out the logic. If I group by name and salary, I can't emit the name and salary only once. Please help me write the logic for building this structure by iterating over each row. Thanks in advance.
Solution 1:[1]
You can use the collect_list aggregation function to group by employee and gather the list of leave days.
For instance, you can manually build a JSON-like string from the two leave columns, then aggregate with collect_list:
import org.apache.spark.sql.functions._

val result = df
  .withColumn("new_col", concat(lit("{ sick_leave_day:"), col("sick_leave_day"), lit(", paid_leave_day:"), col("paid_leave_day"), lit("}")))
  .groupBy("Employee name", "Salary")
  .agg(collect_list("new_col"))

result.show(truncate = false)
+-------------+------+------------------------------------------------------------------------------------------------------------------------------------------------+
|Employee name|Salary|collect_list(new_col) |
+-------------+------+------------------------------------------------------------------------------------------------------------------------------------------------+
|Rajesh |25000 |[{ sick_leave_day:15-12-2021, paid_leave_day:Y}, { sick_leave_day:17-11-2021, paid_leave_day:N}, { sick_leave_day:10-10-2021, paid_leave_day:Y}]|
|Karthi |20000 |[{ sick_leave_day:10-10-2021, paid_leave_day:Y}, { sick_leave_day:11-12-2021, paid_leave_day:Y}, { sick_leave_day:13-12-2021, paid_leave_day:N}]|
+-------------+------+------------------------------------------------------------------------------------------------------------------------------------------------+
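Note that the concatenated string above is JSON-like but not strictly valid JSON (the keys are unquoted). To see the grouping logic in isolation, here is a minimal driver-side sketch of the same idea using plain Scala collections, with no Spark required; the EmployeeRow case class and sample data are hypothetical stand-ins for the DataFrame rows:

```scala
// Hypothetical driver-side sketch of Solution 1's logic: build the same
// "{ sick_leave_day:..., paid_leave_day:... }" string per row, then group
// by (name, salary) and collect the strings, mirroring concat + collect_list.
case class EmployeeRow(name: String, salary: Int, sickLeaveDay: String, paidLeaveDay: String)

val rows = Seq(
  EmployeeRow("Karthi", 20000, "10-10-2021", "Y"),
  EmployeeRow("Karthi", 20000, "11-12-2021", "Y"),
  EmployeeRow("Karthi", 20000, "13-12-2021", "N"),
  EmployeeRow("Rajesh", 25000, "15-12-2021", "Y")
)

// One entry per (name, salary) key; values keep the original row order
val grouped: Map[(String, Int), Seq[String]] =
  rows.groupBy(r => (r.name, r.salary)).map { case (key, rs) =>
    key -> rs.map(r => s"{ sick_leave_day:${r.sickLeaveDay}, paid_leave_day:${r.paidLeaveDay}}")
  }

grouped.foreach { case ((name, salary), leaves) =>
  println(s"$name | $salary | ${leaves.mkString("[", ", ", "]")}")
}
```

On a real cluster you would stay inside Spark as shown above rather than collecting rows to the driver; this sketch only illustrates the group-and-collect shape of the computation.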
Solution 2:[2]
Group by Employee name, collect a list of structs for the leave_details array, take the first (or max) Salary, then save as JSON:
import org.apache.spark.sql.functions._
import spark.implicits._

val result = df.groupBy(col("Employee name").as("name")).agg(
  first($"Salary").as("Salary"),
  collect_list(struct($"sick_leave_day", $"paid_leave_day")).as("leave_details")
)

result.toJSON.collect().foreach(println)
//{"name":"Rajesh","Salary":"25000","leave_details":[{"sick_leave_day":"15-12-2021","paid_leave_day":"Y"},{"sick_leave_day":"17-11-2021","paid_leave_day":"N"},{"sick_leave_day":"10-10-2021","paid_leave_day":"Y"}]}
//{"name":"Karthi","Salary":"20000","leave_details":[{"sick_leave_day":"10-10-2021","paid_leave_day":"Y"},{"sick_leave_day":"11-12-2021","paid_leave_day":"Y"},{"sick_leave_day":"13-12-2021","paid_leave_day":"N"}]}
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Fabich |
| Solution 2 | blackbishop |
