Conversion of a PySpark DataFrame to nested dictionaries

I have a PySpark DataFrame and I need to convert it to the dictionary format shown below.

Here is the PySpark DataFrame:

from pyspark.sql import Row, SparkSession

# Create (or reuse) a SparkSession; the original snippet assumed `spark` already existed.
spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame([Row(name='harvest bowl', tenure='6+', count=4),
                              Row(name='harvest bowl', tenure='6-Mar', count=1),
                              Row(name='harvest bowl', tenure='2-Jan', count=5),
                              Row(name='fish taco', tenure='6+', count=1)])

data.show()

table output:

+------------+------+-----+
|        name|tenure|count|
+------------+------+-----+
|harvest bowl|    6+|    4|
|   fish taco|    6+|    1|
|harvest bowl| 6-Mar|    1|
|harvest bowl| 2-Jan|    5|
+------------+------+-----+

I would like to convert the above PySpark DataFrame into the format below:

{'fish taco': {'TENURE': {'6+': 1.0}}, 'harvest bowl': {'TENURE': {'6+': 4, '6-Mar': 1, '2-Jan': 5}}}

Can anyone tell me how to do this in PySpark?



Solution 1:[1]

You can use map_from_arrays and collect_list from pyspark.sql.functions to build a map column per name, then convert to pandas.

from pyspark.sql import functions as F

# Aggregate each name's (tenure, count) pairs into a single map column.
pdf = (data.groupby('name')
       .agg(F.map_from_arrays(F.collect_list('tenure'), F.collect_list('count'))
            .alias('tenure'))
       .toPandas())

pdf
#         name                             tenure
# harvest bowl  {'6+': 4, '6-Mar': 1, '2-Jan': 5}
#    fish taco                          {'6+': 1}

Then use pandas' to_dict with orient='index' to get the nested dictionary.

pdf.set_index('name').to_dict(orient='index')

# {'harvest bowl': {'tenure': {'6+': 4, '6-Mar': 1, '2-Jan': 5}},
#  'fish taco': {'tenure': {'6+': 1}}}
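
If you prefer to stay entirely in PySpark without the pandas round-trip, you can collect the aggregated rows and build the dictionary in plain Python. This is only a sketch reusing the same aggregation; aliasing the map column as 'TENURE' is just to match the key casing in the desired output above and is not part of the original answer.

from pyspark.sql import functions as F

rows = (data.groupby('name')
        .agg(F.map_from_arrays(F.collect_list('tenure'), F.collect_list('count'))
             .alias('TENURE'))   # 'TENURE' only to match the casing in the desired output
        .collect())

# Each collected Row exposes the MapType column as a plain Python dict.
result = {row['name']: {'TENURE': row['TENURE']} for row in rows}

# {'harvest bowl': {'TENURE': {'6+': 4, '6-Mar': 1, '2-Jan': 5}},
#  'fish taco': {'TENURE': {'6+': 1}}}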

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Emma