Conversion of a PySpark DataFrame to nested dictionaries
I have a PySpark DataFrame and I need to convert it to the dictionary format shown below.
Here is the PySpark DataFrame:
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame([Row(name='harvest bowl', tenure='6+', count=4),
                              Row(name='harvest bowl', tenure='6-Mar', count=1),
                              Row(name='harvest bowl', tenure='2-Jan', count=5),
                              Row(name='fish taco', tenure='6+', count=1)])
data.show()
Table output:
+------------+------+-----+
| NAME|TENURE|count|
+------------+------+-----+
|harvest bowl| 6+| 4|
| fish taco| 6+| 1|
|harvest bowl| 6-Mar| 1|
|harvest bowl| 2-Jan| 5|
+------------+------+-----+
I would like to convert the above PySpark DataFrame into the format below:
{'fish taco': {'TENURE': {'6+': 1.0}}, 'harvest bowl': {'TENURE': {'6+': 4, '6-Mar': 1, '2-Jan': 5}}}
Can anyone tell me how to do this in PySpark?
Solution 1:[1]
You can use map_from_arrays and collect_list.
import pyspark.sql.functions as F

pdf = (data.groupby('name')
           .agg(F.map_from_arrays(F.collect_list('tenure'),
                                  F.collect_list('count')).alias('tenure'))
           .toPandas())
pdf
#            name                             tenure
# 0  harvest bowl  {'6+': 4, '6-Mar': 1, '2-Jan': 5}
# 1     fish taco                          {'6+': 1}
Then use Pandas to_dict to get the dictionary.
pdf.set_index('name').to_dict(orient='index')
# {'harvest bowl': {'tenure': {'6+': 4, '6-Mar': 1, '2-Jan': 5}},
# 'fish taco': {'tenure': {'6+': 1}}}
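If you prefer to skip the pandas round trip, a minimal sketch of the same idea is to collect the aggregated rows and build the nested dictionary in plain Python (the variable names agg and result here are my own, not from the original answer):

import pyspark.sql.functions as F

# Aggregate each name's (tenure, count) pairs into a MapType column,
# then collect the rows to the driver and nest them into a dict.
agg = (data.groupby('name')
           .agg(F.map_from_arrays(F.collect_list('tenure'),
                                  F.collect_list('count')).alias('tenure')))

result = {row['name']: {'tenure': row['tenure']} for row in agg.collect()}
# {'harvest bowl': {'tenure': {'6+': 4, '6-Mar': 1, '2-Jan': 5}},
#  'fish taco': {'tenure': {'6+': 1}}}

Both routes collect the full result to the driver, so they are only appropriate when the aggregated data is small enough to fit in memory.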
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Emma |
