How to do value counts with normalize in pyspark with group by
I have a pyspark dataframe that looks like this (constructed below as a pandas DataFrame for readability):
import pandas as pd

so = pd.DataFrame({'id': ['a','a','a','a','b','b','b','b','c','c','c','c'],
                   'time': [1,2,3,4,1,2,3,4,1,2,3,4],
                   'group': ['A','A','A','A','A','A','A','A','B','B','B','B'],
                   'value': ['S','C','C','C','S','C','H','H','S','C','C','C']})
I would like to calculate the frequency (as a percentage) of each value within each group.
The output should look like this:
group value perc
0 A C 0.50
1 A H 0.25
2 A S 0.25
3 B C 0.75
4 B S 0.25
The pandas equivalent would be:
so.groupby('group')['value'].value_counts(normalize=True).to_frame('perc').reset_index()
but how can I do the same operation in pyspark?
Solution 1:[1]
You can do this by generating a per-group count (count_agg), joining it back to the input DataFrame, and finally aggregating to get the required normalised value counts, as shown below.
Data Preparation
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({'id': ['a','a','a','a','b','b','b','b','c','c','c','c'],
                   'time': [1,2,3,4,1,2,3,4,1,2,3,4],
                   'group': ['A','A','A','A','A','A','A','A','B','B','B','B'],
                   'value': ['S','C','C','C','S','C','H','H','S','C','C','C']})
sparkDF = spark.createDataFrame(df)
sparkDF.show()
+---+----+-----+-----+
| id|time|group|value|
+---+----+-----+-----+
| a| 1| A| S|
| a| 2| A| C|
| a| 3| A| C|
| a| 4| A| C|
| b| 1| A| S|
| b| 2| A| C|
| b| 3| A| H|
| b| 4| A| H|
| c| 1| B| S|
| c| 2| B| C|
| c| 3| B| C|
| c| 4| B| C|
+---+----+-----+-----+
Group Count Aggregation
count_agg = sparkDF.groupBy('group').agg(F.count(F.col('group')).alias('group_count'))
count_agg.show()
+-----+-----------+
|group|group_count|
+-----+-----------+
| B| 4|
| A| 8|
+-----+-----------+
Join & Aggregation - Value Counts
sparkDF_agg = (
    sparkDF.join(count_agg, sparkDF['group'] == count_agg['group'], 'inner')
           .select(sparkDF['*'], count_agg['group_count'])
           .groupBy('group', 'value', 'group_count')
           .agg(F.count(F.col('value')).alias('count'))
           .withColumn('perc', F.col('count') / F.col('group_count'))
           .orderBy('group', 'value')
)
sparkDF_agg.show()
+-----+-----+-----------+-----+----+
|group|value|group_count|count|perc|
+-----+-----+-----------+-----+----+
| A| C| 8| 4| 0.5|
| A| H| 8| 2|0.25|
| A| S| 8| 2|0.25|
| B| C| 4| 3|0.75|
| B| S| 4| 1|0.25|
+-----+-----+-----------+-----+----+
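The same result can also be reached without an explicit join by computing the per-group total with a window function. This is a minimal sketch, not part of the original answer; it assumes the sparkDF and the F alias from the Data Preparation block above:

from pyspark.sql import Window

w = Window.partitionBy('group')

window_agg = (
    sparkDF.groupBy('group', 'value').count()
           # total rows per group, computed with a window instead of a join
           .withColumn('group_count', F.sum('count').over(w))
           .withColumn('perc', F.col('count') / F.col('group_count'))
           .orderBy('group', 'value')
)
window_agg.show()

This avoids materialising and joining a separate count_agg DataFrame, at the cost of a window shuffle over the same grouping key.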
Solution 2:[2]
from pyspark.sql.functions import collect_list, expr, sum

# `so` here is the Spark DataFrame built from the question's data
# (reusing sparkDF from Solution 1)
so = sparkDF

s1 = (
    # get the count of each value per group
    so.groupby('group', 'value').count()
      # collect arrays of counts and values, plus the total count per group
      .groupby('group')
      .agg(collect_list('count').alias('perc'),
           collect_list('value').alias('value'),
           sum('count').alias('sum'))
      # use a higher-order function to turn each count into a percentage
      .withColumn('perc', expr("transform(perc, (x, i) -> x / sum)")).drop('sum')
      # zip the arrays together and explode into rows
      .selectExpr("inline(arrays_zip(array_repeat(group, size(value)), value, perc))")
)
s1.show(truncate=False)
+---+-----+----+
| 0|value|perc|
+---+-----+----+
| B| S|0.25|
| B| C|0.75|
| A| S|0.25|
| A| C| 0.5|
| A| H|0.25|
+---+-----+----+
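The first column is named 0 because arrays_zip assigns positional field names to unaliased expressions. If you want column names and row order matching the expected output, a small optional follow-up (not part of the original answer) is:

s1.withColumnRenamed('0', 'group').orderBy('group', 'value').show()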
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | wwnde |
