Pyspark - calculate the percentage of items appearing in customer transaction data [closed]
My dataset looks like this. It is the customers' transaction table (each row records one transaction a customer made), and the `List` column contains the products purchased in that transaction. For each customer, I want to find the percentage of their transactions that contain the product combo 'b' and 'c'.
|ID |AMOUNT | List |
|:------|:-----:|-------------:|
| 123| 1 |['a','b','c'] |
| 123| 3 |['a','b','c'] |
| 123| 4 |['b','c'] |
| 123| 4 |['a'] |
| 456| 5 |['a','b','c'] |
| 456| 6 |['b'] |
The ideal output table I would like to get using PySpark looks like this:
|customer |Amount   | product 'b' and 'c' percentage |
|:--------|:-------:|-------------------------------:|
|123      |[1,3,4,4]| 0.75                           |
|456      |[5,6]    | 0.5                            |
Solution 1:
The idea is to group by `ID` and then count the aggregated rows twice:
- all rows, and
- only the rows where `List` contains both elements `b` and `c`, checked with `array_intersect`.
```python
from pyspark.sql import functions as F

df.groupBy('ID') \
  .agg(
      F.count('AMOUNT').alias('cnt'),
      F.count(F.when(F.size(F.array_intersect('List', F.array(F.lit('b'), F.lit('c')))) >= 2,
                     F.col('AMOUNT'))).alias('cnt_bc')
  ) \
  .withColumn('result', F.col('cnt_bc') / F.col('cnt')) \
  .show()
```
which prints
```
+---+---+------+------+
| ID|cnt|cnt_bc|result|
+---+---+------+------+
|123|  4|     3|  0.75|
|456|  2|     1|   0.5|
+---+---+------+------+
```
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | werner |
