Pyspark - calculate the percentage of an item combo appearing in customer transaction data [closed]

My dataset looks like this. The table is the transaction table for customers (it records each transaction a customer made); the List column contains the products the customer purchased in that transaction. For each customer, I want to find the percentage of their transactions that include the combo of products 'b' and 'c'.

|ID     |AMOUNT |     List     |
|:------|:-----:|-------------:|
|    123|    1  |['a','b','c'] |
|    123|    3  |['a','b','c'] |
|    123|    4  |['b','c']     |
|    123|    4  |['a']         |
|    456|    5  |['a','b','c'] |
|    456|    6  |['b']         |

So the ideal output table I would like to get using pyspark looks like this:

|Customer |Amount   | Product 'b' and 'c' percentage |
|:--------|:-------:|:------------------------------:|
|123      |[1,3,4,4]|   0.75                         |
|456      |[5,6]    |   0.5                          |


Solution 1:[1]

The idea is to group by the ID and then count the aggregated rows twice:

  • all rows, and
  • only the rows where List contains both elements b and c, detected with array_intersect.

from pyspark.sql import functions as F
df.groupBy('ID') \
    .agg(
        F.count('AMOUNT').alias('cnt'), 
        F.count(F.when(F.size(F.array_intersect('List', F.array(F.lit('b'),F.lit('c')))) >= 2 ,
            F.col('AMOUNT'))).alias('cnt_bc')
        )\
    .withColumn('result', F.col('cnt_bc') / F.col('cnt')) \
    .show()

prints

+---+---+------+------+
| ID|cnt|cnt_bc|result|
+---+---+------+------+
|123|  4|     3|  0.75|
|456|  2|     1|   0.5|
+---+---+------+------+

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 werner