Spark dataframe foreachPartition: sum the elements using pyspark

I am trying to partition a Spark dataframe and sum the elements in each partition using pyspark, but I am unable to do this inside a called function "sumByHour". Basically, I cannot access the dataframe columns inside "sumByHour".

I am partitioning by the "hour" column and trying to sum the elements within each "hour" partition, so the expected output is 6, 15, 24 for hours 0, 1, 2 respectively. I tried the below with no luck.

from pyspark.sql.functions import * 
from pyspark.sql.types import *

import pandas as pd

def sumByHour(ip):
    print(ip)

pandasDF = pd.DataFrame({'hour': [0,0,0,1,1,1,2,2,2], 'numlist': [1,2,3,4,5,6,7,8,9]})
myschema = StructType(
                    [StructField('hour', IntegerType(), False),
                     StructField('numlist', IntegerType(), False)] 
                  )
myDf = spark.createDataFrame(pandasDF, schema=myschema)
myDf = myDf.repartition(3, "hour")
myDf.foreachPartition(sumByHour)

I am able to solve this with "Window.partitionBy", but I want to know if it can be solved with "foreachPartition".

Thanks in Advance,

Sri



Solution 1:[1]

You can use a Window to do that and add the per-hour sum as a new "sumByHour" column.

from pyspark.sql import functions, Window

w = Window.partitionBy("hour")

myDf = myDf.withColumn("sumByHour", functions.sum("numlist").over(w))
myDf.show()

+----+-------+---------+
|hour|numlist|sumByHour|
+----+-------+---------+
|   1|      4|       15|
|   1|      5|       15|
|   1|      6|       15|
|   2|      7|       24|
|   2|      8|       24|
|   2|      9|       24|
|   0|      1|        6|
|   0|      2|        6|
|   0|      3|        6|
+----+-------+---------+

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 BoomBoxBoy