Spark dataframe foreachPartition: sum the elements using pyspark
I am trying to partition a Spark dataframe and sum the elements in each partition using PySpark, but I cannot access the dataframe columns inside the called function "sumByHour".
I am partitioning by the "hour" column and trying to sum "numlist" within each "hour" partition, so the expected output is 6, 15, 24 for hours 0, 1, 2 respectively. I tried the code below with no luck.
from pyspark.sql.types import StructType, StructField, IntegerType
import pandas as pd
def sumByHour(ip):
    # ip is an iterator over the rows of one partition, not a dataframe
    print(ip)
pandasDF = pd.DataFrame({'hour': [0,0,0,1,1,1,2,2,2], 'numlist': [1,2,3,4,5,6,7,8,9]})
myschema = StructType(
    [StructField('hour', IntegerType(), False),
     StructField('numlist', IntegerType(), False)]
)
myDf = spark.createDataFrame(pandasDF, schema=myschema)
# repartition by hour; reassign so the call below uses the repartitioned dataframe
myDf = myDf.repartition(3, "hour")
myDf.foreachPartition(sumByHour)
I am able to solve this with "Window.partitionBy", but I want to know whether it can also be solved with "foreachPartition".
Thanks in Advance,
Sri
Solution 1:[1]
You can use a Window to do that and add the sumByHour as a new column.
from pyspark.sql import functions, Window
w = Window.partitionBy("hour")
myDf = myDf.withColumn("sumByHour", functions.sum("numlist").over(w))
myDf.show()
+----+-------+---------+
|hour|numlist|sumByHour|
+----+-------+---------+
|   1|      4|       15|
|   1|      5|       15|
|   1|      6|       15|
|   2|      7|       24|
|   2|      8|       24|
|   2|      9|       24|
|   0|      1|        6|
|   0|      2|        6|
|   0|      3|        6|
+----+-------+---------+
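As for the foreachPartition part of the question: the function passed to foreachPartition receives an iterator over the Row objects of one partition, and foreachPartition itself returns nothing, so anything printed ends up on the executors rather than in the driver. To get the sums back, one option is to use rdd.mapPartitions with a function of the same shape. Note also that repartition(3, "hour") hash-partitions the data, so a single partition may hold more than one hour (or none), which is why the sketch below groups by hour inside the partition. This is only a minimal sketch: sum_by_hour and FakeRow are names made up for illustration, and FakeRow is a namedtuple standing in for pyspark.sql.Row so the function can be tried without a cluster.

```python
from collections import defaultdict, namedtuple


def sum_by_hour(rows):
    """Partition function: consume an iterator of rows, yield (hour, total) pairs."""
    totals = defaultdict(int)
    for row in rows:
        # Columns are reachable as attributes on each row (row.hour, row.numlist)
        totals[row.hour] += row.numlist
    # A partition can contain several hours, so emit one pair per hour seen
    yield from sorted(totals.items())


# Hypothetical stand-in for pyspark.sql.Row, used only for this local check
FakeRow = namedtuple("FakeRow", ["hour", "numlist"])

partition = [FakeRow(0, 1), FakeRow(0, 2), FakeRow(0, 3),
             FakeRow(1, 4), FakeRow(1, 5), FakeRow(1, 6)]
local_result = list(sum_by_hour(iter(partition)))
print(local_result)  # [(0, 6), (1, 15)]

# With a live SparkSession and the myDf from the question (assumed, not run here):
# result = myDf.repartition(3, "hour").rdd.mapPartitions(sum_by_hour).collect()
```

The same function can be passed to myDf.foreachPartition, but the yielded values would then be discarded, so foreachPartition only makes sense if the function writes its result somewhere itself (a database, a file, a log).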
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | BoomBoxBoy |
