How can I process chunks of data in a Spark DataFrame?
Since my data is really huge (in the TBs), I am trying to process and write it in smaller chunks in the fashion below, but I don't see any output in S3:
import org.apache.spark.sql.{DataFrame, Row, SaveMode}
import spark.implicits._  // assumes a SparkSession named spark (e.g. spark-shell)

val df1 = Seq((1, "Jill"), (2, "John")).toDF("id", "name")
val df2 = Seq((1, "accounts"), (2, "finance")).toDF("id", "dept")

// Turn a single row back into a one-row DataFrame, join it with df2,
// and append the result to S3 as snappy-compressed Parquet.
def joinDataSets(row: Row, df2: DataFrame): Unit = {
  val df1 = Seq((row.getAs[Int]("id"), row.getAs[String]("name"))).toDF("id", "name")
  df1.join(df2, df1("id") === df2("id"), "left_outer")
    .select("*")
    .write
    .mode(SaveMode.Append)
    .option("compression", "snappy")
    .parquet(s"s3://$bucket/test/data")  // bucket is assumed to be defined elsewhere
}

df1.rdd.mapPartitions { partition =>
  partition.map { row =>
    joinDataSets(row, df2)
  }
}
How can I make this work? Note: both the data sets are huge.
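For what it's worth, below is a minimal sketch of a more conventional approach; it assumes a SparkSession named spark and a placeholder bucket variable (both hypothetical, not from the question). Two things stand out in the attempt above: the mapPartitions pipeline is lazy and is never consumed by an action, and DataFrames cannot be created or written from inside executor-side code, so the per-row joinDataSets call cannot produce output. Instead, one distributed join lets Spark split the work itself, and the "chunking" of the output is controlled by the number of write partitions.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("join-and-write").getOrCreate()
import spark.implicits._

val bucket = "my-bucket"  // hypothetical bucket name

val df1 = Seq((1, "Jill"), (2, "John")).toDF("id", "name")
val df2 = Seq((1, "accounts"), (2, "finance")).toDF("id", "dept")

df1.join(df2, Seq("id"), "left_outer")  // one distributed join; Spark shuffles both sides across partitions
  .repartition(200)                     // number of output files ("chunks"); tune to the data volume
  .write
  .mode(SaveMode.Append)
  .option("compression", "snappy")
  .parquet(s"s3://$bucket/test/data")   // may need the s3a:// scheme outside EMR

Each write task produces its own part file, so the data never has to fit on one machine; when both sides are huge, the join runs as a distributed sort-merge join. Since Spark 2.2, the maxRecordsPerFile write option (or the spark.sql.files.maxRecordsPerFile setting) can also cap the size of individual output files instead of repartitioning.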
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow