How can I process chunks of data in a Spark DataFrame?
Since my data is really huge (in the TBs), I am trying to process and write it in smaller chunks in the fashion below, but I don't see any output in S3:
import org.apache.spark.sql.{DataFrame, Row, SaveMode}
import spark.implicits._  // assumes a SparkSession named spark (e.g. spark-shell)

val df1 = Seq((1, "Jill"), (2, "John")).toDF("id", "name")
val df2 = Seq((1, "accounts"), (2, "finance")).toDF("id", "dept")

// Turn a single row back into a one-row DataFrame, join it with df2,
// and append the result to S3 as snappy-compressed Parquet.
def joinDataSets(row: Row, df2: DataFrame): Unit = {
  val df1 = Seq((row.getAs[Int]("id"), row.getAs[String]("name"))).toDF("id", "name")
  df1.join(df2, df1("id") === df2("id"), "left_outer")
    .select("*")
    .write
    .mode(SaveMode.Append)
    .option("compression", "snappy")
    .parquet(s"s3://$bucket/test/data")  // bucket is assumed to be defined elsewhere
}

df1.rdd.mapPartitions { partition =>
  partition.map { row =>
    joinDataSets(row, df2)
  }
}
How can I make this work? Note: both the data sets are huge.
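For what it's worth, below is a minimal sketch of a more conventional approach; it assumes a SparkSession named spark and a placeholder bucket variable (both hypothetical, not from the question). Two things stand out in the attempt above: the mapPartitions pipeline is lazy and is never consumed by an action, and DataFrames cannot be created or written from inside executor-side code, so the per-row joinDataSets call cannot produce output. Instead, one distributed join lets Spark split the work itself, and the "chunking" of the output is controlled by the number of write partitions.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("join-and-write").getOrCreate()
import spark.implicits._

val bucket = "my-bucket"  // hypothetical bucket name

val df1 = Seq((1, "Jill"), (2, "John")).toDF("id", "name")
val df2 = Seq((1, "accounts"), (2, "finance")).toDF("id", "dept")

df1.join(df2, Seq("id"), "left_outer")  // one distributed join; Spark shuffles both sides across partitions
  .repartition(200)                     // number of output files ("chunks"); tune to the data volume
  .write
  .mode(SaveMode.Append)
  .option("compression", "snappy")
  .parquet(s"s3://$bucket/test/data")   // may need the s3a:// scheme outside EMR

Each write task produces its own part file, so the data never has to fit on one machine; when both sides are huge, the join runs as a distributed sort-merge join. Since Spark 2.2, the maxRecordsPerFile write option (or the spark.sql.files.maxRecordsPerFile setting) can also cap the size of individual output files instead of repartitioning.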
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow