Spark - parallel computation for different dataframes

A premise: this question might sound idiotic, but I guess I fell into confusion and/or ignorance.

The question is: does Spark already optimize its physical plan so that computations on unrelated dataframes run in parallel? If not, would it be advisable to try and parallelize such processes myself? Example below.

Let's assume I have the following scenario:

val df1 = read table into dataframe
val df2 = read another table into dataframe

val aTransformationOnDf1 = df1.filter(condition).doSomething

val aSubSetOfTransformationOnDf1 = aTransformationOnDf1.doSomeOperations

// Push to Kafka
aSubSetOfTransformationOnDf1.toJSON.pushToKafkaTopic

val anotherTransformationOnDf1WithDf2 = df1.filter(anotherCondition).join(df2).doSomethingElse
val yetAnotherTransformationOnDf1WithDf2 = df1.filter(aThirdCondition).join(df2).doAnotherThing 

val unionAllTransformation = aTransformationOnDf1
  .union(anotherTransformationOnDf1WithDf2)
  .union(yetAnotherTransformationOnDf1WithDf2)
 
unionAllTransformation.write.mode(whatever).partitionBy(partitionColumn).save(wherever)

Basically I have two initial dataframes. One is an event log with past events and new events to process. As an example:

  • a subset of these new events must be processed and pushed to Kafka.
  • a subset of the past events could have updates, so they must be processed alone
  • another subset of the past events could have another kind of updates, so they must be processed alone

In the end, all processed events are unified into one dataframe to be written back to the event log table.
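To be a bit more concrete, the shape of the pipeline is roughly this (table names, columns, conditions and modes are all invented purely for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("eventLog").getOrCreate()

// df1: the event log (past + new events); df2: a reference table to join against
val events    = spark.read.table("event_log")
val reference = spark.read.table("reference_table")

// 1. New events, serialized as JSON (the actual Kafka push is omitted here)
val newEvents     = events.filter(col("status") === "new")
val newEventsJson = newEvents.toJSON

// 2./3. Two kinds of updates to past events, each joined with the reference
// table; selecting the original columns keeps the schemas union-compatible
val updatesA = events.filter(col("updateType") === "A")
  .join(reference, Seq("eventId"))
  .select(events.columns.map(col): _*)
val updatesB = events.filter(col("updateType") === "B")
  .join(reference, Seq("eventId"))
  .select(events.columns.map(col): _*)

// Everything back into one dataframe, written to the event log table
val allProcessed = newEvents.union(updatesA).union(updatesB)
allProcessed.write.mode("append").partitionBy("eventDate").saveAsTable("event_log")
```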

Question: does Spark process the different subsets in parallel, or sequentially (with only the computation within each individual dataframe performed in a distributed fashion)? If not, could we enforce parallel computation of each individual subset before the union? I know Scala has a Future construct, though I have never used it. Something like:

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.util.{Failure, Success}

    def unionAllDataframes(df1: DataFrame, df2: DataFrame, df3: DataFrame): Future[DataFrame] = {
        Future { df1.union(df2).union(df3) }
    }
    
    // At the end
    val finalDf = unionAllDataframes(
       aTransformationOnDf1,
       anotherTransformationOnDf1WithDf2,
       yetAnotherTransformationOnDf1WithDf2)
    
    finalDf.onComplete({
      case Success(df) => df.write(etc...)
      case Failure(exception) => handleException(exception)
    })

Sorry for the horrendous design and probably the wrong usage of Future. Once again, I am a bit confused about this scenario and I am trying to micro-optimize this passage (if possible).
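For completeness, my (possibly wrong) understanding is that the transformations themselves are lazy, so if Future helps at all it would have to wrap the actions (the Kafka push and the write) rather than the union. Something like this, maybe (the thread pool size, write mode and path are made up):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// A dedicated pool, since each Spark action blocks the thread it runs on
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2))

// Each Future wraps an *action*; the transformations above are lazy and only
// describe the plan, so wrapping them alone would not execute anything
val kafkaPush = Future {
  aSubSetOfTransformationOnDf1.toJSON // ...push to the Kafka topic here
}
val tableWrite = Future {
  unionAllTransformation.write.mode("append").partitionBy("someColumn").save("/some/path")
}

// Block until both actions have finished (or one of them has failed)
Await.result(Future.sequence(Seq(kafkaPush, tableWrite)), Duration.Inf)
```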

Thanks a lot in advance! Cheers



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
