'spark how to indicate to driver node one item is done?

//work split
spark.parallelize(1, 10k).map(item => doTask(i)).collect()

where I have some logging to do in database for doTask(i). it's uneasy to serialize the dbManager and send to worker node. is there anyway for spark to indicate back to driver node that taskI is done and then do the logging on driver node ?



Solution 1:[1]

I can think of three options:

map

Within the map function (in the example within the method doTask) you could instantiate the dbManager and run the logging code. The dbManager will be created on the executor/worker nodes. However, this option will result in a very bad performance of the Spark job as there will be one db connection created per element of the rdd.

mapPartitions

Using mapPartitions only one dbManager per partition of the rdd will be created on the executor/worker nodes. Actually, creating a db connection within a Spark task is the typical use case for mapPartitions.

val result = spark.sparkContext.parallelize(1 to 10000).mapPartitions(it => {
  //initalize and use database connection here
  for( item <- it) yield {
    doTask(item)
  }
}).collect()

Depending on the size of the rdd and the number of partitions (or the possibility to find a good partitioner) this option will provide good performance characteristics.

Kafka or similar

The third option would be to use a messaging technique outside of Spark, for example Kafka.

Solution 2:[2]

The forth, and probably better option is to extend org.apache.spark.scheduler.SparkListener and provide implementation of onTaskEnd method according to your needs.

onTaskEnd

public void onTaskEnd(SparkListenerTaskEnd taskEnd)

Description copied from interface: SparkListenerInterface
Called when a task ends

Specified by: onTaskEnd in interface SparkListenerInterface

Parameters: taskEnd - (undocumented)

Then register this listener with SparkContext using addSparkListener in order for it to receive the task completion details.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 werner
Solution 2 mazaneicha