Spark: how to indicate to the driver node that one item is done?
// work split
spark.sparkContext.parallelize(1 to 10000).map(item => doTask(item)).collect()
Here I need to write a log entry to a database for each doTask call. It is not easy to serialize the dbManager and send it to the worker nodes. Is there any way for Spark to signal back to the driver node that a task is done, so that the logging can happen on the driver node?
Solution 1:[1]
I can think of three options:
map
Within the map function (in the example, inside the method doTask) you could instantiate the dbManager and run the logging code. The dbManager will then be created on the executor/worker nodes. However, this option results in very poor performance of the Spark job, as one db connection is created per element of the rdd.
mapPartitions
With mapPartitions, only one dbManager per partition of the rdd is created on the executor/worker nodes. Creating a db connection within a Spark task is in fact the typical use case for mapPartitions.
val result = spark.sparkContext.parallelize(1 to 10000).mapPartitions(it => {
  // initialize and use the database connection here (once per partition)
  for (item <- it) yield {
    doTask(item)
  }
}).collect()
Depending on the size of the rdd and the number of partitions (or on whether a good partitioner can be found), this option can perform well.
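A slightly fuller sketch of the mapPartitions option, under stated assumptions: DbManager with its logDone and close methods is a hypothetical stand-in for the question's dbManager, and doTask is a placeholder. The iterator handed to mapPartitions is lazy, so this version materializes the partition before closing the connection; otherwise the connection could already be closed when the items are actually processed. The price is that one partition's results are held in memory at a time.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the question's dbManager; replace with the real class.
class DbManager {
  def logDone(item: Int): Unit = println(s"item $item done")
  def close(): Unit = ()
}

object MapPartitionsSketch {
  // Placeholder for the real work.
  def doTask(item: Int): Int = item * 2

  // Runs on an executor: one DbManager is created per partition.
  def processPartition(items: Iterator[Int]): Iterator[Int] = {
    val db = new DbManager()
    try {
      // toList forces evaluation while the connection is still open.
      items.map { item =>
        val r = doTask(item)
        db.logDone(item)
        r
      }.toList.iterator
    } finally {
      db.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mapPartitions-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    val result = spark.sparkContext
      .parallelize(1 to 10000)
      .mapPartitions(it => processPartition(it))
      .collect()
    println(result.length)
    spark.stop()
  }
}
```

The DbManager is constructed inside the task, so it never has to be serialized and shipped from the driver.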
Kafka or similar
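The third option would be to use a messaging system outside of Spark, for example Kafka: the tasks publish a message for each finished item, and a separate consumer writes the log entries to the database.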
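A rough sketch of how the Kafka variant might look, assuming the kafka-clients library is on the executor classpath. The broker address localhost:9092, the topic name task-done, and the doTask body are all assumptions for illustration, not part of the original answer. Each partition creates its own producer and sends one small message per finished item; a consumer elsewhere (not shown) would read the topic and do the database logging.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object KafkaSketch {
  val Topic = "task-done" // hypothetical topic name

  // Placeholder for the real work.
  def doTask(item: Int): Int = item * 2

  // Message announcing that one item has been processed.
  def doneRecord(item: Int): ProducerRecord[String, String] =
    new ProducerRecord[String, String](Topic, item.toString, "done")

  // Runs on an executor: one producer per partition.
  def processPartition(items: Iterator[Int]): Iterator[Int] = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)
    val producer = new KafkaProducer[String, String](props)
    try {
      // toList forces evaluation before the producer is closed.
      items.map { item =>
        val r = doTask(item)
        producer.send(doneRecord(item))
        r
      }.toList.iterator
    } finally {
      producer.close() // close() also flushes pending messages
    }
  }
}
```

It would be plugged in with something like rdd.mapPartitions(it => KafkaSketch.processPartition(it)).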
Solution 2:[2]
A fourth, and probably better, option is to extend org.apache.spark.scheduler.SparkListener and implement its onTaskEnd method according to your needs.
onTaskEnd
public void onTaskEnd(SparkListenerTaskEnd taskEnd)
Description copied from interface: SparkListenerInterface
Called when a task ends.
Specified by: onTaskEnd in interface SparkListenerInterface
Parameters: taskEnd - (undocumented)
Then register this listener with SparkContext using addSparkListener in order for it to receive the task completion details.
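A minimal sketch of such a listener, with some assumptions: TaskEndLogger and the log callback are hypothetical names, and doTask is a placeholder. The listener runs on the driver, so the callback can wrap a non-serializable dbManager. Note that a Spark task corresponds to a partition, not to a single element, so onTaskEnd fires once per partition; events are also delivered asynchronously, so they may arrive slightly after the action returns.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

// Lives on the driver; `log` could delegate to the real dbManager.
class TaskEndLogger(log: String => Unit) extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    if (info != null) { // defensive check before reading task details
      log(s"stage=${taskEnd.stageId} partition=${info.index} successful=${info.successful}")
    }
  }
}

object ListenerSketch {
  // Placeholder for the real work.
  def doTask(item: Int): Int = item * 2

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("listener-sketch")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // Register before running the job so its task events are seen.
    sc.addSparkListener(new TaskEndLogger(msg => println(msg)))

    // 100 partitions, so 100 tasks and 100 onTaskEnd calls for this job.
    val result = sc.parallelize(1 to 10000, 100).map(item => doTask(item)).collect()
    println(result.length)
    spark.stop()
  }
}
```

If a completion signal per item is really needed, one partition per item would produce one task per item, but at 10000 items the scheduling overhead is likely to be significant.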
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | werner |
| Solution 2 | mazaneicha |
