Mongo write taking too long with PySpark (sharded cluster)
I'm trying to read Parquet files and dump them into a sharded MongoDB collection. When I do it without sharding, the write throughput is really good, but after sharding it has dropped drastically.
A single task is taking 30+ minutes while processing only 16 MB of data.
I'm using the Spark config below:
```python
(
    SparkConf()
    .setMaster("yarn")
    .set("spark.executor.memory", "30g")
    .set("spark.executor.instances", "10")
    .set("spark.executor.cores", "5")
    .set("spark.sql.shuffle.partitions", "2000")
    .set("spark.network.timeout", "800")
    .set("spark.sql.broadcastTimeout", "1200")
    .set("spark.default.parallelism", "2000")
    .set("spark.jars", "./mongo*.jar")
    .set("spark.mongodb.input.uri", mongo_uri)
    .set("spark.mongodb.input.database", db)
    .set("spark.mongodb.input.collection", db_collection)
    .set("spark.mongodb.output.uri", mongo_uri)
    .set("spark.mongodb.output.database", db)
    .set("spark.mongodb.output.collection", db_collection)
    .set("spark.mongodb.input.partitionerOptions.partitionKey", shard_key)
    .set("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
    .set("spark.mongodb.input.partitionerOptions.shardkey", shard_key)
)
```
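The write call itself isn't shown in the question; for the v2.x MongoDB Spark connector (the `com.mongodb.spark.sql.DefaultSource` API that matches the `spark.mongodb.output.*` settings above) it would look roughly like the sketch below. The `ordered` and `maxBatchSize` write options come from the connector's `WriteConfig`; the input path and the option values here are illustrative assumptions, not tuned recommendations.

```python
# Hedged sketch, assuming the MongoDB Spark connector v2.x API.
# Paths and option values are illustrative, not tested settings.
from pyspark.sql import SparkSession

spark = SparkSession.builder.config(conf=conf).getOrCreate()  # conf = SparkConf above

df = spark.read.parquet("/path/to/parquet")  # hypothetical input path

(
    df.write.format("com.mongodb.spark.sql.DefaultSource")
    .mode("append")
    # Unordered bulk writes let mongos fan batches out across shards in parallel
    .option("ordered", "false")
    # Smaller batches keep bulk ops modest with ~250 KB documents
    .option("maxBatchSize", 64)
    .save()
)
```

With an unordered bulk load, a single failed insert doesn't stall the rest of the batch, which tends to matter most when writes are being routed to multiple shards.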
I'm looking to dump 20 billion plus records. Eight hours in, it has only inserted around 800 million documents.
The documents are all the same size, roughly 250 KB each.
No additional indexes are being used.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow