Mongo write taking too long with PySpark (sharded cluster)
I'm trying to read Parquet files and dump them into a sharded MongoDB collection. When I do it without sharding, the write throughput is really good, but after sharding it has dropped drastically.
A single task is taking 30+ minutes while processing only 16 MB of data.
I'm using the Spark config below:
```python
(
    SparkConf()
    .setMaster("yarn")
    .set("spark.executor.memory", "30g")
    .set("spark.executor.instances", "10")
    .set("spark.executor.cores", "5")
    .set("spark.sql.shuffle.partitions", "2000")
    .set("spark.network.timeout", "800")
    .set("spark.sql.broadcastTimeout", "1200")
    .set("spark.default.parallelism", "2000")
    .set("spark.jars", "./mongo*.jar")
    .set("spark.mongodb.input.uri", mongo_uri)
    .set("spark.mongodb.input.database", db)
    .set("spark.mongodb.input.collection", db_collection)
    .set("spark.mongodb.output.uri", mongo_uri)
    .set("spark.mongodb.output.database", db)
    .set("spark.mongodb.output.collection", db_collection)
    .set("spark.mongodb.input.partitionerOptions.partitionKey", shard_key)
    .set("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
    .set("spark.mongodb.input.partitionerOptions.shardkey", shard_key)
)
```
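The write call itself isn't shown in the question; for the v2.x MongoDB Spark connector (the `com.mongodb.spark.sql.DefaultSource` API that matches the `spark.mongodb.output.*` settings above) it would look roughly like the sketch below. The `ordered` and `maxBatchSize` write options come from the connector's `WriteConfig`; the input path and the option values here are illustrative assumptions, not tuned recommendations.

```python
# Hedged sketch, assuming the MongoDB Spark connector v2.x API.
# Paths and option values are illustrative, not tested settings.
from pyspark.sql import SparkSession

spark = SparkSession.builder.config(conf=conf).getOrCreate()  # conf = SparkConf above

df = spark.read.parquet("/path/to/parquet")  # hypothetical input path

(
    df.write.format("com.mongodb.spark.sql.DefaultSource")
    .mode("append")
    # Unordered bulk writes let mongos fan batches out across shards in parallel
    .option("ordered", "false")
    # Smaller batches keep bulk ops modest with ~250 KB documents
    .option("maxBatchSize", 64)
    .save()
)
```

With an unordered bulk load, a single failed insert doesn't stall the rest of the batch, which tends to matter most when writes are being routed to multiple shards.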
I'm looking to dump 20 billion plus records. Eight hours in, it has only inserted around 800 million documents.
The documents are all the same size, roughly 250 KB each.
No additional indexes are being used.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow