Spark number of partitions

I have a question regarding the number of partitions in a DataFrame when data is read from S3. I have read in different forums that by default Spark creates partitions of 36 MB, meaning a 100 MB file would be split into 3 partitions. But when I tried it on Amazon EMR, I got 2 partitions for a 10 MB file. Is this expected, or am I missing something here? I am using

df.rdd.getNumPartitions()

method to view the number of partitions. Moreover, when I tried reading a 3 GB dataset spread over 7 files, I got 31 partitions, so I'm a bit confused here. Did I get 2 partitions for the 10 MB file because PySpark gets launched with 2 as the default value for the --num-executors flag?
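
For context, this is roughly the workflow I am running (a minimal sketch; the S3 path and CSV format are placeholders, not the actual dataset):

from pyspark.sql import SparkSession

# Minimal sketch of the setup described above; the bucket path is a placeholder.
spark = SparkSession.builder.appName("partition-count-check").getOrCreate()

# Read a file from S3 into a DataFrame.
df = spark.read.csv("s3://my-bucket/path/to/file.csv", header=True)

# Print how many partitions the DataFrame was split into on read.
print(df.rdd.getNumPartitions())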


