Spark bucketing optimization
Suppose a dataset has two partitions. We have bucketed this dataset on one column, "id", into 5 buckets, so 10 part files will be created. Similarly, there is another dataset with three partitions, bucketed on the "id" column into 5 buckets, so 15 part files will be created. Our objective is to join these tables on the id column. My question is: will there be a shuffle? How will the join happen?
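Below is a minimal sketch of the scenario described above, assuming two hypothetical DataFrames (`dfA`, `dfB`) and table names (`table_a`, `table_b`) that are not in the question. It shows how the datasets might be written with `bucketBy` and then joined, and why the part-file counts come out as 2 × 5 and 3 × 5: each writer task produces up to one file per bucket. Because both tables share the same bucket count on the join key, Spark should be able to plan a sort-merge join that reads matching buckets directly and skips the exchange (shuffle) on both sides, although a sort may still appear in the plan.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session and data standing in for the datasets in the question.
val spark = SparkSession.builder()
  .appName("bucketed-join-sketch")
  .getOrCreate()

import spark.implicits._

val dfA = (1 to 100).map(i => (i, s"a_$i")).toDF("id", "payload_a").repartition(2) // 2 partitions
val dfB = (1 to 100).map(i => (i, s"b_$i")).toDF("id", "payload_b").repartition(3) // 3 partitions

// Write each dataset bucketed into 5 buckets on "id".
// Each writer task emits up to one file per bucket, so roughly
// 2 x 5 = 10 and 3 x 5 = 15 part files, as described in the question.
dfA.write.bucketBy(5, "id").sortBy("id").saveAsTable("table_a")
dfB.write.bucketBy(5, "id").sortBy("id").saveAsTable("table_b")

// Both tables have the same bucket count on the join key, so Spark can
// co-locate matching buckets and avoid the shuffle stage of the join.
val joined = spark.table("table_a").join(spark.table("table_b"), "id")

// Inspect the physical plan: there should be no Exchange under the SortMergeJoin
// (bucketing must be enabled, which is the default via spark.sql.sources.bucketing.enabled).
joined.explain()
```

Note that `bucketBy` only works together with `saveAsTable`; writing with a plain `save` to a path will not preserve the bucketing metadata that the join optimization relies on.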
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
