Spark bucketing optimization
Suppose a dataset has two partitions. We have bucketed this dataset on one column, "id", into 5 buckets, so 10 part files will be created. Similarly, there is another dataset with three partitions, bucketed on the "id" column into 5 buckets, so 15 part files will be created. Our objective is to join these tables on the id column. My question is: will there be a shuffle? How will the join happen?
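Below is a minimal sketch of the scenario described above, assuming two hypothetical DataFrames (`dfA`, `dfB`) and table names (`table_a`, `table_b`) that are not in the question. It shows how the datasets might be written with `bucketBy` and then joined, and why the part-file counts come out as 2 × 5 and 3 × 5: each writer task produces up to one file per bucket. Because both tables share the same bucket count on the join key, Spark should be able to plan a sort-merge join that reads matching buckets directly and skips the exchange (shuffle) on both sides, although a sort may still appear in the plan.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session and data standing in for the datasets in the question.
val spark = SparkSession.builder()
  .appName("bucketed-join-sketch")
  .getOrCreate()

import spark.implicits._

val dfA = (1 to 100).map(i => (i, s"a_$i")).toDF("id", "payload_a").repartition(2) // 2 partitions
val dfB = (1 to 100).map(i => (i, s"b_$i")).toDF("id", "payload_b").repartition(3) // 3 partitions

// Write each dataset bucketed into 5 buckets on "id".
// Each writer task emits up to one file per bucket, so roughly
// 2 x 5 = 10 and 3 x 5 = 15 part files, as described in the question.
dfA.write.bucketBy(5, "id").sortBy("id").saveAsTable("table_a")
dfB.write.bucketBy(5, "id").sortBy("id").saveAsTable("table_b")

// Both tables have the same bucket count on the join key, so Spark can
// co-locate matching buckets and avoid the shuffle stage of the join.
val joined = spark.table("table_a").join(spark.table("table_b"), "id")

// Inspect the physical plan: there should be no Exchange under the SortMergeJoin
// (bucketing must be enabled, which is the default via spark.sql.sources.bucketing.enabled).
joined.explain()
```

Note that `bucketBy` only works together with `saveAsTable`; writing with a plain `save` to a path will not preserve the bucketing metadata that the join optimization relies on.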
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
