Using CombineFn instead of CoGroupByKey to handle hotkeys
I have two very large PCollections, PCollection<KV<Long, XYZ>> and PCollection<KV<Long, ABC>>. I need to create a PCollection<KV<XYZ, ABC>>, which I am able to do using the CoGroupByKey.create() transform. It works fine for smaller data sets, but with hot keys it gets stuck. I am new to Beam and am trying to figure out how to use CombineFn to solve this. For now my code looks like this:
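import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;
final PCollection<KV<Long, XYZ>> xyzKV;
final PCollection<KV<Long, ABC>> abcKV;
final TupleTag<XYZ> t1 = new TupleTag<>();
final TupleTag<ABC> t2 = new TupleTag<>(); // was declared as t1 a second time
// CoGroupByKey yields one CoGbkResult per key; pairing XYZ with ABC values needs a later ParDo.
final PCollection<KV<Long, CoGbkResult>> combinedCollection =
    KeyedPCollectionTuple.of(t1, xyzKV).and(t2, abcKV)
        .apply(CoGroupByKey.create());
// This works fine but has performance issues in case of hotkeys.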
Solution 1:[1]
It depends on which runner you are using and what options it offers. A similar question is "Join two large volumne of PCollection has performance issue".
If you are using the DataflowRunner, you can enable the shuffle service to speed up the process.
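On the CombineFn part of the question: the Beam Java SDK offers Combine.perKey(...).withHotKeyFanout(...), which spreads a hot key over several intermediate sub-keys, pre-combines each one, and then merges the partial results. The following is only a minimal plain-Java sketch of that two-stage idea and uses no Beam types; the class name HotKeyFanoutSketch, the method names, and the choice of a sum as the combine step are all assumptions made for illustration. It only applies when the per-key step is an associative, commutative aggregation. A join that emits every XYZ/ABC pair is not such a reduction, so whether this helps here depends on what is done with the joined values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: simulates hot-key fanout in memory, without Beam.
class HotKeyFanoutSketch {

    // Stage 1: spread the values of each key over `fanout` shards and
    // pre-combine (here: sum) inside every (key, shard) bucket.
    // Each input element is a {key, value} pair.
    static Map<Long, long[]> partialSums(List<long[]> keyedValues, int fanout) {
        Map<Long, long[]> partials = new HashMap<>();
        int counter = 0;
        for (long[] kv : keyedValues) {
            long[] shards = partials.computeIfAbsent(kv[0], k -> new long[fanout]);
            shards[counter++ % fanout] += kv[1]; // round-robin shard assignment
        }
        return partials;
    }

    // Stage 2: merge the few per-shard partial sums into one value per key.
    static Map<Long, Long> mergePartials(Map<Long, long[]> partials) {
        Map<Long, Long> totals = new HashMap<>();
        for (Map.Entry<Long, long[]> entry : partials.entrySet()) {
            long sum = 0;
            for (long partial : entry.getValue()) {
                sum += partial;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<long[]> input = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            input.add(new long[] {1L, 1L}); // key 1 is the hot key
        }
        input.add(new long[] {2L, 5L});     // key 2 is a cold key

        Map<Long, Long> totals = mergePartials(partialSums(input, 8));
        System.out.println("key 1 -> " + totals.get(1L));
        System.out.println("key 2 -> " + totals.get(2L));
    }
}
```

In a real pipeline the shards would be processed by different workers, so no single worker has to see every value of the hot key; only the small partial results meet in the final merge.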
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ningk |
