Using CombineFn instead of CoGroupByKey to handle hotkeys

I have two very large PCollections, PCollection<KV<Long,XYZ>> and PCollection<KV<Long,ABC>>. I need to create a PCollection<KV<XYZ,ABC>>, which I am able to do using the CoGroupByKey.create() transform. It works fine for smaller data sets, but with hot keys it gets stuck. I am new to Beam and am trying to figure out how to use CombineFn to solve this. For now my code looks like this:

final PCollection<KV<Long, XYZ>> xyzKV;
final PCollection<KV<Long, ABC>> abcKV;
final TupleTag<XYZ> t1 = new TupleTag<>();
final TupleTag<ABC> t2 = new TupleTag<>();
// CoGroupByKey yields one CoGbkResult per key; the step that turns it into
// KV<XYZ, ABC> pairs is not shown here.
final PCollection<KV<Long, CoGbkResult>> grouped =
    KeyedPCollectionTuple.of(t1, xyzKV).and(t2, abcKV)
        .apply(CoGroupByKey.create());

// This works fine but has performance issues in case of hot keys.



Solution 1:[1]

It depends on which runner you are using and what options it offers. A similar question is "Join two large volumne of PCollection has performance issue".

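If you are using the DataflowRunner, you can enable the Dataflow Shuffle service, which moves the shuffle behind CoGroupByKey out of the worker VMs and into the Dataflow service backend, to speed up the process.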
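As a rough sketch of how the shuffle service might be switched on for a batch job: Dataflow documents the experiment flag --experiments=shuffle_mode=service for this, and in many regions the service is already the default. The helper class ShuffleServiceArgs and its method withShuffleService below are hypothetical names, not part of Beam; they only make sure the flag is present before the arguments are handed to the pipeline options. If your arguments already carry their own --experiments list, you may need to merge the values by hand.

```java
import java.util.Arrays;

// Hypothetical helper (not part of Beam): ensures the Dataflow batch
// shuffle-service experiment flag is present in the pipeline arguments.
final class ShuffleServiceArgs {
    // Experiment flag documented by Dataflow for batch pipelines.
    static final String SHUFFLE_FLAG = "--experiments=shuffle_mode=service";

    static String[] withShuffleService(String[] args) {
        // Leave the arguments untouched if the flag is already there.
        if (Arrays.asList(args).contains(SHUFFLE_FLAG)) {
            return args;
        }
        // Otherwise append the flag to a copy of the arguments.
        String[] out = Arrays.copyOf(args, args.length + 1);
        out[args.length] = SHUFFLE_FLAG;
        return out;
    }

    public static void main(String[] args) {
        String[] effective = withShuffleService(args);
        // In a real pipeline these would be passed on, e.g. to
        // PipelineOptionsFactory.fromArgs(effective).create().
        System.out.println(String.join(" ", effective));
    }
}
```

This only changes how the shuffle is executed. A single key with a very large number of values is still grouped in one place, so it may remain a bottleneck.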

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 ningk