'How do we optimise an incremental merge involving a very large target table (10 TB) and smaller incremental source table in a data lake environment?

I came across this question recently in one of the interviews and haven't been able to find a satisfying answer to this question. The incremental merge could contain new records for inserts as well as updates to older records.

I proposed using merge statement but the interviewer said that would involve scanning the entire target table which is not desired. Any better way to merge without scanning the larger table completely?



Solution 1:[1]

The best way to deal with it is to use Boradcast Join in spark. you can define property park.sql.autoBroadcastJoinThreshold and set its value to the approx size your smaller incremental source table.

So when your small Data (incremental record dataset) arrives , its copies are created and sent to each node containing large dataset. In this way network shuffle cost os saved as Large data set does not have to shuffle in Network

example : df_join = Large_dataset.join(broadcast(small_dataset)).

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Sachin Tiwari