'Is leftsemi faster with smaller table on the left side?

Join operator documentation says:

Tip

For best performance, if one table is always smaller than the other, use it as the left (piped) side of the join.

The purpose of leftsemi in most cases is to filter a bigger set on the left by a smaller set on the right. Is the quote above still applicable to leftsemi flavor of join operator?



Solution 1:[1]

At least at this point, Tables' order does matter.

Here is a quick test results, executed on my dev cluster:

Setup

.set-or-replace L100M <| range i from 1 to 100000000 step 1

.set-or-replace S1M <| range i from 1 to 1000000 step 1 | project i = tolong(rand(100000000))

rightsemi (small table first)

S1M | join kind=rightsemi L100M on i | consume

Query completes in around 3 seconds


leftsemi (large table first)

L100M | join kind=leftsemi S1M on i | consume

Query runs about 20 seconds and then fails with the following exception:

Query execution lacks memory resources to complete (80DA0007): Partial query failure: Low memory condition (E_LOW_MEMORY_CONDITION). (message: 'bad allocation', details: '').

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1