Spark - partitioning/bucketing of n tables with overlapping but not identical ids

I'm currently trying to optimize a join between two rather large tables, which are characterized like this:

Table 1: id column - alphanumeric, about 300 million unique ids, more than 1 billion rows overall
Table 2: id column - identical semantics, about 200 million unique ids, more than 1 billion rows overall

Let's say that on a given day, March 17th, I want to join those two tables on id.
Table 1 is on the left, table 2 on the right, and I get roughly 90% matches, meaning table 2 contains about 90% of the ids present in table 1.

One week later, table 1 has not changed (it could, but to keep the explanation simple, assume it didn't), while table 2 was updated and now contains more records. I run the join again, and some of the previously missing ids have now appeared, so I get about 95% matches.

In general, table1.id matches some subset of table2.id at any given time, and that subset can change from day to day.
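For reference, the join in question looks roughly like this (a minimal sketch; table and column names are simplified placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("join-example").getOrCreate()

    // table1: ~300 million unique ids; table2: ~200 million unique ids
    val table1 = spark.table("table1")
    val table2 = spark.table("table2")

    // Left join on the shared id column; the match rate varies day to day
    val joined = table1.join(table2, Seq("id"), "left")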


I now want to optimize this join and came across the bucketing feature. Is this possible here?
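What I have in mind is roughly the following (a minimal sketch; the bucket count and the bucketed table names are placeholders I made up):

    // Write both tables bucketed by id with the SAME number of buckets,
    // so that Spark can plan the join without shuffling either side.
    table1.write
      .bucketBy(512, "id")   // 512 is an arbitrary placeholder
      .sortBy("id")
      .mode("overwrite")
      .saveAsTable("table1_bucketed")

    table2.write
      .bucketBy(512, "id")   // must match table1's bucket count
      .sortBy("id")
      .mode("overwrite")
      .saveAsTable("table2_bucketed")

    // The join on the bucketed tables should then avoid the exchange
    val joined = spark.table("table1_bucketed")
      .join(spark.table("table2_bucketed"), Seq("id"), "left")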

Example: 1st join: id "ABC123" is present in table 1 but not in table 2. "ABC123" gets sorted into a certain bucket, e.g. bucket 1.
2nd join (a week later): id "ABC123" has now appeared in table 2; how can it be ensured that it lands in the corresponding bucket of table 2, so that it is co-located with the matching bucket of table 1?
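My assumption is that the bucket id is a pure function of the value alone, something like pmod(murmur3_hash(id), numBuckets), independent of which table the id sits in or when it is written. If that holds, the id would land in the matching bucket automatically, but I am not sure it does. A sketch of how I would check it (the pmod/hash expression here mirrors what I assume Spark does internally, not a documented guarantee I have verified):

    import org.apache.spark.sql.functions.{col, hash, lit, pmod}

    val numBuckets = 512  // placeholder, must match the bucketBy() call

    // If bucket assignment really is pmod(hash(id), numBuckets), then
    // "ABC123" gets the same bucket id in both tables, regardless of
    // when it first appears in table 2.
    val withBucket = spark.table("table2_bucketed")
      .withColumn("expected_bucket", pmod(hash(col("id")), lit(numBuckets)))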

Or do I have a general misunderstanding of how this works?


