'String Match in Hive
I have got two tables T1 and T2 . T1 has column C1 with all string values and similarly T2 has column C2 with all string values. The data is exact/complete in C2 and the data in C1 matches with C1 approximately. How can I compare these two string columns. Example: In T1 table, C1 has these values in each row- ArryGr BakerMem SummitHe
In T2, C2 has these values in each row- ArryGrade Bakermemory SummitHealth
Both the fields C1 and C2 has the same data but the data entered into these tables is different. Now how can I compare these two string columns of 2 different tables in hive?I need to join these two tables on T1.C1=T2.C2 and show which records have possible matches.
Solution 1:[1]
Your problem cannot be solved by a regular JOIN because of the differences in the string values. Your first option is to compare every record in T1 against every record in T2. This creates a cartesian product and has the potential to consume a ton of resources if you're not paying attention to how the data is distributed and how the cluster is setup.
I would first doublecheck that your hive settings are:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Hive has a string comparison function called levenshtein, which you will want to familiarize yourself with. You'll have to create a cutoff for how close things will need to be before you consider them "the same".
A common way to improve this, is to divide the levenshtein distance by the length of the longest string, to represent it as a percentage.
But, basically,
SELECT T1.C1, T2.C2
FROM (SELECT DISTINCT C1 FROM T1) T1,
(SELECT DISTINCT C2 FROM T2) T2
WHERE levenshtein(T1.C1, T2.C2) <= 4
Where you can adjust 4 up and down to customize how strict the match will be.
I highly recommend that you store the matches in a production table, so that you only perform this type of match once. If you approve, you store it and link the records together permanently so you don't have to redo this exercise.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Josh |
