Match percentage of 2 strings - Spark SQL

I have a requirement to check the match percentage of 2 columns from a table.

For example:

Sample data:

ColA   ColB
AAB    Aab
AACC   Aacc
WER    Wer

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

spark.udf.register('similar', similar)

Output:

similar('AAB','Aab')
Out[16]: 0.3333333333333333
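The 0.33 figure follows from how `ratio()` is defined in `difflib`: 2*M/T, where M is the number of matched characters and T is the combined length of both strings. Matching is case-sensitive, so only the leading 'A' of 'AAB' lines up with 'Aab':

```python
from difflib import SequenceMatcher

# ratio() = 2*M/T. Here M = 1 (only the first 'A' matches, since
# 'a' != 'A' and 'b' != 'B') and T = 3 + 3 = 6, giving 2*1/6 = 0.333...
print(SequenceMatcher(None, 'AAB', 'Aab').ratio())  # 0.3333333333333333
```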

I am able to achieve the requirement using the SequenceMatcher library, but I am not able to use that function inside Spark SQL and get the error below. Is there any other way to achieve the same result?

df = spark.sql("SELECT ColA, ColB, similar(ColA, ColB) FROM test")
display(df)

Error: PythonException: 'AttributeError: 'SequenceMatcher' object has no attribute 'matching_blocks''



Solution 1:[1]

• SequenceMatcher accepts two strings plus an optional junk criterion.

• If either input string is None, this error occurs; empty strings work fine.

Make sure None inputs are replaced by empty strings.
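Following that advice, a None-safe version of the UDF might look like this (a sketch; the registration step assumes an active SparkSession named `spark`, as in the question):

```python
from difflib import SequenceMatcher

def similar(a, b):
    # Coalesce NULL inputs to empty strings: SequenceMatcher raises
    # AttributeError on None, but handles '' without issue.
    return SequenceMatcher(None, a or '', b or '').ratio()

# Re-register under the same name so the original query keeps working
# (assumes an active SparkSession `spark`):
# spark.udf.register('similar', similar)
```

Alternatively, the NULLs can be handled on the SQL side, e.g. `similar(coalesce(ColA, ''), coalesce(ColB, ''))`, leaving the original UDF unchanged.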

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: AbhishekKhandave-MT