How to count letter-based similarity on a pandas DataFrame
Here's my first dataframe df1
Id Text
1 dFn
2 fiqe
3 raUw
Here's my second dataframe df2
Id Text
1 yuw
2 dnag
Expected similarity matrix: the columns are the Id values from df1, the rows are the Id values from df2
1 2 3
1 0 0 0.66
2 0.5 0 0.25
Note:
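Positions are given as (column, row), i.e. (Id from df1, Id from df2)
The value is 0 in (1,1), (2,1) and (2,2) because no letters are shared
The value 0.25 in (3,2) is because only 1 letter from raUw is available in the 4 letters of 'dnag' (1/4 equals 0.25)
The value 0.5 in (1,2) is because 2 of the 4 letters of 'dnag' also appear in dFn
The value 0.66 in (3,1) is because 2 of the 3 letters of 'yuw' also appear in raUw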
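The counting rule described in the note can be sketched for a single pair of strings. The function name `letter_similarity` and its parameter names are hypothetical, and the sketch assumes that matching ignores case and that the divisor is the full length of the df2 text (repeated letters are not handled specially, since the sample data contains none).

```python
def letter_similarity(reference: str, other: str) -> float:
    """Share of letters in `reference` that also occur in `other`.

    Hypothetical helper: case-insensitive, divides by len(reference).
    """
    if not reference:
        # Assumption: an empty reference string scores 0 rather than raising.
        return 0.0
    shared = set(reference.lower()) & set(other.lower())
    return len(shared) / len(reference)


# 'yuw' (df2) against 'raUw' (df1): u and w are shared, so 2 of 3 letters.
print(letter_similarity("yuw", "raUw"))
# 'dnag' (df2) against 'dFn' (df1): d and n are shared, so 2 of 4 letters.
print(letter_similarity("dnag", "dFn"))
```

The df2 text goes first because its length is the divisor in every example given in the note.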
Solution 1:[1]
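IIUC, one option is to use set intersection (the & operator, equivalent to set.intersection) in a nested list comprehension, dividing the number of shared letters by the length of each text in df2: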
out = pd.DataFrame([[len(set(x.lower()) & set(y.lower())) / len(x) for y in df1['Text'].tolist()] for x in df2['Text'].tolist()])
Output:
0 1 2
0 0.0 0.0 0.666667
1 0.5 0.0 0.250000
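For a version that runs end to end, the sketch below rebuilds df1 and df2 from the question and applies the same comprehension. Labeling the result with the Id values, so that it matches the layout of the expected matrix, is an addition that is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"Id": [1, 2, 3], "Text": ["dFn", "fiqe", "raUw"]})
df2 = pd.DataFrame({"Id": [1, 2], "Text": ["yuw", "dnag"]})

# One row per df2 text, one column per df1 text:
# shared letters (case-insensitive) divided by the length of the df2 text.
rows = [
    [len(set(x.lower()) & set(y.lower())) / len(x) for y in df1["Text"]]
    for x in df2["Text"]
]

# Assumption: label rows with df2 Ids and columns with df1 Ids.
out = pd.DataFrame(rows, index=df2["Id"].tolist(), columns=df1["Id"].tolist())
print(out.round(2))
```

With these labels, `out.loc[2, 1]` reads the similarity between df2 Id 2 and df1 Id 1.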
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
