Identifying common elements in a list of words
I have a list of words in a column where I need to find common elements. For example, the list contains words such as:
sinazz31 sinazz12 45sinazz sinazz_84
As you can see, the common element is "sinazz". Is there a way to develop an algorithm in Python to identify such common elements? If a word is shorter than 4 characters, it can be ignored.
Solution 1:[1]
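You could search for substrings that several of the source strings share. The code below takes its candidates from the longest word, starting with the full word and working down to a minimum length, and counts how many words contain each candidate: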
string = 'sinazz31 sinazz12 45sinazz sinazz_84'
min_substring_length = 3
words = string.split()
longest_word = max(filter(None, words), key=len)
matches = {}
for sub_length in range(len(longest_word), min_substring_length - 1, -1):
    for x in range(len(longest_word) - sub_length + 1):  # + 1 so the last window is checked too
        substring = longest_word[x:x + sub_length]  # create substring to check
        check = sum(1 for word in words if substring in word)  # number of words containing substring
        if check > 1:
            matches[substring] = check
# results
if matches:
    match_list = sorted(matches, key=matches.get, reverse=True)  # list of matches by frequency
    if matches[match_list[0]] == len(words):  # prints substring if it matches all words
        print('best match for all words:', match_list[0])
    print('best to worst:', match_list)
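If only substrings that appear in every word are of interest, the candidates can be drawn from the shortest word instead, since no common substring can be longer than it. Here is a minimal sketch of that variant. The helper name `longest_common_substring` and the `min_length` parameter are made up for illustration, and it assumes that words shorter than `min_length` should simply be skipped, as the question suggests:

```python
def longest_common_substring(words, min_length=4):
    """Return the longest substring found in every word, or None if there is none."""
    # Assumption: words shorter than min_length are ignored entirely.
    words = [w for w in words if len(w) >= min_length]
    if not words:
        return None
    shortest = min(words, key=len)
    # Try the longest candidates first so the first hit is the longest one.
    for length in range(len(shortest), min_length - 1, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in word for word in words):
                return candidate
    return None


print(longest_common_substring('sinazz31 sinazz12 45sinazz sinazz_84'.split()))
```

Because it returns on the first candidate found in all words, it stops early, but it only ever reports one substring and reports nothing when a single word lacks the common part.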
Solution 2:[2]
Have a look at this similar question: (Find most common substring in a list of strings?)
I added the condition that a match is ignored if it is shorter than 4 characters.
from difflib import SequenceMatcher

substring_counts = {}
words = ['sinazz31', 'sinazz12', '45sinazz', 'sinazz_84']  # renamed from `list` to avoid shadowing the built-in
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        string1 = words[i]
        string2 = words[j]
        match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))
        matching_substring = string1[match.a:match.a + match.size]
        if len(matching_substring) > 3:  # ignore matches shorter than 4 characters
            # .get avoids a KeyError the first time a substring is seen
            substring_counts[matching_substring] = substring_counts.get(matching_substring, 0) + 1
print(substring_counts)
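Note that the counts here are per pair of words, not per word. To pull out the most frequent shared substring, the same pairwise idea can be wrapped in a function built on `collections.Counter` and `itertools.combinations`. This is only a sketch: the function name `pairwise_common_substrings` and the `min_length` parameter are assumptions for illustration, and `autojunk=False` is passed just to switch off `SequenceMatcher`'s popularity heuristic for longer inputs:

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_common_substrings(words, min_length=4):
    """Count the longest shared substring of every pair of words."""
    counts = Counter()
    for first, second in combinations(words, 2):
        matcher = SequenceMatcher(None, first, second, autojunk=False)
        match = matcher.find_longest_match(0, len(first), 0, len(second))
        if match.size >= min_length:  # skip matches shorter than min_length
            counts[first[match.a:match.a + match.size]] += 1
    return counts


counts = pairwise_common_substrings(['sinazz31', 'sinazz12', '45sinazz', 'sinazz_84'])
print(counts.most_common(1))  # [('sinazz', 6)]: all six pairs share 'sinazz'
```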
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | timothyh |
| Solution 2 | |
