Terminology matching in text
I have a list of terms as below:
a
abc
a abc
a a abc
abc
I want to match these terms in a text and replace each one with a name such as "term1", "term2". When several terms match at the same place, the longest match should be treated as the correct one.
Text: I have a and abc maybe abc again and also a a abc.
Output: I have term1 and term2 maybe term2 again and also a term3.
So far I have used the code below, but it does not find the longest match:
for x in terms:
    if x in text:
        ...  # do something with the match
Solution 1:[1]
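You can use re.sub, replacing the longest terms first:
import re
words = ["a",
         "abc",
         "a abc",
         "a a abc"
]
test_str = "I have a and abc maybe abc again and also a a abc."
# Process the longest terms first so that shorter terms cannot split them.
for word in sorted(words, key=len, reverse=True):
    # Raw string: "\1" in a normal string would be the control character \x01.
    term = r"\1term%i\2" % (words.index(word) + 1)
    test_str = re.sub(r"(\b)%s(\b)" % re.escape(word), term, test_str)
print(test_str)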
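This produces the expected result. (The example output in the question contains a mistake: "a a abc" is the fourth term, so the longest match there is term4, not "a term3".)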
Input: I have a and abc maybe abc again and also a a abc.
Output: I have term1 and term2 maybe term2 again and also term4.
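To illustrate why the terms are sorted by length before substituting, here is a minimal sketch that runs the same loop in both orders. The helper name replace_terms and its longest_first flag are hypothetical and only serve the comparison; they are not part of the answer above.

```python
import re

def replace_terms(text, words, longest_first=True):
    """Replace each term with "termN", where N is its 1-based list position."""
    # Hypothetical helper: the flag only switches the substitution order.
    for word in sorted(words, key=len, reverse=longest_first):
        label = "term%d" % (words.index(word) + 1)
        # re.escape guards against terms that contain regex metacharacters.
        text = re.sub(r"\b%s\b" % re.escape(word), label, text)
    return text

words = ["a", "abc", "a abc", "a a abc"]
text = "I have a and abc maybe abc again and also a a abc."

longest = replace_terms(text, words, longest_first=True)
shortest = replace_terms(text, words, longest_first=False)
print(longest)   # I have term1 and term2 maybe term2 again and also term4.
print(shortest)  # I have term1 and term2 maybe term2 again and also term1 term1 term2.
```

With the shortest terms first, every "a" and "abc" is consumed before the multi-word terms get a chance to match, so "a a abc" is never recognised as a unit.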
Solution 2:[2]
Alternatively, build a single pattern and pass a replacement function to re.sub:
import re
text = 'I have a and abc maybe abc again and also a a abc'
words = ['a', 'abc', 'a abc', 'a a abc']
# Longest alternatives first: the regex engine tries alternatives left to right.
regex = re.compile(r'\b' + r'\b|\b'.join(sorted(words, key=len, reverse=True)) + r'\b')
def replacer(m):
    print('replacing : %s' % m.group(0))
    return 'term%d' % (words.index(m.group(0)) + 1)
print(re.sub(regex, replacer, text))
Result:
replacing : a
replacing : abc
replacing : abc
replacing : a a abc
I have term1 and term2 maybe term2 again and also term4
Or use an anonymous replacer:
print(re.sub(regex, lambda m: 'term%d' % (words.index(m.group(0)) + 1), text))
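A possible variation on the single-pass approach, sketched under two assumptions not stated in the answer: the term list may be long (so a dictionary lookup replaces the repeated words.index scan), and terms may contain regex metacharacters (so each one is escaped). The names build_term_regex and term_ids are hypothetical.

```python
import re

def build_term_regex(words):
    """Compile one alternation that prefers the longest term at each position."""
    # dict.fromkeys drops duplicate terms while keeping their original order.
    ordered = sorted(dict.fromkeys(words), key=len, reverse=True)
    pattern = "|".join(re.escape(w) for w in ordered)
    return re.compile(r"\b(?:%s)\b" % pattern)

words = ["a", "abc", "a abc", "a a abc"]

# Map each term to its 1-based position; the first occurrence wins on duplicates.
term_ids = {}
for position, word in enumerate(words, start=1):
    term_ids.setdefault(word, position)

regex = build_term_regex(words)
text = "I have a and abc maybe abc again and also a a abc"
result = regex.sub(lambda m: "term%d" % term_ids[m.group(0)], text)
print(result)  # I have term1 and term2 maybe term2 again and also term4
```

Because the whole text is scanned once, a replacement label such as "term1" can never be matched again by a later term, which is a risk with the loop-based approach if a term happens to occur inside a label.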
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Taku |
| Solution 2 |
