Regular expression: How to match a list of words (allow combination)?
I'm trying to construct a regular expression to capture units and the corresponding values.
For example,
import re
candis = ['mmol','mm']
test_reg = '|'.join([ut+r"\-?[1-4]?" for ut in candis])
test_reg = r"\b(?:" + test_reg + r")\b"
test_reg = r"\d (?:" + test_reg + r"\s?){1,3}"
test_str = '3 mmol mm'
re.findall(test_reg,test_str)
Here test_reg is constructed to capture the units mmol mm together with the corresponding value 3.
However, as the example shows, test_reg does not work for a string like 3 mmol2mm, because the \b on each side of a unit requires a word boundary and there is none between 2 and mm.
How can I construct a regular expression that also matches 3 mmol2mm and 3 mmolmm, while only accepting word combinations made strictly from candis? (3 mmol mmb should not match.)
Solution 1:[1]
You can use
\d+(?=((?:\s*(?:mmol|mm)-?[1-4]?){1,3}))\1\b
See the regex demo. Details:
- \d+ - one or more digits
- (?=((?:\s*(?:mmol|mm)-?[1-4]?){1,3})) - a positive lookahead with a capturing group inside, used to imitate an atomic group; it matches a location that is immediately followed by
  - (?:\s*(?:mmol|mm)-?[1-4]?){1,3} - one, two or three occurrences of
    - \s* - zero or more whitespace characters
    - (?:mmol|mm) - a candis value
    - -? - an optional - character
    - [1-4]? - an optional digit from 1 to 4
- \1 - the Group 1 value (backreferences do not allow backtracking)
- \b - a word boundary.
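To see why the lookahead plus backreference matters, the sketch below compares it with the same unit pattern written as an ordinary group. The names units, naive and atomic_like are illustrative only and do not come from the answer. On 3 mmol mmb the ordinary group can give back the trailing mm and settle for 3 mmol, whereas the lookahead version captures once and cannot retry.

```python
import re

# The same unit pattern, used once as a plain group and once behind the lookahead + \1 trick.
units = r"(?:\s*(?:mmol|mm)-?[1-4]?){1,3}"
naive = re.compile(r"\d+" + units + r"\b")
atomic_like = re.compile(r"\d+(?=(" + units + r"))\1\b")

text = "3 mmol mmb"

# The plain group backtracks: it drops ' mm' and accepts the shorter '3 mmol'.
naive_match = naive.search(text)
# The lookahead captures ' mmol mm' once; \b then fails before 'b' and nothing can be retried.
atomic_match = atomic_like.search(text)

print(naive_match.group() if naive_match else None)    # 3 mmol
print(atomic_match.group() if atomic_match else None)  # None
```

Python 3.11 and later also accept a native atomic group, (?>...), which should be usable here in place of the lookahead and backreference; the form above also works on older versions.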
See the Python demo:
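import re
candis = ['mmol', 'mm']
# {} receives the alternation of units; {{1,3}} is an escaped literal {1,3}
test_reg = r"\d+(?=((?:\s*(?:{})-?[1-4]?){{1,3}}))\1\b".format('|'.join(candis))
test_str = '3 mmol mm 3 mmol2mm and 3 mmolmm AND NOT 3 mmol mmb'
# re.finditer is used instead of re.findall, which would return only the Group 1 text
print([x.group() for x in re.finditer(test_reg, test_str)])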
Output:
['3 mmol mm', '3 mmol2mm', '3 mmolmm']
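If candis is not hand-written, a slightly more defensive builder may be worth having. The following is only a sketch: build_unit_regex and its max_units parameter are hypothetical names, not part of the answer. It escapes each unit with re.escape and sorts the units longest first, because the atomic-like group never retries a shorter alternative: with mm listed before mmol, the text 3 mmol would not match. It also puts the value in its own group, so the backreference becomes \2.

```python
import re

def build_unit_regex(units, max_units=3):
    """Hypothetical helper: compile the lookahead-based pattern for any list of units."""
    # Longest first, so 'mmol' is tried before its prefix 'mm'; re.escape guards metacharacters.
    alternation = "|".join(re.escape(u) for u in sorted(units, key=len, reverse=True))
    # Group 1 is the value, group 2 the unit run captured inside the lookahead.
    pattern = r"(\d+)(?=((?:\s*(?:%s)-?[1-4]?){1,%d}))\2\b" % (alternation, max_units)
    return re.compile(pattern)

unit_re = build_unit_regex(['mm', 'mmol'])  # deliberately given shortest first
test_str = '3 mmol mm 3 mmol2mm and 3 mmolmm AND NOT 3 mmol mmb'

# Pair each value with its units, stripping the leading whitespace kept in group 2.
pairs = [(m.group(1), m.group(2).strip()) for m in unit_re.finditer(test_str)]
print(pairs)
# [('3', 'mmol mm'), ('3', 'mmol2mm'), ('3', 'mmolmm')]
```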
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Wiktor Stribiżew |
