Regular expression to split strings that follow different patterns
I am trying to split strings into multiple strings using regex. I have strings like the following:
'1. 10.25% 2. 11% 3. 9.75% 4. 4.3%'
'1.promising.2.inappropriately3.essential.4.intense.'
'1. He has not been attending 2. English classes 3. since one month4. No error'
'1. X got 15 shares2. B got 25 shares3. W got 54. shares.4. Mark got 2.5 shares'
I am expecting output like this:
'1. X got 15 shares', '2. B got 25 shares', '3. W got 54. shares.', '4. Mark got 2.5 shares'
'1. 10.25%'
'2. 11% '
'3. 9.75%'
' 4. 4.3%'
I want to write a single expression that splits all of the given scenarios. I tried the following expression, but it fails in some cases:
re.split(r'(?=[1-9]{1}\.[\s]?[a-zA-Z0-9\.\:\(\)\-\,\% ]+)', string)
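To make the failure concrete, here is a minimal sketch that runs the attempted pattern against the fourth sample string. It assumes Python 3.7 or later, where re.split also splits on zero-width matches such as a lookahead; the variable names are only illustrative.

```python
import re

# The lookahead from the attempt above, unchanged.
attempt = r'(?=[1-9]{1}\.[\s]?[a-zA-Z0-9\.\:\(\)\-\,\% ]+)'
text = '1. X got 15 shares2. B got 25 shares3. W got 54. shares.4. Mark got 2.5 shares'

# Any digit followed by a dot is treated as a split point, so the "4." inside
# "54." and the "2." inside "2.5" also cut the string.
pieces = re.split(attempt, text)
print(pieces)
# ['', '1. X got 15 shares', '2. B got 25 shares', '3. W got 5',
#  '4. shares.', '4. Mark got ', '2.5 shares']
```

The leading empty string comes from the zero-width match at position 0.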
Solution 1:[1]
I'd suggest searching for each consecutive number with a (?<!\d)NUM\. pattern (NUM followed by a . and not preceded by another digit) and splitting only at those positions:
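import re

texts = ['1. 10.25% 2. 11% 3. 9.75% 4. 4.3%',
         '1.promising.2.inappropriately3.essential.4.itense.',
         '1. He has not been attending 2. English classes 3. since one month4. No error',
         '1. X got 15 shares2. B got 25 shares3. W got 54. shares.4. Mark got 2.5 shares']

# {} is replaced with the bullet number; (?<!\d) rejects a number that is
# only the tail of a longer one, such as the 4 in "54."
pattern = r'(?<!\d){}\.'

for text in texts:
    bps = []   # the split parts ("bullet points")
    prev = 0   # start position of the previous bullet
    for i in range(1, 1000):
        rx = re.compile(pattern.format(i))
        m = rx.search(text, prev)  # search from the previous bullet onwards
        if m:
            if prev != m.start():
                bps.append(text[prev:m.start()].strip())
            prev = m.start()
        else:
            break  # the next number is missing, so stop looking
    # Add whatever follows the last bullet found
    if prev < len(text) - 1:
        bps.append(text[prev:].strip())
    print(bps)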
See the Python demo.
Output:
['1. 10.25%', '2. 11%', '3. 9.75%', '4. 4.3%']
['1.promising.', '2.inappropriately', '3.essential.', '4.itense.']
['1. He has not been attending', '2. English classes', '3. since one month', '4. No error']
['1. X got 15 shares', '2. B got 25 shares', '3. W got 54. shares.', '4. Mark got 2.5 shares']
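As a small illustration of what the (?<!\d) lookbehind contributes, the sketch below compares a search for "4." with and without it on a fragment of the last sample string. The fragment and variable names are chosen here only for demonstration.

```python
import re

fragment = 'W got 54. shares.4. Mark'

# Without the lookbehind, the "4." inside "54." is found first.
plain = re.search(r'4\.', fragment)

# With (?<!\d), a "4." preceded by a digit is skipped, so the match
# lands on the real bullet after "shares."
guarded = re.search(r'(?<!\d)4\.', fragment)

print(plain.start(), guarded.start())  # 7 17
```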
Note the rx = re.compile(pattern.format(i)) and m = rx.search(text, prev) lines: the pattern is compiled because the Pattern.search method, unlike the module-level re.search function, accepts a start position for the search, which here is the start position of the previous match.
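A short sketch of that difference, using a shortened sample string: Pattern.search takes an optional pos argument and reports positions relative to the whole string, whereas slicing the string before calling re.search shifts the reported indices.

```python
import re

text = '1. X got 15 shares2. B got 25 shares'
rx = re.compile(r'(?<!\d)2\.')

# Compiled pattern: start scanning at index 3, index is relative to text.
m = rx.search(text, 3)
print(m.start())   # 18

# Module-level function has no start-position parameter; slicing works,
# but the index is now relative to the slice and must be shifted back.
m2 = re.search(r'(?<!\d)2\.', text[3:])
print(m2.start())  # 15
```

Keeping the indices relative to the original string is what lets the loop slice text[prev:m.start()] directly.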
The range(1,1000) part can be adjusted: the upper bound of 1000 assumes the text has at most 999 bullet points.
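If a fixed upper bound is undesirable, one possible variation (not part of the original answer) is to count upwards with itertools.count and rely on the break at the first missing number. The function name split_numbered is hypothetical, and the tail check is simplified to skip an empty remainder.

```python
import re
from itertools import count


def split_numbered(text):
    """Split text before each consecutive bullet number 1., 2., 3., ..."""
    parts = []
    prev = 0
    for i in count(1):  # no fixed cap; the loop ends at the first missing number
        m = re.compile(r'(?<!\d){}\.'.format(i)).search(text, prev)
        if m is None:
            break
        if m.start() != prev:
            parts.append(text[prev:m.start()].strip())
        prev = m.start()
    tail = text[prev:].strip()
    if tail:  # keep whatever follows the last bullet found
        parts.append(tail)
    return parts


print(split_numbered('1. X got 15 shares2. B got 25 shares3. W got 54. shares.4. Mark got 2.5 shares'))
# ['1. X got 15 shares', '2. B got 25 shares', '3. W got 54. shares.', '4. Mark got 2.5 shares']
```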
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Wiktor Stribiżew |
