Regex: drop numbers with some symbols
I am trying to clean my text, so I need to remove some numbers and some combinations of numbers and symbols.
I have a string:
s = '4/13/2022 2:20:03 pm from our side a more detailed analysis4 +7 (495) 797-8700 77-8282'
And I want to get
'pm from our side a more detailed analysis4'
I tried to use
re.compile(r'\b(?:/|-|\+|\:)(\d+)\b').sub(r' ', s)
but it returns
'4 2 pm from our side a more detailed analysis4 +7 (495) 797 77 '
What am I doing wrong, and how can I drop just the numbers and the combinations of numbers and specific symbols?
Solution 1:[1]
You could match at least one non-word character surrounded by optional digits, and then trim the result:
(?<!\S)\d*(?:[^\w\s]+\d*)+\s*
Explanation
(?<!\S)  Assert a whitespace boundary to the left
\d*  Match optional digits
(?:[^\w\s]+\d*)+  Match, one or more times, at least one non-word character followed by optional digits
\s*  Match optional whitespace characters
import re
pattern = r"(?<!\S)\d*(?:[^\w\s]+\d*)+\s*"
s = "4/13/2022 2:20:03 pm from our side a more detailed analysis4 +7 (495) 797-8700 77-8282 kl-1381033 substr1.substr2.ab-2021-44228.a"
print(re.sub(pattern, "", s))
Output
pm from our side a more detailed analysis4 kl-1381033 substr1.substr2.ab-2021-44228.a
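As a minimal sketch of the "trim the result" step, assuming the original string from the question: each match swallows the whitespace after it, but the space in front of the first removed token at the end belongs to no match, so one trailing space is left over. The names raw and cleaned are only illustrative.

```python
import re

pattern = r"(?<!\S)\d*(?:[^\w\s]+\d*)+\s*"
s = "4/13/2022 2:20:03 pm from our side a more detailed analysis4 +7 (495) 797-8700 77-8282"

# The space before "+7" is not part of any match, so it survives
# as a trailing space until strip() removes it.
raw = re.sub(pattern, "", s)
cleaned = raw.strip()

print(repr(raw))
print(repr(cleaned))
```

Without the strip() call the result would not compare equal to the string the question asks for.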
Solution 2:[2]
Try this expression:
(?:\/|-|\+|\:|^|\(|\)| ) ?(\d+)
You can test it here: https://regex101.com/r/aANxBR/1
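A rough sketch of how this expression could be applied in Python, assuming the string from the question and an added strip() call (the answer itself gives only the pattern). The second call is a hypothetical extra input, included to show a caveat: digits that follow a hyphen are removed even when they sit inside a word.

```python
import re

pattern = re.compile(r"(?:\/|-|\+|\:|^|\(|\)| ) ?(\d+)")
s = "4/13/2022 2:20:03 pm from our side a more detailed analysis4 +7 (495) 797-8700 77-8282"

# Spaces next to the removed tokens are left behind, so strip the result.
cleaned = pattern.sub("", s).strip()
print(cleaned)

# Caveat: "-1381033" matches the pattern, so only "kl" is left.
print(pattern.sub("", "kl-1381033"))
```

If tokens such as kl-1381033 have to be preserved, a pattern anchored to whitespace boundaries is probably the safer choice.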
Solution 3:[3]
It appears you want to remove words that start with digits and symbols.
You could do:
>>> import re
>>> s = '4/13/2022 2:20:03 pm from our side a more detailed analysis4 +7 (495) 797-8700 77-8282 kl-1381033 substr1.substr2.ab-2021-44228.a'
>>> ' '.join(w for w in s.split() if not re.match(r'[\d(+]\S+', w))
'pm from our side a more detailed analysis4 kl-1381033 substr1.substr2.ab-2021-44228.a'
Here is a pure-Python alternative without regex:
>>> bad_start = '0123456789+('
>>> ' '.join(w for w in s.split() if w[0] not in bad_start)
'pm from our side a more detailed analysis4 kl-1381033 substr1.substr2.ab-2021-44228.a'
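The two variants agree on the string above, but they may differ on single-character tokens. The sketch below uses a made-up sample string (not from the question) to show this: the regex needs at least two characters to match, while the first-character check does not.

```python
import re

bad_start = "0123456789+("
s = "call 7 or 42 now"  # hypothetical input for illustration

# Regex version: [\d(+]\S+ requires a second character, so "7" is kept.
by_regex = " ".join(w for w in s.split() if not re.match(r"[\d(+]\S+", w))

# First-character version: only w[0] is checked, so "7" is dropped too.
by_first_char = " ".join(w for w in s.split() if w[0] not in bad_start)

print(by_regex)
print(by_first_char)
```

Changing \S+ to \S* in the regex would presumably bring the two in line, if lone digits should also be removed.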
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | LCMa |
| Solution 3 | |
