Split string until a 5-7 digit number is found in Python
I have strings like the following:
1338516 -...pair - 5pk 1409093 -...re Wax 3Pk
1409085 -...dtnr - 5pk 1415090 -...accessories
490663 - 3 pack 1490739 -...2 - 3 pack
What I'm trying to do is split each string so that the first part is 1338516 -...pair - 5pk and the second part is 1409093 -...re Wax 3Pk.
Currently, I'm able to extract the numbers using the following code:
import re
lst = list(filter(lambda k: '...' in k, reqText))
lst1 = ''.join(lst)
numbers = re.findall(r'\d+', lst1)
numbers1 = [x for x in numbers if len(x) > 3]
Any suggestions?
Solution 1:[1]
You could use split with a pattern:
[^\S\n]+(?=\d{5,7}\b)
Explanation
[^\S\n]+ - Match 1 or more whitespace characters other than a newline
(?=\d{5,7}\b) - Positive lookahead, assert 5-7 digits to the right followed by a word boundary
import re
pattern = r"[^\S\n]+(?=\d{5,7}\b)"
lst = [
"1338516 -...pair - 5pk 1409093 -...re Wax 3Pk",
"1409085 -...dtnr - 5pk 1415090 -...accessories",
"490663 - 3 pack 1490739 -...2 - 3 pack"
]
for s in lst:
    print(re.split(pattern, s))
Output
['1338516 -...pair - 5pk', '1409093 -...re Wax 3Pk']
['1409085 -...dtnr - 5pk', '1415090 -...accessories']
['490663 - 3 pack', '1490739 -...2 - 3 pack']
Another option could be a matching approach:
\b\d{5,7}\b.*?(?=[^\S\n]+\d{5,7}\b|$)
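As a minimal sketch of how this matching pattern could be used, re.findall can collect each piece directly. The variable names samples and results are only illustrative; the pattern and the strings are the ones from this thread.

```python
import re

# Matching pattern from the answer: a 5-7 digit number, then as little text
# as possible up to the next such number (or the end of the string).
pattern = r"\b\d{5,7}\b.*?(?=[^\S\n]+\d{5,7}\b|$)"

# Sample strings from the question ("samples" is an illustrative name).
samples = [
    "1338516 -...pair - 5pk 1409093 -...re Wax 3Pk",
    "1409085 -...dtnr - 5pk 1415090 -...accessories",
    "490663 - 3 pack 1490739 -...2 - 3 pack",
]

# findall returns every non-overlapping match as a list of strings.
results = [re.findall(pattern, s) for s in samples]
for parts in results:
    print(parts)
```

Note that, unlike re.split, this approach only returns the matched pieces, so any text before the first 5-7 digit number would be left out.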
Solution 2:[2]
You can use
^(.+?)\s*\b(\d{5,7}\b.*)
In Python, use a raw string literal to declare this regex:
pattern = r'^(.+?)\s*\b(\d{5,7}\b.*)'
Details:
^ - start of string
(.+?) - Group 1: one or more (but as few as possible) occurrences of any char other than line break chars
\s* - zero or more whitespaces
\b - a word boundary
(\d{5,7}\b.*) - Group 2: a five- to seven-digit number, a word boundary and the rest of the line.
See a Python demo:
import re
text = "1338516 -...pair - 5pk 1409093 -...re Wax 3Pk"
pattern = r'^(.+?)\s*\b(\d{5,7}\b.*)'
m = re.search(pattern, text)
if m:
    print(m.group(1)) # => 1338516 -...pair - 5pk
    print(m.group(2)) # => 1409093 -...re Wax 3Pk
If you need to use it in a Pandas dataframe, you can use
df[['result_col_1', 'result_col_2']] = df['source'].str.extract(pattern, expand=True)
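For illustration, here is a small self-contained sketch of that Pandas call. The dataframe below is made up from the question's sample strings, and the column names source, result_col_1 and result_col_2 simply follow the snippet above; they are assumptions, not the asker's actual data.

```python
import pandas as pd

pattern = r'^(.+?)\s*\b(\d{5,7}\b.*)'

# Hypothetical dataframe built from the sample strings in the question.
df = pd.DataFrame({
    "source": [
        "1338516 -...pair - 5pk 1409093 -...re Wax 3Pk",
        "1409085 -...dtnr - 5pk 1415090 -...accessories",
        "490663 - 3 pack 1490739 -...2 - 3 pack",
    ]
})

# str.extract returns one column per capture group; assign them to two new columns.
df[["result_col_1", "result_col_2"]] = df["source"].str.extract(pattern, expand=True)
print(df[["result_col_1", "result_col_2"]])
```

Rows where the pattern does not match should end up with NaN in both result columns, so it may be worth checking for missing values afterwards.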
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
