return two values regex python
I have a data frame where the values I want are in the same cell like this - words and all:
depth: 3230 m - 3750 m
I'm trying to write a regex that extracts the first number and then the second into a new data frame. So far, I can get the values with this:
top_depthdf=df[0].str.extract(r'depth:\s(\d+(?:\.\d+)?)', flags=re.I).astype(float)
base_depthdf=df[0].str.extract(r'-\s(\d+(?:\.\d+)?)', flags=re.I).astype(float)
The problem is that these patterns are not unique in this data, especially the base depth one. Other numbers follow a similar pattern, and my script returns them instead of the base depth when they occur before the depth row. Is there a way to write base_depthdf so that it looks for the 'depth:' part first and then matches the number pattern after it?
Solution 1:[1]
You can capture these numbers with two named capturing groups into two columns at once:
df_depth = df[0].str.extract(r'depth:\s*(?P<top_depth>\d+(?:\.\d+)?)(?:\s*\w+)?\s*-\s*(?P<base_depth>\d+(?:\.\d+)?)')
The (?P<top_depth>...) and (?P<base_depth>...) named groups capture the two numbers into separate columns, and the group names become the column names.
I used (?:\s*\w+)?\s* to match a single optional word (here, the unit m) between the two numbers, but you can use .*? instead if you are not sure what can appear between them:
df_depth = df[0].str.extract(r'depth:\s*(?P<top_depth>\d+(?:\.\d+)?).*?-\s*(?P<base_depth>\d+(?:\.\d+)?)')
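To illustrate the difference between the two patterns, here is a minimal sketch. The sample rows, the helper variable `num`, and the names `strict` and `loose` are made up for this illustration and are not from the original question. The second row has two words between the numbers, so the single-optional-word pattern should fail on it while the `.*?` variant still matches:

```python
import pandas as pd

# Hypothetical rows: the second one has two words ("m MD") between the numbers.
s = pd.Series(["depth: 3230 m - 3750 m", "depth: 3230 m MD - 3750 m"])

# Shared sub-pattern for an integer or decimal number.
num = r"\d+(?:\.\d+)?"

# Allows at most one word between the two numbers.
strict = rf"depth:\s*(?P<top_depth>{num})(?:\s*\w+)?\s*-\s*(?P<base_depth>{num})"
# Allows anything (lazily) between the first number and the hyphen.
loose = rf"depth:\s*(?P<top_depth>{num}).*?-\s*(?P<base_depth>{num})"

strict_df = s.str.extract(strict)  # second row does not match, so it is NaN
loose_df = s.str.extract(loose)    # both rows match

print(strict_df)
print(loose_df)
```

The stricter pattern is less likely to match unrelated text, so it may be the safer choice when the cell format is known to be consistent.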
Pandas test:
import pandas as pd
df = pd.DataFrame({'c':['depth: 3230 m - 3750 m']})
df_depth = df['c'].str.extract(r'depth:\s*(?P<top_depth>\d+(?:\.\d+)?)(?:\s*\w+)?\s*-\s*(?P<base_depth>\d+(?:\.\d+)?)')
print(df_depth.to_string())
Output:
  top_depth base_depth
0      3230       3750
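The question also mentions rows with similar-looking numbers and a conversion to float. The following is a rough sketch of how that could be handled, assuming (this is not confirmed by the question) that non-depth rows never contain the text 'depth:'. The sample rows are invented for illustration. Because the pattern is anchored on 'depth:', rows without it produce NaN and can be dropped before converting:

```python
import re
import pandas as pd

# Hypothetical sample data: the first row has a "- number" pattern
# but is not a depth row.
s = pd.Series([
    "sample - 12 of 40",
    "depth: 3230 m - 3750 m",
    "Depth: 100.5 m - 250.25 m",
])

pattern = r"depth:\s*(?P<top_depth>\d+(?:\.\d+)?).*?-\s*(?P<base_depth>\d+(?:\.\d+)?)"

# flags=re.I makes "depth:" case-insensitive, as in the question's code.
depths = s.str.extract(pattern, flags=re.I)

# Rows without "depth:" come back as NaN; drop them, then convert to float.
depths = depths.dropna().astype(float)

print(depths)
```

The original index is preserved by dropna(), so the result can still be aligned with the source data frame if needed.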
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Wiktor Stribiżew |
