Return two values with a regex in Python

I have a data frame where the values I want share a single cell, words and all, like this:

depth: 3230 m - 3750 m

I'm trying to write a regex that returns the first number and then the second into a new data frame. So far, I can get the values with this:

 top_depthdf=df[0].str.extract(r'depth:\s(\d+(?:\.\d+)?)', flags=re.I).astype(float)
 base_depthdf=df[0].str.extract(r'-\s(\d+(?:\.\d+)?)', flags=re.I).astype(float)

The issue is that these patterns are not unique in this data, especially the base depth one. Other numbers follow a similar pattern, and my script returns them instead of the base depth when they occur before the depth row. Is there a way to write base_depthdf so that it looks for the 'depth:' part first and only then looks for that pattern?



Solution 1:[1]

You can capture these numbers with two named capturing groups into two columns at once:

df_depth = df[0].str.extract(r'depth:\s*(?P<top_depth>\d+(?:\.\d+)?)(?:\s*\w+)?\s*-\s*(?P<base_depth>\d+(?:\.\d+)?)')

The (?P<top_depth>...) and (?P<base_depth>...) named groups capture the two numbers, and str.extract uses the group names as the names of the resulting columns.
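
To see what the named groups capture outside of pandas, the same pattern can be tried on a single string with Python's built-in re module. This is only a small sketch: the sample string and the variable names are chosen for illustration.

```python
import re

# Same pattern as above; the sample text is an assumption for illustration.
pattern = re.compile(
    r'depth:\s*(?P<top_depth>\d+(?:\.\d+)?)(?:\s*\w+)?\s*-\s*(?P<base_depth>\d+(?:\.\d+)?)'
)

sample = 'depth: 3230 m - 3750 m'
match = pattern.search(sample)

# groupdict() maps each group name to the text that group captured
depths = match.groupdict() if match else {}
print(depths)  # {'top_depth': '3230', 'base_depth': '3750'}
```

The captured values are strings, so convert them if you need numbers.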

I used (?:\s*\w+)?\s* to match a single optional word between the two patterns, but you may just use .*? if you are not sure what can appear between the two:

df_depth = df[0].str.extract(r'depth:\s*(?P<top_depth>\d+(?:\.\d+)?).*?-\s*(?P<base_depth>\d+(?:\.\d+)?)')
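
The difference between the two variants shows up when more than one word sits between the numbers. The sketch below uses two made-up rows (the second one, with "m MD", is a hypothetical example) and compares both patterns:

```python
import pandas as pd

# Hypothetical rows: the second has two words between the numbers.
s = pd.Series(['depth: 3230 m - 3750 m', 'depth: 3230 m MD - 3750 m'])

num = r'\d+(?:\.\d+)?'
# Allows at most one word between the two numbers
one_word = rf'depth:\s*(?P<top_depth>{num})(?:\s*\w+)?\s*-\s*(?P<base_depth>{num})'
# Allows any text (lazily) between the two numbers
anything = rf'depth:\s*(?P<top_depth>{num}).*?-\s*(?P<base_depth>{num})'

strict = s.str.extract(one_word)   # second row does not match, so it is NaN
loose = s.str.extract(anything)    # both rows match

print(strict.to_string())
print(loose.to_string())
```

The stricter pattern is less likely to pick up unrelated text, while .*? is more forgiving. Note that . does not match a line break by default, so .*? stays within one line.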

Pandas test:

import pandas as pd

df = pd.DataFrame({'c':['depth: 3230 m - 3750 m']})
df_depth = df['c'].str.extract(r'depth:\s*(?P<top_depth>\d+(?:\.\d+)?)(?:\s*\w+)?\s*-\s*(?P<base_depth>\d+(?:\.\d+)?)')
print(df_depth.to_string())

Output:

  top_depth base_depth
0      3230       3750
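
The extracted values are strings. If you want floats, as in the question, and a case-insensitive match, something like the following should work. The rows here are hypothetical: the first has another "number - number" range before "depth:", which is the situation the question describes, and the second uses a capitalised "Depth:" with a decimal value.

```python
import re
import pandas as pd

# Hypothetical rows for illustration only.
df = pd.DataFrame({'c': [
    'interval 12 - 45, depth: 3230 m - 3750 m',
    'Depth: 100.5 m - 200 m',
]})

pattern = r'depth:\s*(?P<top_depth>\d+(?:\.\d+)?)(?:\s*\w+)?\s*-\s*(?P<base_depth>\d+(?:\.\d+)?)'

# Anchoring on "depth:" skips the earlier "12 - 45" range;
# flags=re.I makes the match case-insensitive, astype(float) converts the strings.
df_depth = df['c'].str.extract(pattern, flags=re.I).astype(float)
print(df_depth.to_string())
```

Rows that do not match come back as NaN, which astype(float) keeps as NaN.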

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Wiktor Stribiżew