'Split dataframe column at specific words

One column in my dataframe is a long string. I want to split out portions of the string into its own column based on a few different words. What would be the best way of doing this? Example below of what the data in the column looks like, and what i want to pull out.

Original Data:

ABC - Company Name Company Name Week 31 Drop 1

XYZ - Company Name Company Name Extra Cycle 9 Drop 2

I want to pull out the number after week and cycle. The number of words may vary by row but week and cycle will always be followed by a number. How would I pull out the number next to either word and put it in the same new column?

I have been able to pull out one or the other (the number after week or cylce), but not both into the same column. Havent gone down the route of directly iterating over the dataframe since thats not ideal

What i want the data to end up like:

String Number
ABC - Company Name Company Name Week 31 Drop 1 31
XYZ - Company Name Company Name Extra Cycle 9 Drop 2 9

Got it part of the way with the code below (can handle two words) but cant handle a third

df['cycle_num'] = df.project_name.str.extract('((?<=Cycle|Week\s).[0-9]+)', expand=True)

I get the following error if i try to add a third word:

re.error: look-behind requires fixed-width pattern



Solution 1:[1]

ended up finding a solution. this worked for me. not sure if this would lead to any exceptions (other than them just showing up in other locations) but it hasnt so far. no longer using week but capturing the number after other words/characters. I think I still had an issue with Joey's answer but it led to in the right direction if im remembering correctly

df['cycle_num'] = df.project_name.str.findall(r'(Cycle |#|Cycle |# |Cyc|Cyc )(\d+)')

Solution 2:[2]

Example dataframe:

df = pd.DataFrame([['ABC - Company Name Company Name Week 31 Drop 1'],
                   ['XYZ - Company Name Company Name Extra Cycle 9 Drop 2']],
                  columns=['a'])

enter image description here

Example code for new column:

import regex as rx
splitter = lambda x: int(rx.findall(r'(Week|Cycle) (\d+)', x)[0][1])
df['b'] = df.a.apply(splitter)

enter image description here

Solution 3:[3]

The error that you posted says that look-behind needs a fixed-width pattern. This means that you can only use words with the same length in your look-behind. The first two words, Cyle and Week\s, have the same length. The problem with the third word only occurs if you add a word that is shorter or longer.

To solve this issue you can or multiple look-behind instead of the words. I created an example which contains a third word of different size:

import pandas as pd

# Generating the data frame
rows = ['ABC - Company Name Company Name Week 31 Drop 1',
        'XYZ - Company Name Company Name Extra Cycle 9 Drop 2',
        'XYZ - Company Name Company Name Extra SomeOtherWord 432 Drop 2']
df = pd.DataFrame(rows)
df.columns = ['project_name']

df['cycle_num'] = df.project_name.str.extract('(((?<=Cycle\s)|(?<=Week\s)|(?<=SomeOtherWord\s))[0-9]+)', expand=True)[0]
print(df)

Output:

0  ABC - Company Name Company Name Week 31 Drop 1           31
1  XYZ - Company Name Company Name Extra Cycle 9 ...         9
2  XYZ - Company Name Company Name Extra SomeOthe...       432

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 lsdmctrubey
Solution 2 Joey Miths
Solution 3 JANO