'Split dataframe column at specific words
One column in my dataframe is a long string. I want to split out portions of the string into its own column based on a few different words. What would be the best way of doing this? Example below of what the data in the column looks like, and what i want to pull out.
Original Data:
ABC - Company Name Company Name Week 31 Drop 1
XYZ - Company Name Company Name Extra Cycle 9 Drop 2
I want to pull out the number after week and cycle. The number of words may vary by row but week and cycle will always be followed by a number. How would I pull out the number next to either word and put it in the same new column?
I have been able to pull out one or the other (the number after week or cylce), but not both into the same column. Havent gone down the route of directly iterating over the dataframe since thats not ideal
What i want the data to end up like:
String | Number |
---|---|
ABC - Company Name Company Name Week 31 Drop 1 | 31 |
XYZ - Company Name Company Name Extra Cycle 9 Drop 2 | 9 |
Got it part of the way with the code below (can handle two words) but cant handle a third
df['cycle_num'] = df.project_name.str.extract('((?<=Cycle|Week\s).[0-9]+)', expand=True)
I get the following error if i try to add a third word:
re.error: look-behind requires fixed-width pattern
Solution 1:[1]
ended up finding a solution. this worked for me. not sure if this would lead to any exceptions (other than them just showing up in other locations) but it hasnt so far. no longer using week but capturing the number after other words/characters. I think I still had an issue with Joey's answer but it led to in the right direction if im remembering correctly
df['cycle_num'] = df.project_name.str.findall(r'(Cycle |#|Cycle |# |Cyc|Cyc )(\d+)')
Solution 2:[2]
Example dataframe:
df = pd.DataFrame([['ABC - Company Name Company Name Week 31 Drop 1'],
['XYZ - Company Name Company Name Extra Cycle 9 Drop 2']],
columns=['a'])
Example code for new column:
import regex as rx
splitter = lambda x: int(rx.findall(r'(Week|Cycle) (\d+)', x)[0][1])
df['b'] = df.a.apply(splitter)
Solution 3:[3]
The error that you posted says that look-behind needs a fixed-width pattern. This means that you can only use words with the same length in your look-behind. The first two words, Cyle
and Week\s
, have the same length. The problem with the third word only occurs if you add a word that is shorter or longer.
To solve this issue you can or
multiple look-behind instead of the words. I created an example which contains a third word of different size:
import pandas as pd
# Generating the data frame
rows = ['ABC - Company Name Company Name Week 31 Drop 1',
'XYZ - Company Name Company Name Extra Cycle 9 Drop 2',
'XYZ - Company Name Company Name Extra SomeOtherWord 432 Drop 2']
df = pd.DataFrame(rows)
df.columns = ['project_name']
df['cycle_num'] = df.project_name.str.extract('(((?<=Cycle\s)|(?<=Week\s)|(?<=SomeOtherWord\s))[0-9]+)', expand=True)[0]
print(df)
Output:
0 ABC - Company Name Company Name Week 31 Drop 1 31
1 XYZ - Company Name Company Name Extra Cycle 9 ... 9
2 XYZ - Company Name Company Name Extra SomeOthe... 432
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | lsdmctrubey |
Solution 2 | Joey Miths |
Solution 3 | JANO |