Replace a combination of space, hyphen and text, or a "By [Author]", using regex and pandas

I want to remove either a combination of a space, a hyphen, a space and the text that follows, or the combination "By [Author]". This is my data frame:

import pandas as pd
my_titles = ['Peter Rabbit - Volume II', 'Who stole my cookie  By Cole Pattesh', 'The Stormy Night -  Nia Costas']
adf = pd.DataFrame({'my_titles':my_titles})
adf
    my_titles
0   Peter Rabbit - Volume II
1   Who stole my cookie By Cole Pattesh
2   The Stormy Night - Nia Costas

My expected df is:

    my_titles
0   Peter Rabbit
1   Who stole my cookie
2   The Stormy Night

I have tried this, expecting regex to recognize the '\s' space and the '|' (or):

adf['my_titles'].replace('\s-\s*|\sBy\s*$','',regex=True)
adf

I also tried this, trying to chain the space and the words:

adf['my_titles'].replace('[ - \w]|[ By \w]','',regex=True)
adf

Please, do you know what I am doing wrong?



Solution 1:[1]

You can use

import pandas as pd
my_titles = ['Peter Rabbit - Volume II', 'Who stole my cookie  By Cole Pattesh', 'The Stormy Night -  Nia Costas']
adf = pd.DataFrame({'my_titles':my_titles})
adf['my_titles'] = adf['my_titles'].str.replace(r'\s+(?:-\s+|By\s+[A-Z]).*', '', regex=True)

Output of print(adf['my_titles']):

0           Peter Rabbit
1    Who stole my cookie
2       The Stormy Night
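
As for what went wrong in the question's attempts: Series.replace returns a new Series, so the result has to be assigned back to the column, and neither pattern consumes the text after the separator. The sketch below is only an illustration, not part of the original answer. It applies both patterns with the standard re module (an assumption made to keep the example free of a pandas dependency; with regex=True pandas should apply the same pattern to each element).

```python
import re

titles = [
    'Peter Rabbit - Volume II',
    'Who stole my cookie  By Cole Pattesh',
    'The Stormy Night -  Nia Costas',
]

# First attempt: only the separator itself is matched, so the text after it
# stays. The `By` branch ends with `$`, so it only matches when nothing but
# whitespace follows `By`, which is never the case in this data.
attempt_1 = [re.sub(r'\s-\s*|\sBy\s*$', '', t) for t in titles]
print(attempt_1)
# ['Peter RabbitVolume II', 'Who stole my cookie  By Cole Pattesh', 'The Stormy NightNia Costas']

# Second attempt: square brackets define a character class (a set of single
# characters), not a sequence, so every space and word character is removed
# one by one instead of a " - text" or " By text" chunk.
attempt_2 = [re.sub(r'[ - \w]|[ By \w]', '', t) for t in titles]
print(attempt_2)
```

In both cases the fix is to match the separator and then everything up to the end of the string.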

Regex details:

  • \s+ - one or more whitespaces
  • (?:-\s+|By\s+[A-Z]) - a - and one or more whitespaces, or By, one or more whitespaces, and an uppercase letter
  • .* - the rest of the line.
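
The breakdown above can be checked outside pandas with a small sketch using the standard re module. The extra titles 'Spider-Man', 'Drive by night' and 'Stand By Me' are hypothetical inputs that are not in the question. They are included only to show where the pattern does and does not fire.

```python
import re

# Same pattern as in the answer, compiled once.
pattern = re.compile(r'\s+(?:-\s+|By\s+[A-Z]).*')

titles = [
    'Peter Rabbit - Volume II',
    'Who stole my cookie  By Cole Pattesh',
    'The Stormy Night -  Nia Costas',
]

cleaned = [pattern.sub('', t) for t in titles]
print(cleaned)  # ['Peter Rabbit', 'Who stole my cookie', 'The Stormy Night']

# Hypothetical edge cases (not from the question):
print(pattern.sub('', 'Spider-Man'))      # unchanged: no whitespace before the hyphen
print(pattern.sub('', 'Drive by night'))  # unchanged: 'by' is lowercase
print(pattern.sub('', 'Stand By Me'))     # 'Stand': ' By ' plus a capital letter also cuts real titles
```

The last case shows a limitation worth keeping in mind: a title that itself contains " By " followed by a capitalized word would be truncated, so the pattern suits data where "By" only introduces the author.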

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Wiktor Stribiżew