'Extracting Country name and year from column and create a new columns in dataframe

In a situation below:

Source Value
sx_India-2021 23
dfr_Australia-1987 21
df_fg-Africa-2022 34

I have used "str.split" but it is not working for all the records. I need a separate column-"country" and "Year" from "Source".



Solution 1:[1]

Update

df[['Country', 'Year']] = pd.DataFrame(df['Source'].str.split(r'[_-]').str[-2:].tolist())
print(df)

# Output
               Source  Value    Country  Year
0       sx_India-2021     23      India  2021
1  dfr_Australia-1987     21  Australia  1987
2   df_fg-Africa-2022     34     Africa  2022

Use:

pat = r'[_-](?P<Country>[^-]*)-(?P<Year>\d{4})'
df = pd.concat([df, df['Source'].str.extract(pat)], axis=1)
print(df)

# Output
               Source  Value    Country  Year
0       sx_India-2021     23      India  2021
1  dfr_Australia-1987     21  Australia  1987
2   df_fg-Africa-2022     34     Africa  2022

If you don't need Source column, use:

df = pd.concat([df.pop('Source').str.extract(pat), df], axis=1)
print(df)

# Output
     Country  Year  Value
0      India  2021     23
1  Australia  1987     21
2     Africa  2022     34

Solution 2:[2]

IIUC, you could use str.extract:

df['Source'].str.extract(r'([^-_]+)-\d+$', expand=False)

output:

0        India
1    Australia
2       Africa
Name: Source, dtype: object

for both Country and Year:

df[['Country', 'Year']] = df['Source'].str.extract('([^-_]+)-(\d+)$')

output:

               Source  Value    Country  Year
0       sx_India-2021     23      India  2021
1  dfr_Australia-1987     21  Australia  1987
2   df_fg-Africa-2022     34     Africa  2022

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2