'Extracting Country name and year from column and create a new columns in dataframe
In a situation below:
| Source | Value |
|---|---|
| sx_India-2021 | 23 |
| dfr_Australia-1987 | 21 |
| df_fg-Africa-2022 | 34 |
I have used "str.split" but it is not working for all the records. I need a separate column-"country" and "Year" from "Source".
Solution 1:[1]
Update
df[['Country', 'Year']] = pd.DataFrame(df['Source'].str.split(r'[_-]').str[-2:].tolist())
print(df)
# Output
Source Value Country Year
0 sx_India-2021 23 India 2021
1 dfr_Australia-1987 21 Australia 1987
2 df_fg-Africa-2022 34 Africa 2022
Use:
pat = r'[_-](?P<Country>[^-]*)-(?P<Year>\d{4})'
df = pd.concat([df, df['Source'].str.extract(pat)], axis=1)
print(df)
# Output
Source Value Country Year
0 sx_India-2021 23 India 2021
1 dfr_Australia-1987 21 Australia 1987
2 df_fg-Africa-2022 34 Africa 2022
If you don't need Source column, use:
df = pd.concat([df.pop('Source').str.extract(pat), df], axis=1)
print(df)
# Output
Country Year Value
0 India 2021 23
1 Australia 1987 21
2 Africa 2022 34
Solution 2:[2]
IIUC, you could use str.extract:
df['Source'].str.extract(r'([^-_]+)-\d+$', expand=False)
output:
0 India
1 Australia
2 Africa
Name: Source, dtype: object
for both Country and Year:
df[['Country', 'Year']] = df['Source'].str.extract('([^-_]+)-(\d+)$')
output:
Source Value Country Year
0 sx_India-2021 23 India 2021
1 dfr_Australia-1987 21 Australia 1987
2 df_fg-Africa-2022 34 Africa 2022
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 |
