'Python pandas group by check if value changed then previous value
I've a problem with groupby function of pandas's library. I've the following dataframe.
d = {'id': [400001, 400001, 400001, 400002, 400003, 400003, 400004, 400004], 'result': ['P','N','N','N','N','N','N','P'], 'date':['27/10/2021','09/09/2021','03/07/2020','03/07/2020','30/06/2020','27/04/2022','27/04/2022','30/06/2020']}
df = pd.DataFrame(data=d)
df
| id | result | date |
|---|---|---|
| 400001 | N | 2020-07-03 |
| 400001 | N | 2021-09-09 |
| 400001 | P | 2021-10-27 |
| 400002 | N | 2020-07-03 |
| 400003 | N | 2020-06-30 |
| 400003 | N | 2022-04-27 |
| 400004 | P | 2020-06-30 |
| 400004 | N | 2022-04-27 |
I need to group by column 'id' and extract the value of column 'date' where the value of column 'result' change. If value in column 'result' doesn't change, keep the first value of column 'date'.
This an example:
| id | date |
|---|---|
| 400001 | 2021-10-27 |
| 400002 | 2020-07-03 |
| 400003 | 2020-06-30 |
| 400004 | 2022-04-27 |
I've tried this:
df['change'] = np.where(df.groupby('id').esito.apply(lambda x:x!=x.iloc[0]),'Y','N')
but the function doesn't works so well. The function verify the difference versus first element of column 'id' of groupby selection. I don't need this.
Can you help me? Thanks
Solution 1:[1]
You can compute a cumsum of the booleans identifying the changes. Then get the max index:
idx = (df.groupby('id')['result']
.apply(lambda s: s.ne(s.shift())
.cumsum()
.idxmax()
)
)
df.loc[idx]
Output:
id result date
1 400001 N 09/09/2021
3 400002 N 03/07/2020
4 400003 N 30/06/2020
7 400004 P 30/06/2020
NB. The input provided as DataFrame is different from the one as table. The output matching the DataFrame is shown here.
If needed, sort the dates first:
idx = (df.sort_values(by=['id', 'date'])
.groupby('id')['result']
.apply(lambda s: s.ne(s.shift())
.cumsum()
.idxmax()
)
)
df.loc[idx]
Output:
id result date
0 400001 P 27/10/2021
3 400002 N 03/07/2020
5 400003 N 27/04/2022
7 400004 P 30/06/2020
Solution 2:[2]
You can use drop duplicates twice to get what you want:
import pandas as pd
d = {'id': [400001, 400001, 400001, 400002, 400003, 400003, 400004, 400004],
'result': ['N', 'N', 'P', 'N', 'N', 'N', 'P', 'N'],
'date': ['27/10/2021', '09/09/2021', '03/07/2020', '03/07/2020', '30/06/2020', '27/04/2022', '27/04/2022',
'30/06/2020']}
df = pd.DataFrame(data=d)
df.drop_duplicates(subset=['id', 'result'], keep='first', inplace=True)
df.drop_duplicates(subset=['id'], keep='last', inplace=True)
print(df)
Output:
id result date
2 400001 P 03/07/2020
3 400002 N 03/07/2020
4 400003 N 30/06/2020
7 400004 N 30/06/2020
Notice that the output in your question may be wrong based on your requirements.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Menglong Li |
