How to create column marking latest date from grouped data
I am working with a data frame that looks like the following:
| ID 1 | Type | Name | Date | Marked | Updated_Marked |
|---|---|---|---|---|---|
| 1 | abs | 576 | 2009 | Last | |
| 1 | sqr | 573 | 2020 | | |
| 1 | fun | 234 | 2020 | | |
| 1 | no | 556 | 2020 | | |
| 2 | abs | 364 | 2015 | | |
| 2 | sqr | 937 | 2019 | | |
| 2 | fun | 562 | 2020 | Last | |
| 2 | ok | 794 | 2021 | | |
| 3 | abs | 846 | 2012 | Last | |
| 3 | fun | 540 | 2015 | | |
| 3 | sqr | 276 | 2016 | | |
| 3 | no | 649 | 2017 | | |
| 4 | abs | 845 | 1999 | | |
| 4 | fun | 047 | 2012 | | |
| 4 | sqr | 176 | 2020 | Last | |
| 4 | sqr | 959 | 2020 | | |
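For reproducibility, the sample frame can be built along these lines. This is a minimal sketch assuming pandas; storing `Name` as a string (so that `047` keeps its leading zero) and representing empty cells as empty strings are assumptions, not something stated in the table.

```python
import pandas as pd

# Sample data from the table above; empty cells are stored as '' (assumption).
rows = [
    (1, "abs", "576", 2009, "Last"),
    (1, "sqr", "573", 2020, ""),
    (1, "fun", "234", 2020, ""),
    (1, "no",  "556", 2020, ""),
    (2, "abs", "364", 2015, ""),
    (2, "sqr", "937", 2019, ""),
    (2, "fun", "562", 2020, "Last"),
    (2, "ok",  "794", 2021, ""),
    (3, "abs", "846", 2012, "Last"),
    (3, "fun", "540", 2015, ""),
    (3, "sqr", "276", 2016, ""),
    (3, "no",  "649", 2017, ""),
    (4, "abs", "845", 1999, ""),
    (4, "fun", "047", 2012, ""),
    (4, "sqr", "176", 2020, "Last"),
    (4, "sqr", "959", 2020, ""),
]
df = pd.DataFrame(rows, columns=["ID 1", "Type", "Name", "Date", "Marked"])
df["Updated_Marked"] = ""  # column to be filled in
```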
I am trying to write a program to correct and update the dataframe with a new Updated_Marked column that will correctly display "Last" given the following conditions:
Within each ID 1 group, only look at rows where Type == "abs" | Type == "sqr" | Type == "fun". Among those rows, find the latest Date. If "Last" is already marked next to the latest date, do nothing. If "Last" is not marked correctly, update the new column by placing "Last" in the row with the latest date. If "Last" is not marked correctly and multiple rows are tied for the latest date (same Date value), put "Multiple" next to each of them.
EDIT: I have come across one more case: a row is originally correctly marked as "Last", but because other rows within the group have the same year, those other rows get marked as "Multiple". In this case, I would like to assume the original mark is correct and make no updates. The initial dataframe and the expected output have been changed to reflect this.
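Restated as code, the rules for a single ID 1 group might look like the sketch below. It is plain Python over one group's rows rather than a pandas solution; the function name `updated_marks` and the `(type, date, marked)` tuple layout are made up for illustration.

```python
ALLOWED_TYPES = {"abs", "sqr", "fun"}

def updated_marks(rows):
    """Return the Updated_Marked value for each row of ONE 'ID 1' group.

    rows is a list of (type, date, marked) tuples; the result is a list of
    '', 'Last' or 'Multiple', in the same order as rows.
    """
    out = [""] * len(rows)
    # Positions of rows whose Type is one of the three considered types.
    eligible = [i for i, (typ, _, _) in enumerate(rows) if typ in ALLOWED_TYPES]
    if not eligible:
        return out
    latest_date = max(rows[i][1] for i in eligible)
    latest = [i for i in eligible if rows[i][1] == latest_date]
    # If any latest row is already marked "Last", assume it is correct.
    if any(rows[i][2] == "Last" for i in latest):
        return out
    label = "Last" if len(latest) == 1 else "Multiple"
    for i in latest:
        out[i] = label
    return out

# Group 3 from the table: "Last" sits on 2012, but 2016 is the latest eligible date.
print(updated_marks([("abs", 2012, "Last"), ("fun", 2015, ""),
                     ("sqr", 2016, ""), ("no", 2017, "")]))
# ['', '', 'Last', '']
```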
The updated dataframe will look like the following:
+------+------+------+------+--------+----------------+
| ID 1 | Type | Name | Date | Marked | Updated_Marked |
+------+------+------+------+--------+----------------+
| 1 | abs | 576 | 2009 | Last | |
| 1 | sqr | 573 | 2020 | | Multiple |
| 1 | fun | 234 | 2020 | | Multiple |
| 1 | no | 556 | 2020 | | |
| 2 | abs | 364 | 2015 | | |
| 2 | sqr | 937 | 2019 | | |
| 2 | fun | 562 | 2020 | Last | |
| 2 | ok | 794 | 2021 | | |
| 3 | abs | 846 | 2012 | Last | |
| 3 | fun | 540 | 2015 | | |
| 3 | sqr | 276 | 2016 | | Last |
| 3 | no | 649 | 2017 | | |
| 4 | abs | 845 | 1999 | | |
| 4 | fun | 047 | 2012 | | |
| 4 | sqr | 176 | 2020 | Last | |
| 4 | sqr | 959 | 2020 | | |
+------+------+------+------+--------+----------------+
EDIT2: After trying

    msk1 = df['Type'].isin(['abs','sqr','fun'])
    msk2 = df.groupby(['ID 1', msk1])['Date'].transform('max').eq(df['Date'])
    msk3 = df['Marked'].ne('Last').groupby([df['ID 1'], msk1 & msk2]).cummin()
    msks = msk1 & msk2 & msk3
    v = df.loc[msks, 'ID 1']
    df.loc[msks, 'Updated_Marked'] = v.map(v.value_counts().gt(1).map({True: 'Multiple', False: 'Last'}))
    df = df.fillna('')

my dataframe looks like the following:
+------+------+------+------+--------+----------------+
| ID 1 | Type | Name | Date | Marked | Updated_Marked |
+------+------+------+------+--------+----------------+
| 1 | abs | 576 | 2009 | Last | |
| 1 | sqr | 573 | 2020 | | Multiple |
| 1 | fun | 234 | 2020 | | Multiple |
| 1 | no | 556 | 2020 | | |
| 2 | abs | 364 | 2015 | | |
| 2 | sqr | 937 | 2019 | | |
| 2 | fun | 562 | 2020 | Last | Last |
| 2 | ok | 794 | 2021 | | |
| 3 | abs | 846 | 2012 | Last | |
| 3 | fun | 540 | 2015 | | |
| 3 | sqr | 276 | 2016 | | Last |
| 3 | no | 649 | 2017 | | |
| 4 | abs | 845 | 1999 | | |
| 4 | fun | 047 | 2012 | | |
| 4 | sqr | 176 | 2020 | Last | Multiple |
| 4 | sqr | 959 | 2020 | | Multiple |
+------+------+------+------+--------+----------------+
Essentially, if a row is originally marked "Last" and it holds the latest date (even if other rows share that date), I would like to leave the Updated_Marked column blank for all of those rows.
Solution 1:[1]
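Here's one way using three masks. The idea is to identify the rows that must be updated, based on their Type, whether they hold the latest Date, and whether a "Last" mark already covers them. Then, for those rows, assign the value in the Updated_Marked column depending on how many of them share the same ID 1: "Multiple" if more than one, "Last" otherwise.

    # Rows whose Type is one of the three types of interest
    msk1 = df['Type'].isin(['abs','sqr','fun'])
    # Rows holding the latest Date within their (ID 1, msk1) subgroup
    msk2 = df.groupby(['ID 1', msk1])['Date'].transform('max').eq(df['Date'])
    # Turns False from the first row marked "Last" onward within each (ID 1, msk1 & msk2) subgroup
    msk3 = df['Marked'].ne('Last').groupby([df['ID 1'], msk1 & msk2]).cummin()
    msks = msk1 & msk2 & msk3
    # ID 1 of each row to update; an ID that appears more than once means a tie
    v = df.loc[msks, 'ID 1']
    df.loc[msks, 'Updated_Marked'] = v.map(v.value_counts().gt(1).map({True: 'Multiple', False: 'Last'}))
    # Blank out the cells that were not assigned
    df = df.fillna('')
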
Output:
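| | ID 1 | Type | Name | Date | Marked | Updated_Marked |
|---|---|---|---|---|---|---|
| 0 | 1 | abs | 576 | 2009 | Last | |
| 1 | 1 | sqr | 573 | 2020 | | Multiple |
| 2 | 1 | fun | 234 | 2020 | | Multiple |
| 3 | 1 | no | 556 | 2020 | | |
| 4 | 2 | abs | 364 | 2015 | | |
| 5 | 2 | sqr | 937 | 2019 | | |
| 6 | 2 | fun | 562 | 2020 | Last | |
| 7 | 2 | ok | 794 | 2021 | | |
| 8 | 3 | abs | 846 | 2012 | Last | |
| 9 | 3 | fun | 540 | 2015 | | |
| 10 | 3 | sqr | 276 | 2016 | | Last |
| 11 | 3 | no | 649 | 2017 | | |
| 12 | 4 | abs | 845 | 1999 | | |
| 13 | 4 | fun | 47 | 2012 | | |
| 14 | 4 | sqr | 176 | 2020 | Last | |
| 15 | 4 | sqr | 959 | 2020 | | |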
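One possible caveat with the `cummin` step: as far as I can tell, it only clears rows at or after the row already marked "Last" within the subgroup, so the result depends on row order. If a tied latest row came before the marked one, it would still be flagged. A sketch of an order-independent variant follows; the function name `mark_latest` and the intermediate names are hypothetical, and it swaps `cummin` for a per-group `transform('any')` check.

```python
import pandas as pd

def mark_latest(df, types=("abs", "sqr", "fun")):
    """Return a copy of df with Updated_Marked filled in (sketch)."""
    out = df.copy()
    eligible = out["Type"].isin(types)
    # Latest Date per ID 1 among eligible rows only (non-eligible rows are masked to NaN).
    latest_date = out["Date"].where(eligible).groupby(out["ID 1"]).transform("max")
    latest = eligible & out["Date"].eq(latest_date)
    # True for every row of a group where some latest row is already "Last".
    already_ok = (latest & out["Marked"].eq("Last")).groupby(out["ID 1"]).transform("any")
    target = latest & ~already_ok
    # The number of rows to update per group decides between the two labels.
    n_target = target.groupby(out["ID 1"]).transform("sum")
    out["Updated_Marked"] = ""
    out.loc[target & n_target.eq(1), "Updated_Marked"] = "Last"
    out.loc[target & n_target.gt(1), "Updated_Marked"] = "Multiple"
    return out

# Group 4 with the marked row placed AFTER the row it is tied with.
demo = pd.DataFrame({
    "ID 1": [4, 4, 4, 4],
    "Type": ["abs", "fun", "sqr", "sqr"],
    "Date": [1999, 2012, 2020, 2020],
    "Marked": ["", "", "", "Last"],
})
print(mark_latest(demo)["Updated_Marked"].tolist())
```

Because the "already marked" check is computed for the whole group at once, the position of the marked row within the group should not matter here.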
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
