'Using column as tiebreaker for maximums in Python

Reposted with clarification.

I am working on a dataframe that looks like the following:

+-------+----+------+------+
| Value | ID | Date | ID 2 |
+-------+----+------+------+
|     1 |  5 | 2012 | 111 |
|     1 |  5 | 2012 | 112 |
|     0 | 12 | 2017 | 113 |
|     0 | 12 | 2022 | 114 |
|     1 | 27 | 2005 | 115 |
|     1 | 27 | 2011 | 116 |
+-------+----+------+-----+

Only using rows with "Value" == "1" ("value is boolean), I would like to group the dataframe by ID and input the string "latest" to new (blank) column, giving the following output:

+-------+----+------+------+-------+
| Value | ID | Date | ID 2 |Latest |
+-------+----+------+------+-------+
|     1 |  5 | 2012 | 111 |        |
|     1 |  5 | 2012 | 112 | Latest |
|     0 | 12 | 2017 | 113 |        |
|     0 | 12 | 2022 | 114 |        |
|     1 | 27 | 2005 | 115 |        |
|     1 | 27 | 2011 | 116 | Latest |
+-------+----+------+-----+--------+

I am using the following code to find the maximum:

latest = df.query('Value==1').groupby("ID").max("Year").assign(Latest = "Latest")
df = pd.merge(df,latest,how="outer")
df

But I have since realized some of the max years are the same, i.e. there could be 4 rows, all with max year 2017. For the tiebreaker, I need to use the max ID 2 within groups.

latest = df.query('Value==1').groupby("ID").max("Year").groupby("ID 2").max("ID 2").assign(Latest = "Latest")
df = pd.merge(df,latest,how="outer")
df

but it is giving me a dataframe completely different than the one desired.



Solution 1:[1]

Here's one way a bit similar to your own approach. Basically, groupby + last to get the latest + assign a variable + merge:

df = df.merge(df.groupby(['ID', 'Value'])['ID 2'].last().reset_index().assign(Latest=lambda x: np.where(x['Value'], 'Latest', '')), how='outer').fillna('')

or even this works:

df = df.query('Value==1').groupby('ID').last('ID 2').assign(Latest='Latest').merge(df, how='outer').fillna('')

Output:

   Value  ID  Date  ID 2  Latest
0      1   5  2012   111        
1      1   5  2012   112  Latest
2      0  12  2017   113        
3      0  12  2022   114        
4      1  27  2005   115        
5      1  27  2011   116  Latest

Solution 2:[2]

Try this:

df['Latest'] = np.where(df['ID2'].eq(df.groupby(df['Value'].ne(df['Value'].shift(1)).cumsum())['ID2'].transform('max')) & df['Value'].ne(0), 'Latest', '')

Output:

>>> df
   Value  ID  Date  ID2  Latest
0      1   5  2012  111        
1      1   5  2012  112  Latest
2      0  12  2017  113        
3      0  12  2022  114        
4      1  27  2005  115        
5      1  27  2011  116  Latest

Solution 3:[3]

Here is one with window functions:

c = df['Value'].ne(df['Value'].shift()) 
s = df['Date'].add(df['ID 2']) #add the year and ID for handling duplicates
c1 = s.eq(s.groupby(c.cumsum()).transform('max'))& (df['Value'].eq(1))

df['Latest'] = np.where(c1,'Latest','')

print(df)

   Value  ID  Date  ID 2  Latest
0      1   5  2012   111        
1      1   5  2012   112  Latest
2      0  12  2017   113        
3      0  12  2022   114        
4      1  27  2005   115        
5      1  27  2011   116  Latest

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 richardec
Solution 3 anky