Drop duplicates keeping the row with the highest value in another column

import pandas as pd

# One [Name, Count] pair per row, so the data matches the two column names
a = [['John', 10], ['Mary', 22], ['John', 50]]
df1 = pd.DataFrame(a, columns=['Name', 'Count'])

Given a DataFrame like this, I want to compare all rows that share the same "Name" and keep only the one with the highest "Count". I'm not sure how to do this with a DataFrame in Python.

Example: in the case above, the answer would be:

   Name  Count
   Mary     22
   John     50

The lower value John 10 has been dropped (I only want to see the highest value of "Count" based on the same value for "Name").

In SQL it would be something like a Select Case query (wherein I select the case where Name == Name & Count > Count, recursively, to determine the highest number), or a for loop over each name. But as I understand it, looping over a DataFrame is a bad idea due to the nature of the object.

Is there a way to do this with a DataFrame in Python? I could create a new DataFrame for each name (e.g. one with only John) and then take the highest value (df.value()[:1] or similar), but as I have many hundreds of unique entries that seems like a terrible solution. :D



Solution 1:[1]

Either use sort_values and drop_duplicates:

df1.sort_values('Count').drop_duplicates('Name', keep='last')

   Name  Count
1  Mary     22
2  John     50
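
For reference, here is a self-contained sketch of the same approach. The row-wise data construction is an assumption about what the question intended, and the trailing sort_index() is an optional extra that only restores the original row order.

```python
import pandas as pd

# Assumed construction: one [Name, Count] pair per row.
df1 = pd.DataFrame(
    [['John', 10], ['Mary', 22], ['John', 50]],
    columns=['Name', 'Count'],
)

# Sort ascending by Count so the largest Count for each Name comes last,
# then keep only that last occurrence of each Name.
result = df1.sort_values('Count').drop_duplicates('Name', keep='last')

# Optional: put the surviving rows back in their original order.
result = result.sort_index()
print(result)
```

Because drop_duplicates keeps whole rows, any other columns in the frame stay attached to the surviving row.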

Or, as miradulo suggested, use groupby and max:

df1.groupby('Name')['Count'].max().reset_index()

   Name  Count
0  John     50
1  Mary     22
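
One difference worth noting: the groupby version returns only the grouping column and the aggregated column. If the frame has other columns that should stay with the winning row, idxmax is one possible way to select complete rows. A minimal sketch follows. The 'City' column is hypothetical and not part of the question, and the approach assumes the index labels are unique.

```python
import pandas as pd

# 'City' is a hypothetical extra column, added for illustration only.
df = pd.DataFrame({
    'Name': ['John', 'Mary', 'John'],
    'Count': [10, 22, 50],
    'City': ['Oslo', 'Rome', 'Lima'],
})

# For each Name, idxmax returns the index label of the row
# holding the largest Count.
best_rows = df.groupby('Name')['Count'].idxmax()

# Use those labels to pull the complete rows, extra columns included.
result = df.loc[best_rows]
print(result)
```

If several rows tie for the highest Count within a name, idxmax picks the first of them.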

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1