Efficient way to get the N largest values of a column

I need to get the w highest values of a column, grouping by country.

The code below is working:

w = 100
df.groupby('country').apply(lambda x: x.sort_values('x', ascending=False).head(w))

Is there a way to make this code more efficient? My dataset is huge, roughly 30 million (30kk) rows.



Solution 1:[1]

You can try pandas.core.groupby.SeriesGroupBy.nlargest. It is a Series method, so select the column 'x' after grouping:

w = 100
df.groupby('country')['x'].nlargest(w)

According to the documentation, nlargest is:

Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.

Since your w=100 is small relative to 30kk (30 million) rows, it should be faster, provided each country group is much larger than w.
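
Here is a minimal sketch of the approach on a toy frame. The column names 'country' and 'x' come from the question; the 'label' column and the sample values are made up for illustration. It also shows one possible way to get the full rows back, since nlargest on a grouped column returns only that column, indexed by group key and original row label.

```python
import pandas as pd

# Toy data; 'label' is a hypothetical extra column, not from the question.
df = pd.DataFrame({
    "country": ["US", "US", "US", "BR", "BR", "BR"],
    "x": [5, 1, 9, 3, 8, 2],
    "label": ["a", "b", "c", "d", "e", "f"],
})
w = 2  # small value so the result is easy to inspect

# Series of the w largest 'x' values per country,
# indexed by (country, original row label).
top = df.groupby("country")["x"].nlargest(w)

# The last index level holds the original row labels,
# so it can be used to look up the complete rows.
top_rows = df.loc[top.index.get_level_values(-1)]

print(top)
print(top_rows)
```

The row lookup assumes the index of df is unique. Whether this beats the sort_values version on your data depends on the group sizes, so it is worth timing both on a sample.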

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ynjxsjmh