Speed up dropping rows based on pandas column values
I have a very large pandas data frame, which looks something like
df = pd.DataFrame({"Station": [2, 2, 2, 5, 5, 5, 6, 6],
                   "Day": [1, 2, 3, 1, 2, 3, 1, 2],
                   "Temp": [-7.0, 2.7, -1.3, -1.9, 0.2, 0.5, 1.3, 6.4]})
and I would like to filter out, as efficiently (quickly) as possible, all rows whose 'Station' value does not occur in exactly n rows.
stations = pd.unique(df['Station'])
n = 3
def complete(temp):
    for k in range(len(stations)):
        # drop every row of a station that does not have exactly n entries
        if len(temp[temp['Station'] == stations[k]].Temp) != n:
            temp.drop(temp.index[temp['Station'] == stations[k]], inplace=True)
I've been looking into using @jit(nopython=True) or Cython along the lines of the pandas "Enhancing performance" tutorial, but in the examples I have found the columns are treated separately from each other. I'm wondering: is the fastest way to somehow use @jit to build a list v of only the 'Station' values I want to keep and then filter the whole data frame with df = df[df.Station.isin(v)], or is there a better way?
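For illustration, here is a minimal sketch of that isin-style filtering written in plain pandas, without any @jit; the names counts and valid_stations are purely illustrative:

import pandas as pd

df = pd.DataFrame({"Station": [2, 2, 2, 5, 5, 5, 6, 6],
                   "Day": [1, 2, 3, 1, 2, 3, 1, 2],
                   "Temp": [-7.0, 2.7, -1.3, -1.9, 0.2, 0.5, 1.3, 6.4]})
n = 3

# Count how many rows each station has
counts = df['Station'].value_counts()

# Keep only the station values that occur exactly n times
valid_stations = counts.index[counts == n]

# Filter the whole data frame in a single vectorised step
df = df[df['Station'].isin(valid_stations)]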
Solution 1:[1]
You can group by "Station", transform with the count method, and compare the counts with n to create a boolean Series. Then use this mask to filter the relevant rows:
n=3
msk = df.groupby('Station')['Temp'].transform('count').eq(n)
df = df[msk]
Output:
Station Day Temp
0 2 1 -7.0
1 2 2 2.7
2 2 3 -1.3
3 5 1 -1.9
4 5 2 0.2
5 5 3 0.5
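Note that transform('count') counts non-NaN 'Temp' values per station; if rows with missing temperatures should still count towards n, a plain row count is what you want instead. A short sketch of two equivalent alternatives, reusing the df and n from above (variable names are illustrative):

# Count rows per station regardless of NaNs in 'Temp'
msk_size = df.groupby('Station')['Station'].transform('size').eq(n)
out_by_size = df[msk_size]

# Same result via groupby.filter, usually slower on large frames
# because it calls a Python function once per group
out_by_filter = df.groupby('Station').filter(lambda g: len(g) == n)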
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
