Speed up dropping rows based on pandas column values
I have a very large pandas data frame, which looks something like
df = pd.DataFrame({"Station": [2, 2, 2, 5, 5, 5, 6, 6],
                   "Day": [1, 2, 3, 1, 2, 3, 1, 2],
                   "Temp": [-7.0, 2.7, -1.3, -1.9, 0.2, 0.5, 1.3, 6.4]})
and I would like to filter out, as efficiently (quickly) as possible, all rows whose 'Station' value does not occur in exactly n rows.
stations = pd.unique(df['Station'])
n = 3
def complete(temp):
    for k in range(len(stations)):
        # drop every row of a station that does not have exactly n entries
        if len(temp[temp['Station'] == stations[k]].Temp) != n:
            temp.drop(temp.index[temp['Station'] == stations[k]], inplace=True)
I've been looking into using @jit(nopython=True) or Cython along the lines of the pandas "Enhancing performance" tutorial, but in the examples I have found the columns are treated separately from each other. I'm wondering: is the fastest way to somehow use @jit to build a list v of only the 'Station' values I want to keep and then filter the whole data frame with df = df[df.Station.isin(v)], or is there a better way?
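For illustration, here is a minimal sketch of that isin-style filtering written in plain pandas, without any @jit; the names counts and valid_stations are purely illustrative:

import pandas as pd

df = pd.DataFrame({"Station": [2, 2, 2, 5, 5, 5, 6, 6],
                   "Day": [1, 2, 3, 1, 2, 3, 1, 2],
                   "Temp": [-7.0, 2.7, -1.3, -1.9, 0.2, 0.5, 1.3, 6.4]})
n = 3

# Count how many rows each station has
counts = df['Station'].value_counts()

# Keep only the station values that occur exactly n times
valid_stations = counts.index[counts == n]

# Filter the whole data frame in a single vectorised step
df = df[df['Station'].isin(valid_stations)]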
Solution 1:[1]
You can group by "Station", transform with the count method, and compare the counts with n to create a boolean Series. Then use this mask to filter the relevant rows:
n=3
msk = df.groupby('Station')['Temp'].transform('count').eq(n)
df = df[msk]
Output:
Station Day Temp
0 2 1 -7.0
1 2 2 2.7
2 2 3 -1.3
3 5 1 -1.9
4 5 2 0.2
5 5 3 0.5
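Note that transform('count') counts non-NaN 'Temp' values per station; if rows with missing temperatures should still count towards n, a plain row count is what you want instead. A short sketch of two equivalent alternatives, reusing the df and n from above (variable names are illustrative):

# Count rows per station regardless of NaNs in 'Temp'
msk_size = df.groupby('Station')['Station'].transform('size').eq(n)
out_by_size = df[msk_size]

# Same result via groupby.filter, usually slower on large frames
# because it calls a Python function once per group
out_by_filter = df.groupby('Station').filter(lambda g: len(g) == n)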
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
