Is there an efficient way in pandas to get the tail rows that satisfy a condition?
I want to get the tail rows that match a condition.
For example, I want all of the trailing negative rows from column 'A':
test = pd.DataFrame({'A':[-8, -9, -10, 1, 2, 3, 0, -1,-2,-3]})
I expect a method that returns a new frame like:
A
0 -1
1 -2
2 -3
Note that it is not certain how many negative numbers are in the tail, so I cannot simply run test.tail(3).
The pandas tail() function only works with a fixed number of rows.
My input data frame might be too large, so I don't want to run a simple loop that checks rows one by one.
Is there a smart way to do that?
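For reference, here is one vectorized way to produce exactly this output; a minimal sketch, assuming the frame has a default 0-based integer index and contains at least one non-negative value:

```python
import pandas as pd

test = pd.DataFrame({'A': [-8, -9, -10, 1, 2, 3, 0, -1, -2, -3]})

# Find the label of the last non-negative row; everything after it is the
# negative tail.  (Assumes a default integer index and at least one
# non-negative value in 'A'.)
nonneg = test['A'].ge(0)
last_nonneg = nonneg[nonneg].index[-1]
tail = test.loc[last_nonneg + 1:].reset_index(drop=True)
print(tail)
#    A
# 0 -1
# 1 -2
# 2 -3
```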
Solution 1:[1]
What's the tail for? It seems like you just need the negative numbers:
test.query("A < 0")
Update: find where the sign changes, split the array at those points, and take the last piece:
import numpy as np

# False marks the indices where the sign (negative vs. non-negative) flips.
split_points = (test.A.shift(1) < 0) == (test.A < 0)
np.split(test, split_points.loc[lambda x: ~x].index.tolist())[-1]
Output:
A
7 -1
8 -2
9 -3
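To see the difference between the two snippets above, a quick side-by-side comparison (the split call mirrors the answer's code; note that np.split on a DataFrame may emit a deprecation warning on recent pandas versions):

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'A': [-8, -9, -10, 1, 2, 3, 0, -1, -2, -3]})

# query() keeps every negative row, including the leading ones...
all_neg = test.query("A < 0")
print(all_neg['A'].tolist())   # [-8, -9, -10, -1, -2, -3]

# ...while splitting at the sign changes isolates just the trailing run.
split_points = (test.A.shift(1) < 0) == (test.A < 0)
tail_neg = np.split(test, split_points.loc[lambda x: ~x].index.tolist())[-1]
print(tail_neg['A'].tolist())  # [-1, -2, -3]
```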
Solution 2:[2]
Solution 3:[3]
Is this what you wanted?
test = pd.DataFrame({'A':[-8, -9, -10, 1, 2, 3, 0, -1,-2,-3]})
test = test.iloc[::-1]
test.loc[test.index.max():test[test['A'].ge(0)].index[0]+1]
Output:
A
9 -3
8 -2
7 -1
Edit: if you want to get it back into the original order:
test.loc[test.index.max():test[test['A'].ge(0)].index[0]+1].iloc[::-1]
A
7 -1
8 -2
9 -3
Optionally, add .reset_index(drop=True) if you need an index starting at 0.
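Putting this answer together as one self-contained snippet (reversed slice, restored order, and the optional index reset):

```python
import pandas as pd

test = pd.DataFrame({'A': [-8, -9, -10, 1, 2, 3, 0, -1, -2, -3]})

# Work on the reversed frame: slice from the last row back to just past the
# first non-negative value, then restore the original order and index.
rev = test.iloc[::-1]
result = (rev.loc[rev.index.max():rev[rev['A'].ge(0)].index[0] + 1]
             .iloc[::-1]
             .reset_index(drop=True))
print(result)
#    A
# 0 -1
# 1 -2
# 2 -3
```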
Solution 4:[4]
I improved my earlier test and ran another round, since the old sample size was too small and I was afraid the %%time measurement might not be accurate.
My new test uses a very large head of 10000000 numbers and a tail of 3 negative numbers,
so the new test can show how the overall data frame size impacts performance.
The code is below:
%%time
arr = np.arange(1,10000000,1)
arr = np.concatenate((arr, [-2,-3,-4]))
test = pd.DataFrame({'A':arr})
test = test.iloc[::-1]
test.loc[test.index.max():test[test['A'].ge(0)].index[0]+1].iloc[::-1]

%%time
arr = np.arange(1,10000000,1)
arr = np.concatenate((arr, [-2,-3,-4]))
test = pd.DataFrame({'A':arr})
split_points = (test.A.shift(1)<0) == (test.A<0)
np.split(test, split_points.loc[lambda x: x==False].index.tolist())[-1]
Because of system noise, I ran the test 10 times; the two methods perform very similarly, and in about 50% of the cases Patryk's code was even faster. (The original post included a screenshot of the timings.)
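%%time is IPython-only; outside a notebook, the same comparison can be sketched with the standard timeit module (both functions reproduce the posted code):

```python
import timeit

import numpy as np
import pandas as pd

arr = np.concatenate((np.arange(1, 10000000), [-2, -3, -4]))
test = pd.DataFrame({'A': arr})

def reverse_slice():
    # Solution 3: reverse, label-slice back to the first non-negative row.
    rev = test.iloc[::-1]
    return rev.loc[rev.index.max():rev[rev['A'].ge(0)].index[0] + 1].iloc[::-1]

def split_last():
    # Solution 1: split at sign changes and keep the last segment.
    split_points = (test.A.shift(1) < 0) == (test.A < 0)
    return np.split(test, split_points.loc[lambda x: ~x].index.tolist())[-1]

for fn in (reverse_slice, split_last):
    print(fn.__name__, min(timeit.repeat(fn, number=1, repeat=3)))
```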
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | yunfei |
| Solution 3 | |
| Solution 4 | Marco_CH |


