Is there an efficient way for pandas to get tail rows with a condition?

I want to get the tail rows that satisfy a condition.

For example, I want to get all of the trailing negative rows from a column 'A', like:

    test = pd.DataFrame({'A':[-8, -9, -10, 1, 2, 3, 0, -1,-2,-3]})

I expect a 'method' to get new frame like:

        A
    0  -1
    1  -2
    2  -3

Note that the number of negative values in the tail is not known in advance, so I cannot simply run test.tail(3).

It looks like the pandas-provided tail() function only accepts a fixed number of rows.

But my input data frame might be very large, so I don't want to run a plain loop checking rows one by one.

Is there a smart way to do that?



Solution 1:[1]

What's the tail for? It seems like you just need the negative numbers:

    test.query("A < 0")

Update: find where the sign changes, split the array at those points, and take the last piece:

    split_points = (test.A.shift(1) < 0) == (test.A < 0)
    np.split(test, split_points.loc[lambda x: x == False].index.tolist())[-1]

Output:

        A
    7  -1
    8  -2
    9  -3
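The same idea can be expressed without np.split: find the position of the last row that fails the condition and slice everything after it. This is a sketch of my own (the variable names are not from the answer), not the answerer's code:

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'A': [-8, -9, -10, 1, 2, 3, 0, -1, -2, -3]})

# Positions (0-based) of rows that fail the "negative" condition.
nonneg_pos = np.flatnonzero((test['A'] >= 0).to_numpy())

# The trailing negative block starts right after the last non-negative row;
# if every row is negative, the whole frame is the tail.
tail = test.iloc[nonneg_pos[-1] + 1:] if len(nonneg_pos) else test
print(tail)  # rows 7, 8, 9 -> -1, -2, -3
```

This avoids materializing all the split pieces and only does one boolean scan plus one slice.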

Solution 2:[2]

Just sharing a performance comparison of the two answers above. Thanks Patryk and Marco. (image: performance comparison)

Solution 3:[3]

Is this what you wanted?

    test = pd.DataFrame({'A': [-8, -9, -10, 1, 2, 3, 0, -1, -2, -3]})
    test = test.iloc[::-1]
    test.loc[test.index.max():test[test['A'].ge(0)].index[0]+1]

Output:

        A
    9  -3
    8  -2
    7  -1

Edit: if you want to get it back into the original order:

    test.loc[test.index.max():test[test['A'].ge(0)].index[0]+1].iloc[::-1]

        A
    7  -1
    8  -2
    9  -3

Optionally add .reset_index(drop=True) if you need an index starting at 0.
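Putting the steps from this answer together (reverse, slice up to the first non-negative row, reverse back, reset the index), a minimal end-to-end sketch might look like:

```python
import pandas as pd

test = pd.DataFrame({'A': [-8, -9, -10, 1, 2, 3, 0, -1, -2, -3]})

rev = test.iloc[::-1]  # walk the frame from the bottom up
# Slice from the last label down to just past the first non-negative row
# encountered from the bottom (label-based .loc slicing is inclusive).
tail = rev.loc[rev.index.max():rev[rev['A'].ge(0)].index[0] + 1]
tail = tail.iloc[::-1].reset_index(drop=True)  # original order, fresh 0-based index
print(tail)
```

Note that this assumes at least one row satisfies A >= 0, as in the example; otherwise the `.index[0]` lookup raises an IndexError.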

Solution 4:[4]

I improved my test above and ran another round, since the old sample size felt too small and I was afraid the %%time measurement might not be accurate.

My new test uses a very large head of 10,000,000 positive numbers followed by a tail of 3 negative numbers,

so it can show how the overall data frame size impacts performance.

The code is as below (two separate notebook cells, each starting with %%time):

    %%time
    arr = np.arange(1, 10000000, 1)
    arr = np.concatenate((arr, [-2, -3, -4]))
    test = pd.DataFrame({'A': arr})
    test = test.iloc[::-1]
    test.loc[test.index.max():test[test['A'].ge(0)].index[0]+1].iloc[::-1]

    %%time
    arr = np.arange(1, 10000000, 1)
    arr = np.concatenate((arr, [-2, -3, -4]))
    test = pd.DataFrame({'A': arr})
    split_points = (test.A.shift(1) < 0) == (test.A < 0)
    np.split(test, split_points.loc[lambda x: x == False].index.tolist())[-1]

To account for system noise, I ran the test 10 times; the two methods perform very similarly. In about 50% of the runs Patryk's code is even faster.

(image: benchmark results over 10 runs)
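Since the benchmark image is not reproduced here, the two approaches can be re-timed with timeit. This is a sketch, using a smaller 1,000,000-row head than the post's 10,000,000 so it runs quickly; the function names are mine:

```python
import timeit
import numpy as np
import pandas as pd

arr = np.concatenate((np.arange(1, 1_000_000), [-2, -3, -4]))
test = pd.DataFrame({'A': arr})

def reverse_slice():
    # Solution 3: reverse, slice down to the first non-negative row, reverse back.
    rev = test.iloc[::-1]
    return rev.loc[rev.index.max():rev[rev['A'].ge(0)].index[0] + 1].iloc[::-1]

def split_last():
    # Solution 1: split wherever the sign of A changes and keep the last piece.
    split_points = (test.A.shift(1) < 0) == (test.A < 0)
    return np.split(test, split_points.loc[lambda x: x == False].index.tolist())[-1]

print('reverse_slice:', timeit.timeit(reverse_slice, number=5))
print('split_last:   ', timeit.timeit(split_last, number=5))
```

Both functions return the same three trailing negative rows, so the comparison is apples to apples.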

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1
Solution 2  yunfei
Solution 3
Solution 4  Marco_CH