In Pandas, how to return the index of the next value which is above/below a threshold

I have a dataframe like this:

    date                value
0   2018-05-15 06:00:00 100.86
1   2018-05-15 07:00:00 101.99
2   2018-05-15 08:00:00 110.00
3   2018-05-15 09:00:00 95.49
4   2018-05-15 10:00:00 92.32

I would like to create a new column that gives the index of the next value that is a certain amount above or below the current value, e.g. more than 5 over, or more than 3 under:

    date                value  Over_5 Under_3
0   2018-05-15 06:00:00 100.86  2     3
1   2018-05-15 07:00:00 101.99  2     3
2   2018-05-15 08:00:00 110.00  NaN   3
3   2018-05-15 09:00:00 95.49   NaN   4
4   2018-05-15 10:00:00 92.32   NaN   NaN

with the idea that, ideally, I would then use a Boolean operator to indicate which comes first, over 5 (0) or under 3 (1):

    date                value  Over_5 Under_3 Bool
0   2018-05-15 06:00:00 100.86  2     3       1
1   2018-05-15 07:00:00 101.99  2     3       1
2   2018-05-15 08:00:00 110.00  NaN   3       NaN
3   2018-05-15 09:00:00 95.49   NaN   4       NaN
4   2018-05-15 10:00:00 92.32   NaN   NaN     NaN

Apologies, I know it's not the best example. My current thinking has been a groupby with idxmax, but I'm not really sure how to do it, except maybe with a while/for loop.

It's to label-encode for an ML project, so a good, vectorized way to carry this out would be great. Thanks!



Solution 1:[1]

What about:

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2018-05-15 06:00:00', '2018-05-15 07:00:00', '2018-05-15 08:00:00',
                            '2018-05-15 09:00:00', '2018-05-15 10:00:00']), 
                   'value': [100.86, 101.99, 110.00, 95.49, 92.32]})

df['Over_5'] = df.apply(lambda row: (df[df.index>=row.name].value - row.value)
                        .where(lambda v: v > 5).first_valid_index(), axis=1)
df['Under_3'] = df.apply(lambda row: (df[df.index>=row.name].value - row.value)
                         .where(lambda v: v < -3).first_valid_index(), axis=1)

df['Bool'] = (df['Over_5'] < df['Under_3']).replace(False, np.nan).astype(float)

print(df)

# Prints:
                 date   value  Over_5  Under_3  Bool
0 2018-05-15 06:00:00  100.86     2.0      3.0   1.0
1 2018-05-15 07:00:00  101.99     2.0      3.0   1.0
2 2018-05-15 08:00:00  110.00     NaN      3.0   NaN
3 2018-05-15 09:00:00   95.49     NaN      4.0   NaN
4 2018-05-15 10:00:00   92.32     NaN      NaN   NaN
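A side note on the Bool column: it relies on the fact that any comparison against NaN evaluates to False, so `.replace(False, np.nan)` maps both "no Over_5 index" and "Under_3 comes first" to NaN, while True becomes 1.0. A minimal sketch with made-up Over_5/Under_3 values:

```python
import numpy as np
import pandas as pd

# Hypothetical Over_5 / Under_3 indices: row 0 has both, row 1 is missing Over_5.
over = pd.Series([2.0, np.nan])
under = pd.Series([3.0, 3.0])

# NaN < 3.0 is False, so the missing row falls through to NaN.
bool_col = (over < under).replace(False, np.nan).astype(float)
print(bool_col.tolist())  # [1.0, nan]
```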

Some details:

# Gets the values that come after the current row.
df[df.index>=row.name].value

# Function that subtracts the value of the current row.
lambda row: (df[df.index>=row.name].value - row.value)

# Function that gets the first index where the difference is > 5
f = lambda row: (df[df.index>=row.name].value - row.value).where(lambda v: v > 5).first_valid_index()

# Apply the function of the previous line to every row.
df.apply(f, axis=1)
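To see the building blocks in isolation, here is a small standalone sketch of the `where()` + `first_valid_index()` pattern on a toy Series (values chosen arbitrarily):

```python
import pandas as pd

s = pd.Series([1.0, 7.0, 2.0, 9.0])

# where() keeps values that satisfy the condition and masks the rest
# with NaN, so first_valid_index() yields the first index that passes.
masked = s.where(lambda v: v > 5)
first_idx = masked.first_valid_index()
print(first_idx)  # -> 1
```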

Numpy alternative:

An alternative with numpy which should be faster:

import numpy as np 

arr = df['value'].to_numpy()
# Calculate all differences between values at once.
# Then, take the upper triangular matrix (i.e., only differences with values coming afterwards).
all_diffs = np.triu(arr - arr[:, None])

# Check for the first index where the condition is fulfilled.
df['Over_5'] = np.argmax(all_diffs > 5, axis=1)
df['Under_3'] = np.argmax(all_diffs < -3, axis=1) 
df[['Over_5', 'Under_3']] = df[['Over_5', 'Under_3']].replace(0, np.nan)

df['Bool'] = (df['Over_5'] < df['Under_3']).replace(False, np.nan).astype(float)

print(df)
                 date   value  Over_5  Under_3  Bool
0 2018-05-15 06:00:00  100.86     2.0      3.0   1.0
1 2018-05-15 07:00:00  101.99     2.0      3.0   1.0
2 2018-05-15 08:00:00  110.00     NaN      3.0   NaN
3 2018-05-15 09:00:00   95.49     NaN      4.0   NaN
4 2018-05-15 10:00:00   92.32     NaN      NaN   NaN
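One caveat worth noting: `np.argmax` returns 0 both when the first column matches and when no column matches. In this setup the diagonal difference is always 0, so a genuine hit at column 0 is impossible and replacing 0 with NaN is safe, but guarding with a mask is more robust in general. A minimal sketch using the same sample values:

```python
import numpy as np

arr = np.array([100.86, 101.99, 110.00, 95.49, 92.32])
all_diffs = np.triu(arr - arr[:, None])

mask = all_diffs > 5
idx = np.argmax(mask, axis=1).astype(float)
# Explicitly mark rows where no later value exceeds the threshold,
# instead of relying on argmax's ambiguous 0.
idx[~mask.any(axis=1)] = np.nan
print(idx)  # rows with no value more than 5 above become NaN
```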

Runtime difference:

# 2.92 ms ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit calc_pandas()

# 587 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit calc_numpy()

So as you can see, the numpy version is about 5 times faster with your example, and potentially more so with bigger data sets. It requires extra memory for the n-by-n difference matrix, but that should be fine unless you have a really big dataframe.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
