'Loop Pandas DataFrame in window sizes, with conditions

I have a Pandas DataFrame:

price = [
    26810., 27524., 27728., 25739., 25221., 25767., 24367., 25056.,
    23899., 24164., 24981., 26953., 26953., 27632., 27089., 26153.,
    25846., 27628., 30096., 31153., 31153.
]

lower = [
    18012.32540124, 18930.80563138, 19349.47300484, 17600.62752257,
    17315.25044609, 18098.73169538, 16919.43582196, 17812.46170156,
    16859.50155408, 17327.29696869, 18347.99667756, 20478.26786004,
    20656.18820516, 21494.88201214, 21116.26237045, 20359.80486595,
    20228.98000157, 22085.23300384, 24515.87438892, 25390.78562257,
    25390.78562257
]

upper = [
    35607.67459876, 36117.19436862, 36106.52699516, 33877.37247743,
    33126.74955391, 33435.26830462, 31814.56417804, 32299.53829844,
    30938.49844592, 31000.70303131, 31614.00332244, 33427.73213996,
    33249.81179484, 33769.11798786, 33061.73762955, 31946.19513405,
    31463.01999843, 33170.76699616, 35676.12561108, 36915.21437743,
    36915.21437743
]

y = [1., 1., 1., 0., 0., 0., 1., 1., 0., 1., 1., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0.]


df = pd.DataFrame({
    'price': price, 
    'lower': lower, 
    'upper': upper, 
    'y': y})

df

    price   lower           upper           y
0   26810.0 18012.325401    35607.674599    1.0
1   27524.0 18930.805631    36117.194369    1.0
2   27728.0 19349.473005    36106.526995    1.0
3   25739.0 17600.627523    33877.372477    0.0
4   25221.0 17315.250446    33126.749554    0.0
5   25767.0 18098.731695    33435.268305    0.0
6   24367.0 16919.435822    31814.564178    1.0
7   25056.0 17812.461702    32299.538298    1.0
8   23899.0 16859.501554    30938.498446    1.0
9   32164.0 17327.296969    31000.703031    0.0
10  24981.0 18347.996678    31614.003322    0.0
11  26953.0 20478.267860    33427.732140    0.0
12  26953.0 20656.188205    33249.811795    1.0
13  27632.0 21494.882012    33769.117988    0.0
14  27089.0 21116.262370    33061.737630    0.0
15  26153.0 20359.804866    31946.195134    1.0
16  25846.0 20228.980002    31463.019998    1.0
17  27628.0 22085.233004    33170.766996    0.0
18  30096.0 24515.874389    35676.125611    1.0
19  31153.0 25390.785623    36915.214377    0.0
20  31153.0 25390.785623    36915.214377    0.0
21  16313.0 20228.980002    36915.214377    0.0

I want to be able to loop through the DataFrame rows in windows of a specified size i.e. window_sizeand copy the 'y' value forward through this window in a new column, 'y_new' until either:

  1. We reach the end of the window. In which case, we then consider the next row and loop in over the next window_size rows

  2. The price column exceeds either lower or upper row values at the start of the window - in which case, we stop the loop and start again at the succeeding row.

For example:

window_size == 5:

index == 0:

  • y == '1' # Long
  • upper == 35,607
  • lower == 18,012

We loop until index == 5:

  • y_new at index 0 to 5 == '1' because price did not exceed upper or lower.

Then, we start at index position == 6:

  • y == '1' # Long
  • upper == 35,607
  • lower == 18,012

Index == 6 to index == 9:

  • y_new == 1

At Index == 9, we stop, as price at index 9 is > upper (at index 6).

We then start again: index == 10.

At index = 20, we break as our index is out of bounds.

Desired Output

df
    price   lower           upper           y    y_new
0   26810.0 18012.325401    35607.674599    1.0  1.0
1   27524.0 18930.805631    36117.194369    1.0  1.0
2   27728.0 19349.473005    36106.526995    1.0  1.0
3   25739.0 17600.627523    33877.372477    0.0  1.0
4   25221.0 17315.250446    33126.749554    0.0  1.0
5   25767.0 18098.731695    33435.268305    0.0  1.0
6   24367.0 16919.435822    31814.564178    1.0  1.0
7   25056.0 17812.461702    32299.538298    1.0  1.0
8   23899.0 16859.501554    30938.498446    1.0  1.0
9   32164.0 17327.296969    31000.703031    0.0  0.0
10  24981.0 18347.996678    31614.003322    0.0  0.0
11  26953.0 20478.267860    33427.732140    0.0  0.0
12  26953.0 20656.188205    33249.811795    1.0  0.0
13  27632.0 21494.882012    33769.117988    0.0  0.0
14  27089.0 21116.262370    33061.737630    0.0  0.0
15  26153.0 20359.804866    31946.195134    1.0  1.0
16  25846.0 20228.980002    31463.019998    1.0  1.0
17  27628.0 22085.233004    33170.766996    0.0  1.0
18  30096.0 24515.874389    35676.125611    1.0  1.0
19  31153.0 25390.785623    36915.214377    0.0  1.0
20  31153.0 25390.785623    36915.214377    0.0  0.0
21  16313.0 20228.980002    36915.214377    0.0  0.0

Hopefully, this is clear.

Intuitively, this feels like something that cannot be accomplished with df.iterrows() and I would need to use a standard loop with break/ continue. I am unsure, however, how to implement the break and restart loop at the next index.

Many thanks for your help.



Solution 1:[1]

I believe I understand what you are trying to achieve, so given your input data for 'price', 'lower', 'upper', and 'y' which I note defines a slightly different dataframe than what you show in your example. the resulting df is shown below: df:

    price   lower           upper           y
0   26810.0 18012.325401    35607.674599    1.0
1   27524.0 18930.805631    36117.194369    1.0
2   27728.0 19349.473005    36106.526995    1.0
3   25739.0 17600.627523    33877.372477    0.0
4   25221.0 17315.250446    33126.749554    0.0
5   25767.0 18098.731695    33435.268305    0.0
6   24367.0 16919.435822    31814.564178    1.0
7   25056.0 17812.461702    32299.538298    1.0
8   23899.0 16859.501554    30938.498446    1.0
9   24164.0 17327.296969    31000.703031    0.0
10  24981.0 18347.996678    31614.003322    0.0
11  26953.0 20478.267860    33427.732140    0.0
12  26953.0 20656.188205    33249.811795    1.0
13  27632.0 21494.882012    33769.117988    0.0
14  27089.0 21116.262370    33061.737630    0.0
15  26153.0 20359.804866    31946.195134    1.0
16  25846.0 20228.980002    31463.019998    1.0
17  27628.0 22085.233004    33170.766996    0.0
18  30096.0 24515.874389    35676.125611    1.0
19  31153.0 25390.785623    36915.214377    0.0
20  31153.0 25390.785623    36915.214377    0.0
21  16313.0 20228.980002    36915.214377    0.0

I think to implement your logic correctly, you can do what you want using the following:

def addColumn(dfx:pd.DataFrame, winsze:int=5) -> list:
    lower = 0
    upper = 0
    indx = 0
    val = 0
    rslt = []
    cols = list(dfx.columns)
    reset = True
    wincnt = winsze
    while indx < dfx.shape[0]:
        if reset or wincnt == 0:
            reset = False
            lower = dfx.iat[indx, cols.index('lower')]
            upper = dfx.iat[indx, cols.index('upper')]
            val = dfx.iat[indx, cols.index('y')]
            wincnt = winsze
            rslt.append(val)
            indx += 1
        else:
            if (dfx.iat[indx, cols.index('price')] >= lower and
               dfx.iat[indx, cols.index('price')] <= upper):
                    rslt.append(val)
                    indx += 1
                    wincnt -= 1
            else:
                reset = True
    return rslt

then assign new column using: df['new_y'] = addColumn(df, 5)
which yields:

    price   lower           upper           y   new_y
0   26810.0 18012.325401    35607.674599    1.0 1.0
1   27524.0 18930.805631    36117.194369    1.0 1.0
2   27728.0 19349.473005    36106.526995    1.0 1.0
3   25739.0 17600.627523    33877.372477    0.0 1.0
4   25221.0 17315.250446    33126.749554    0.0 1.0
5   25767.0 18098.731695    33435.268305    0.0 1.0
6   24367.0 16919.435822    31814.564178    1.0 1.0
7   25056.0 17812.461702    32299.538298    1.0 1.0
8   23899.0 16859.501554    30938.498446    1.0 1.0
9   24164.0 17327.296969    31000.703031    0.0 1.0
10  24981.0 18347.996678    31614.003322    0.0 1.0
11  26953.0 20478.267860    33427.732140    0.0 1.0
12  26953.0 20656.188205    33249.811795    1.0 1.0
13  27632.0 21494.882012    33769.117988    0.0 1.0
14  27089.0 21116.262370    33061.737630    0.0 1.0
15  26153.0 20359.804866    31946.195134    1.0 1.0
16  25846.0 20228.980002    31463.019998    1.0 1.0
17  27628.0 22085.233004    33170.766996    0.0 1.0
18  30096.0 24515.874389    35676.125611    1.0 1.0
19  31153.0 25390.785623    36915.214377    0.0 1.0
20  31153.0 25390.785623    36915.214377    0.0 1.0
21  16313.0 20228.980002    36915.214377    0.0 0.0  

Solution 2:[2]

Here is what I have written to achieve the above. It is not the most elegant, or efficient by any stretch, it assumes structure and column order, and the condition to check where the iteration is near the end of the DataFrame is somewhat suboptimal, so please feel free to edit and improve:

def _assign_no_overlap(df: pd.DataFrame, window: int) -> pd.DataFrame:
    assert all(np.isin(['price', 'lower', 'upper', 'y'], df.columns)), \
    f"df missing cols: {df.columns[~df.columns.isin(['price', 'lower', 'upper', 'y'])]}"
    df = df.copy()
    df = df.dropna()
    df['y_non_overlap'] = -1
    # Get non-overlapping labels
    for i in range(0, len(df)):
        # Initial conditions
        if i == 0:
            initial_y = df.iloc[i, 3]
            initial_upper = df.iloc[i, 2]
            initial_lower = df.iloc[i, 1]
        # Rolling conditions
        price = df.iloc[i, 0]
        df.iloc[i, 4] = initial_y # y_new == y
        if (len(df)-i) < window:
            break
        # Price exceeds upper/lower, reevaluate y
        if (price >= initial_upper) or (price <= initial_lower):
            price = df.iloc[i, 0]
            initial_y = df.iloc[i, 3]
            initial_upper = df.iloc[i, 2]
            initial_lower = df.iloc[i, 1]
            # assign y_new to latest pred
            df.iloc[i, 4] = initial_y  # assign y_new to latest pred
            continue
        # At end of window - take new prediction
        elif i % window == 0:
            price = df.iloc[i, 0]
            initial_y = df.iloc[i, 3]
            initial_upper = df.iloc[i, 2]
            initial_lower = df.iloc[i, 1]
            df.iloc[i, 4] = initial_y # assign y_new to latest pred
            continue

    return df

df_n = _assign_no_overlap(df, 5)

    price   lower           upper           y    y_non_overlap
0   26810.0 18012.325401    35607.674599    1.0  1.0
1   27524.0 18930.805631    36117.194369    1.0  1.0
2   27728.0 19349.473005    36106.526995    1.0  1.0
3   25739.0 17600.627523    33877.372477    0.0  1.0
4   25221.0 17315.250446    33126.749554    0.0  1.0
5   25767.0 18098.731695    33435.268305    0.0  0.0
6   24367.0 16919.435822    31814.564178    1.0  0.0
7   25056.0 17812.461702    32299.538298    1.0  0.0
8   23899.0 16859.501554    30938.498446    1.0  0.0
9   32164.0 17327.296969    31000.703031    0.0  0.0
10  24981.0 18347.996678    31614.003322    0.0  0.0
11  26953.0 20478.267860    33427.732140    0.0  0.0
12  26953.0 20656.188205    33249.811795    1.0  0.0
13  27632.0 21494.882012    33769.117988    0.0  0.0
14  27089.0 21116.262370    33061.737630    0.0  0.0
15  26153.0 20359.804866    31946.195134    1.0  1.0
16  25846.0 20228.980002    31463.019998    1.0  1.0
17  27628.0 22085.233004    33170.766996    0.0  1.0
18  30096.0 24515.874389    35676.125611    1.0  1.0
19  31153.0 25390.785623    36915.214377    0.0  -1.0
20  31153.0 25390.785623    36915.214377    0.0  -1.0
21  16313.0 20228.980002    36915.214377    0.0  -1.0

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 itprorh66
Solution 2 cmp