'Loop Pandas DataFrame in window sizes, with conditions
I have a Pandas DataFrame:
price = [
26810., 27524., 27728., 25739., 25221., 25767., 24367., 25056.,
23899., 24164., 24981., 26953., 26953., 27632., 27089., 26153.,
25846., 27628., 30096., 31153., 31153.
]
lower = [
18012.32540124, 18930.80563138, 19349.47300484, 17600.62752257,
17315.25044609, 18098.73169538, 16919.43582196, 17812.46170156,
16859.50155408, 17327.29696869, 18347.99667756, 20478.26786004,
20656.18820516, 21494.88201214, 21116.26237045, 20359.80486595,
20228.98000157, 22085.23300384, 24515.87438892, 25390.78562257,
25390.78562257
]
upper = [
35607.67459876, 36117.19436862, 36106.52699516, 33877.37247743,
33126.74955391, 33435.26830462, 31814.56417804, 32299.53829844,
30938.49844592, 31000.70303131, 31614.00332244, 33427.73213996,
33249.81179484, 33769.11798786, 33061.73762955, 31946.19513405,
31463.01999843, 33170.76699616, 35676.12561108, 36915.21437743,
36915.21437743
]
y = [1., 1., 1., 0., 0., 0., 1., 1., 0., 1., 1., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0.]
df = pd.DataFrame({
'price': price,
'lower': lower,
'upper': upper,
'y': y})
df
price lower upper y
0 26810.0 18012.325401 35607.674599 1.0
1 27524.0 18930.805631 36117.194369 1.0
2 27728.0 19349.473005 36106.526995 1.0
3 25739.0 17600.627523 33877.372477 0.0
4 25221.0 17315.250446 33126.749554 0.0
5 25767.0 18098.731695 33435.268305 0.0
6 24367.0 16919.435822 31814.564178 1.0
7 25056.0 17812.461702 32299.538298 1.0
8 23899.0 16859.501554 30938.498446 1.0
9 32164.0 17327.296969 31000.703031 0.0
10 24981.0 18347.996678 31614.003322 0.0
11 26953.0 20478.267860 33427.732140 0.0
12 26953.0 20656.188205 33249.811795 1.0
13 27632.0 21494.882012 33769.117988 0.0
14 27089.0 21116.262370 33061.737630 0.0
15 26153.0 20359.804866 31946.195134 1.0
16 25846.0 20228.980002 31463.019998 1.0
17 27628.0 22085.233004 33170.766996 0.0
18 30096.0 24515.874389 35676.125611 1.0
19 31153.0 25390.785623 36915.214377 0.0
20 31153.0 25390.785623 36915.214377 0.0
21 16313.0 20228.980002 36915.214377 0.0
I want to be able to loop through the DataFrame rows in windows of a specified size i.e. window_sizeand copy the 'y' value forward through this window in a new column, 'y_new' until either:
We reach the end of the window. In which case, we then consider the next row and loop in over the next
window_sizerowsThe
pricecolumn exceeds eitherlowerorupperrow values at the start of the window - in which case, we stop the loop and start again at the succeeding row.
For example:
window_size == 5:
index == 0:
y== '1' # Longupper== 35,607lower== 18,012
We loop until index == 5:
y_newat index 0 to 5 == '1' becausepricedid not exceedupperorlower.
Then, we start at index position == 6:
y== '1' # Longupper== 35,607lower== 18,012
Index == 6 to index == 9:
y_new== 1
At Index == 9, we stop, as price at index 9 is > upper (at index 6).
We then start again: index == 10.
At index = 20, we break as our index is out of bounds.
Desired Output
df
price lower upper y y_new
0 26810.0 18012.325401 35607.674599 1.0 1.0
1 27524.0 18930.805631 36117.194369 1.0 1.0
2 27728.0 19349.473005 36106.526995 1.0 1.0
3 25739.0 17600.627523 33877.372477 0.0 1.0
4 25221.0 17315.250446 33126.749554 0.0 1.0
5 25767.0 18098.731695 33435.268305 0.0 1.0
6 24367.0 16919.435822 31814.564178 1.0 1.0
7 25056.0 17812.461702 32299.538298 1.0 1.0
8 23899.0 16859.501554 30938.498446 1.0 1.0
9 32164.0 17327.296969 31000.703031 0.0 0.0
10 24981.0 18347.996678 31614.003322 0.0 0.0
11 26953.0 20478.267860 33427.732140 0.0 0.0
12 26953.0 20656.188205 33249.811795 1.0 0.0
13 27632.0 21494.882012 33769.117988 0.0 0.0
14 27089.0 21116.262370 33061.737630 0.0 0.0
15 26153.0 20359.804866 31946.195134 1.0 1.0
16 25846.0 20228.980002 31463.019998 1.0 1.0
17 27628.0 22085.233004 33170.766996 0.0 1.0
18 30096.0 24515.874389 35676.125611 1.0 1.0
19 31153.0 25390.785623 36915.214377 0.0 1.0
20 31153.0 25390.785623 36915.214377 0.0 0.0
21 16313.0 20228.980002 36915.214377 0.0 0.0
Hopefully, this is clear.
Intuitively, this feels like something that cannot be accomplished with df.iterrows() and I would need to use a standard loop with break/ continue. I am unsure, however, how to implement the break and restart loop at the next index.
Many thanks for your help.
Solution 1:[1]
I believe I understand what you are trying to achieve, so given your input data for 'price', 'lower', 'upper', and 'y' which I note defines a slightly different dataframe than what you show in your example. the resulting df is shown below: df:
price lower upper y
0 26810.0 18012.325401 35607.674599 1.0
1 27524.0 18930.805631 36117.194369 1.0
2 27728.0 19349.473005 36106.526995 1.0
3 25739.0 17600.627523 33877.372477 0.0
4 25221.0 17315.250446 33126.749554 0.0
5 25767.0 18098.731695 33435.268305 0.0
6 24367.0 16919.435822 31814.564178 1.0
7 25056.0 17812.461702 32299.538298 1.0
8 23899.0 16859.501554 30938.498446 1.0
9 24164.0 17327.296969 31000.703031 0.0
10 24981.0 18347.996678 31614.003322 0.0
11 26953.0 20478.267860 33427.732140 0.0
12 26953.0 20656.188205 33249.811795 1.0
13 27632.0 21494.882012 33769.117988 0.0
14 27089.0 21116.262370 33061.737630 0.0
15 26153.0 20359.804866 31946.195134 1.0
16 25846.0 20228.980002 31463.019998 1.0
17 27628.0 22085.233004 33170.766996 0.0
18 30096.0 24515.874389 35676.125611 1.0
19 31153.0 25390.785623 36915.214377 0.0
20 31153.0 25390.785623 36915.214377 0.0
21 16313.0 20228.980002 36915.214377 0.0
I think to implement your logic correctly, you can do what you want using the following:
def addColumn(dfx:pd.DataFrame, winsze:int=5) -> list:
lower = 0
upper = 0
indx = 0
val = 0
rslt = []
cols = list(dfx.columns)
reset = True
wincnt = winsze
while indx < dfx.shape[0]:
if reset or wincnt == 0:
reset = False
lower = dfx.iat[indx, cols.index('lower')]
upper = dfx.iat[indx, cols.index('upper')]
val = dfx.iat[indx, cols.index('y')]
wincnt = winsze
rslt.append(val)
indx += 1
else:
if (dfx.iat[indx, cols.index('price')] >= lower and
dfx.iat[indx, cols.index('price')] <= upper):
rslt.append(val)
indx += 1
wincnt -= 1
else:
reset = True
return rslt
then assign new column using:
df['new_y'] = addColumn(df, 5)
which yields:
price lower upper y new_y
0 26810.0 18012.325401 35607.674599 1.0 1.0
1 27524.0 18930.805631 36117.194369 1.0 1.0
2 27728.0 19349.473005 36106.526995 1.0 1.0
3 25739.0 17600.627523 33877.372477 0.0 1.0
4 25221.0 17315.250446 33126.749554 0.0 1.0
5 25767.0 18098.731695 33435.268305 0.0 1.0
6 24367.0 16919.435822 31814.564178 1.0 1.0
7 25056.0 17812.461702 32299.538298 1.0 1.0
8 23899.0 16859.501554 30938.498446 1.0 1.0
9 24164.0 17327.296969 31000.703031 0.0 1.0
10 24981.0 18347.996678 31614.003322 0.0 1.0
11 26953.0 20478.267860 33427.732140 0.0 1.0
12 26953.0 20656.188205 33249.811795 1.0 1.0
13 27632.0 21494.882012 33769.117988 0.0 1.0
14 27089.0 21116.262370 33061.737630 0.0 1.0
15 26153.0 20359.804866 31946.195134 1.0 1.0
16 25846.0 20228.980002 31463.019998 1.0 1.0
17 27628.0 22085.233004 33170.766996 0.0 1.0
18 30096.0 24515.874389 35676.125611 1.0 1.0
19 31153.0 25390.785623 36915.214377 0.0 1.0
20 31153.0 25390.785623 36915.214377 0.0 1.0
21 16313.0 20228.980002 36915.214377 0.0 0.0
Solution 2:[2]
Here is what I have written to achieve the above. It is not the most elegant, or efficient by any stretch, it assumes structure and column order, and the condition to check where the iteration is near the end of the DataFrame is somewhat suboptimal, so please feel free to edit and improve:
def _assign_no_overlap(df: pd.DataFrame, window: int) -> pd.DataFrame:
assert all(np.isin(['price', 'lower', 'upper', 'y'], df.columns)), \
f"df missing cols: {df.columns[~df.columns.isin(['price', 'lower', 'upper', 'y'])]}"
df = df.copy()
df = df.dropna()
df['y_non_overlap'] = -1
# Get non-overlapping labels
for i in range(0, len(df)):
# Initial conditions
if i == 0:
initial_y = df.iloc[i, 3]
initial_upper = df.iloc[i, 2]
initial_lower = df.iloc[i, 1]
# Rolling conditions
price = df.iloc[i, 0]
df.iloc[i, 4] = initial_y # y_new == y
if (len(df)-i) < window:
break
# Price exceeds upper/lower, reevaluate y
if (price >= initial_upper) or (price <= initial_lower):
price = df.iloc[i, 0]
initial_y = df.iloc[i, 3]
initial_upper = df.iloc[i, 2]
initial_lower = df.iloc[i, 1]
# assign y_new to latest pred
df.iloc[i, 4] = initial_y # assign y_new to latest pred
continue
# At end of window - take new prediction
elif i % window == 0:
price = df.iloc[i, 0]
initial_y = df.iloc[i, 3]
initial_upper = df.iloc[i, 2]
initial_lower = df.iloc[i, 1]
df.iloc[i, 4] = initial_y # assign y_new to latest pred
continue
return df
df_n = _assign_no_overlap(df, 5)
price lower upper y y_non_overlap
0 26810.0 18012.325401 35607.674599 1.0 1.0
1 27524.0 18930.805631 36117.194369 1.0 1.0
2 27728.0 19349.473005 36106.526995 1.0 1.0
3 25739.0 17600.627523 33877.372477 0.0 1.0
4 25221.0 17315.250446 33126.749554 0.0 1.0
5 25767.0 18098.731695 33435.268305 0.0 0.0
6 24367.0 16919.435822 31814.564178 1.0 0.0
7 25056.0 17812.461702 32299.538298 1.0 0.0
8 23899.0 16859.501554 30938.498446 1.0 0.0
9 32164.0 17327.296969 31000.703031 0.0 0.0
10 24981.0 18347.996678 31614.003322 0.0 0.0
11 26953.0 20478.267860 33427.732140 0.0 0.0
12 26953.0 20656.188205 33249.811795 1.0 0.0
13 27632.0 21494.882012 33769.117988 0.0 0.0
14 27089.0 21116.262370 33061.737630 0.0 0.0
15 26153.0 20359.804866 31946.195134 1.0 1.0
16 25846.0 20228.980002 31463.019998 1.0 1.0
17 27628.0 22085.233004 33170.766996 0.0 1.0
18 30096.0 24515.874389 35676.125611 1.0 1.0
19 31153.0 25390.785623 36915.214377 0.0 -1.0
20 31153.0 25390.785623 36915.214377 0.0 -1.0
21 16313.0 20228.980002 36915.214377 0.0 -1.0
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | itprorh66 |
| Solution 2 | cmp |
