Isolate 'events' from either an array or DataFrame

I have a dataset of rainfall values which is binned into one-hour bins. The data is available as either an array or a DataFrame; I will use whichever is easier to get the desired result.

The dataset is approximately 7000 rows by 11 columns and has NaN values which need to stay.

I apologise for pasting such a large chunk of the df, but it is needed to ask the question. Copying the data block and then running newest = pd.read_clipboard(sep=r'\s\s+') should give you a DataFrame called 'newest' which illustrates my question.
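
That is, something along these lines:

import pandas as pd

# copy the data block further down to the clipboard first, then:
newest = pd.read_clipboard(sep=r'\s\s+')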

What I want to do is isolate the section between rows 73 and 96. This can be defined as a 'rain event': it begins when there are 6 consecutive zeros in the 'highpoint_sum' column (rows 68-73) immediately before that column has positive values, and it ends when the columns 'ic_mean', 'mc_mean', 'md_mean' and 'imd_mean' are all zero.
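
To make those two conditions concrete, here is a rough, untested sketch of how I imagine they might be written in pandas, using the 'newest' DataFrame from above; I am not at all sure this is the right approach:

# rough, untested sketch of the two conditions, assuming the data is in 'newest'
hp_zero = newest['highpoint_sum'].eq(0)

# start: this row and the 5 before it are all zero, and rain begins the next hour
six_dry = hp_zero.astype(int).rolling(6).sum().eq(6)
possible_start = six_dry & newest['highpoint_sum'].shift(-1).gt(0)

# end: the four mean columns are all back to zero
possible_end = newest[['ic_mean', 'mc_mean', 'md_mean', 'imd_mean']].eq(0).all(axis=1)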

I have no idea if this is possible, or if I have explained it in a way that will satisfy the Stack Overflow police, and I would be happy to discuss/explain further if anyone is so inclined!

Hopefully someone can help :)

I have tried, and failed, to put some of the data here in a way in which it can be exported, so I will now post a very small part of the DataFrame which demonstrates the problem. I want to extract the subset between 2021-05-07 01:00 and 2021-05-09 03:00:00.
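
By hand, that extraction would look something like this (a rough sketch, assuming the datetime column has been parsed and set as the index); what I cannot work out is how to find these windows automatically:

import pandas as pd

# sketch: parse the datetime column, set it as the index, and slice out one known block
newest['datetime'] = pd.to_datetime(newest['datetime'])
newest = newest.set_index('datetime')
event = newest.loc['2021-05-07 01:00:00':'2021-05-09 03:00:00']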

These blocks will appear throughout the DataFrame, at irregular intervals, and will be of different sizes.

datetime    ic_mean mc_mean md_mean imd_mean    oak_mean    highpoint_sum   school_sum  open_sum    bresser_sum open_mean
2021-05-07 21:00:00 0.0 0.0 0.0 0.0     0   0.0         0.0
2021-05-07 22:00:00 0.0 0.0 0.0 0.0     0   0.0         0.0
2021-05-07 23:00:00 0.0 0.0 0.0 0.0     0   0.0         0.0
2021-05-08 00:00:00 0.0 0.0 0.0 0.0     0   0.0         0.0
2021-05-08 01:00:00 0.0 0.0 0.0 0.0635000149999998      0   0.0         0.0
2021-05-08 02:00:00 0.0 0.0 0.6985001540000007  0.6985001537499997      1.2700002800000014  0.0         0.6350001400000007
2021-05-08 03:00:00 0.95250021125   0.8255001820000003  1.6510003682499992  1.651000366     2.54000057  0.0         1.270000285
2021-05-08 04:00:00 1.7145003815000002  2.667000590749999   2.9845006635000013  2.2225004947499993      3.8100008400000007  0.0         1.9050004200000004
2021-05-08 05:00:00 2.66700059125   3.17500070525   2.9845006594999983  2.6035005772500006      3.556000789999999   0.0         1.7780003949999994
2021-05-08 06:00:00 2.5400005629999995  2.6035005800000004  2.984500662750001   3.1750007055        4.572001019999998   0.0         2.286000509999999
2021-05-08 07:00:00 3.7465008315    3.3655007474999996  3.7465008330000007  3.36550074675       6.350001410000001   0.0         3.1750007050000004
2021-05-08 08:00:00 3.619500802 4.5720010125    4.3180009575    3.746500832499999       6.096001350000002   0.0         3.048000675000001
2021-05-08 09:00:00 2.73050061  3.5560007900000006  2.7940006225    2.34950052      3.8100008500000015  0.0         1.9050004250000008
2021-05-08 10:00:00                         0.0         0.0
2021-05-08 11:00:00 1.968500439999998   1.9685004375000004  1.5875003524999993  2.095500465000002       2.032000449999998   0.0         1.016000224999999
2021-05-08 12:00:00 1.143000252500002   1.016000225 0.8890001949999986  1.3335002975000005      1.5240003400000006  0.0         0.7620001700000003
2021-05-08 13:00:00 0.5080001124999987  0.7620001699999985  0.5715001275000011  0.825500182499999       1.0160002200000022  0.0         0.5080001100000011
2021-05-08 14:00:00 0.25400005750000076 0.4445000975000024  0.31750007500000077 0.6350001400000007      0.25400006000000275 0.0         0.12700003000000137
2021-05-08 15:00:00 0.5715001274999993  0.3810000849999984  0.38100007999999796 0.7620001699999985      0.7620001700000003  0.19999999999998863         0.48100008499999447
2021-05-08 16:00:00 0.9525002100000011  1.1430002550000022  0.8255001850000019  1.0795002399999998      1.2700002799999979  0.0         0.6350001399999989
2021-05-08 17:00:00 0.25400005500000056 0.2540000549999988  0.12700002750000117 0.6350001400000007      0.25400006000000275 0.20000000000000284         0.2270000300000028
2021-05-08 18:00:00 1.7780003924999983  1.5875003524999993  1.3970003124999977  1.5240003400000006      2.286000499999993   0.0         1.1430002499999965
2021-05-08 19:00:00 1.7780003975000005  1.7780003949999994  1.3335002950000021  1.4605003249999982      2.032000449999998   0.20000000000000284         1.1160002250000005
2021-05-08 20:00:00 2.1590004800000004  2.4130005350000037  2.413000532499999   1.8415004074999999      3.556000789999999   0.0         1.7780003949999994
2021-05-08 21:00:00 1.7145003775000003  2.2225004899999963  1.6510003674999982  1.4605003224999997      2.286000510000008   0.0         1.143000255000004
2021-05-08 22:00:00 0.8890001999999999  0.8890002000000017  0.6985001550000005  0.8890001975000033      0.7620001699999932  0.0         0.3810000849999966
2021-05-08 23:00:00 0.2540000524999986  0.38100008500000015 0.31750006999999947 0.44450009999999907     0.7620001700000074  0.20000000000000284         0.48100008500000513
2021-05-09 00:00:00 0.0 0.1905000424999983  0.12700003000000137 0.1905000399999981      0   0.3999999999999915          0.19999999999999574
2021-05-09 01:00:00 0.0 0.0 0.0 0.0     0   0.4000000000000057          0.20000000000000284
2021-05-09 02:00:00 0.1270000299999996  0.06350001499999891 0.0 0.06350001500000246     0   0.0         0.0
2021-05-09 03:00:00 0.0 0.0 0.0 0.06350001499999891     0   0.4000000000000057          0.20000000000000284
2021-05-09 04:00:00 0.0 0.0 0.0 0.0     0   0.0         0.0
2021-05-09 05:00:00 0.0 0.0 0.0 0.06350001250000048     0   0.0         0.0
2021-05-09 06:00:00 0.0 0.0 0.0 0.06350001499999891     0   0.0         0.0
2021-05-09 07:00:00 0.0 0.0 0.0 0.0     0   0.0         0.0
2021-05-09 08:00:00 0.0 0.0 0.0 0.06350001250000048     0   0.0         0.0


Solution 1:[1]

The following code is how I solved this question. It produces a DataFrame 'eventdf' which holds the start and end index of each event in the original DataFrame.

import numpy as np
import pandas as pd

exec(open('int_no_plots.py').read())  # runs int_no_plots.py to produce meandf


meandf2 = meandf.copy()
meandf2.reset_index(inplace=True)
meandf2.drop(meandf2.columns[12:42], axis=1, inplace=True)  # drop unused columns
col = meandf2["highpoint_sum"]

delta1 = int(input("dry hours before event?: "))
delta2 = int(input("dry hours after event?: "))  # make this one more than delta1
cl = len(col)  # total number of hourly rows
rowlist = []
seglist = []
sum_delta = np.zeros((cl, 3))

# Event starts: a window of delta1 hours whose first delta1-1 values are all
# zero but whose total is positive, i.e. rain begins after a dry spell.
for i in range(len(col)):
    segment = col[i: i + delta1]
    segdf = segment.to_frame()
    segdf['index'] = segdf.index
    sum_delta[i][0] = i
    segsum = segment.sum()
    sum_delta[i][1] = segsum
    sum_delta[i][2] = segdf.iloc[0, 1]
    segsum2 = segment.iloc[0:delta1 - 1].sum()  # rainfall in the first delta1-1 hours
    seglist.append(segment)
    if len(segment) > delta1 - 1:
        if segsum > 0 and segsum2 == 0:
            rowlist.append(segdf.iloc[delta1 - 1, 1])
        

rowlistx = []
seglistx = []
sum_deltax = np.zeros((cl, 3))

# Event ends: a window of delta2 hours with rain in the first hour and none in
# the remaining delta2-1 hours, i.e. a dry spell follows the rain.
for i in range(len(col)):
    segmentx = col[i: i + delta2]
    segdfx = segmentx.to_frame()
    segdfx['index'] = segdfx.index
    sum_deltax[i][0] = i
    segsumx = segmentx.sum()
    sum_deltax[i][1] = segsumx
    sum_deltax[i][2] = segdfx.iloc[0, 1]
    segsum2x = segmentx.iloc[1:delta2].sum()  # rainfall in the last delta2-1 hours
    seglistx.append(segmentx)
    if len(segmentx) > delta2 - 1:
        if segsumx > 0 and segsum2x == 0:
            # gives the index 8 hours after rain stops to allow for drip-through
            rowlistx.append(segdfx.iloc[delta2 - 1, 1])

# if the final start has no matching end, drop it
if len(rowlist) > len(rowlistx):
    rowlist.pop()

eventdf = pd.DataFrame(rowlist, columns=['start_line'])
eventdf['end_line'] = rowlistx
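
With eventdf in hand, each event can then be sliced back out of meandf2, for example something like this (a sketch; since the index was reset, start_line and end_line are row labels of meandf2):

# slice every detected event out of meandf2 using its start/end row labels
events = [meandf2.loc[row['start_line']:row['end_line']]
          for _, row in eventdf.iterrows()]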

Solution 2:[2]

To demo the idea, here is my input dataframe,

df = pd.DataFrame({
    'new_if_6_zero': 
            [0., 2, 0, 0, 0, 0, 0, 0, 3, 2, 4, 5],
    'end_if_zero1': 
            [3., 0, 4, 5, 4, 3, 5, 6, 6, 1, 0, 2],
    'end_if_zero2': 
            [3., 0, 4, 5, 4, 3, 5, 6, 6, 0, 0, 1],
})

print(df)

    new_if_6_zero  end_if_zero1  end_if_zero2
0             0.0           3.0           3.0
1             2.0           0.0           0.0
2             0.0           4.0           4.0
3             0.0           5.0           5.0
4             0.0           4.0           4.0
5             0.0           3.0           3.0
6             0.0           5.0           5.0
7             0.0           6.0           6.0
8             3.0           6.0           6.0
9             2.0           1.0           0.0
10            4.0           0.0           0.0
11            5.0           2.0           1.0

We apply your logic to build columns 'x1' and 'x3', and we build 'x2' and 'x4' to label the different segments.

df['x1'] = df['new_if_6_zero'] == 0                 # potential "dry" rows
df['x2'] = df['x1'].diff().fillna(True).cumsum()    # label consecutive runs of x1

df['x3'] = (df['end_if_zero1'] == 0) & (df['end_if_zero2'] == 0)  # potential "end" rows
df['x4'] = df['x3'].diff().fillna(True).cumsum()    # label consecutive runs of x3

    new_if_6_zero  end_if_zero1  end_if_zero2     x1  x2     x3  x4
0             0.0           3.0           3.0   True   1  False   1
1             2.0           0.0           0.0  False   2   True   2
2             0.0           4.0           4.0   True   3  False   3
3             0.0           5.0           5.0   True   3  False   3
4             0.0           4.0           4.0   True   3  False   3
5             0.0           3.0           3.0   True   3  False   3
6             0.0           5.0           5.0   True   3  False   3
7             0.0           6.0           6.0   True   3  False   3
8             3.0           6.0           6.0  False   4  False   3
9             2.0           1.0           0.0  False   4  False   3
10            4.0           0.0           0.0  False   4   True   4
11            5.0           2.0           1.0  False   4  False   5

Together, 'x2' and 'x4' form a set of unique segment keys, which we use to check whether each segment satisfies the start or end condition.

trigger = df.groupby(['x2', 'x4']).agg({'x1': 'sum', 'x3': 'any'}).rename(columns={'x1': 'start_trigger', 'x3': 'end_trigger'})

       start_trigger  end_trigger
x2 x4                            
1  1               1        False
2  2               0         True
3  3               6        False
4  3               0        False
   4               0         True
   5               0        False

A valid trigger is one that satisfies your logic: the segment just before it contains at least 6 zeros (the start condition), and the segment just after it satisfies the end condition:

valid_trigger = trigger[(trigger['start_trigger'].shift()>=6) & (trigger['end_trigger'].shift(-1))]

And we pull those segments out using the index that remains in valid_trigger:

outcome = df.set_index(['x2', 'x4']).loc[valid_trigger.index]
print(outcome)

       new_if_6_zero  end_if_zero1  end_if_zero2     x1     x3
x2 x4                                                         
4  3             3.0           6.0           6.0  False  False
   3             2.0           1.0           0.0  False  False

These ('x2', 'x4') index values serve as names that separate one segment from another.
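
For example, the events could be iterated over separately with a groupby on those index levels (a sketch based on the outcome frame above):

# iterate over the recovered events, one ('x2', 'x4') key per event
for key, event in outcome.groupby(level=['x2', 'x4']):
    print(key)
    print(event)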

If you run into problems applying my code to your dataset, then besides describing the problems, please copy the input dataframe provided in my answer, paste it into the question, and modify it with only the changes necessary for me to reproduce the problem myself.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: matt cooper
Solution 2: Raymond Kwok