Isolate 'events' from either an array or DataFrame
I have a dataset of rainfall values binned into one-hour bins. The data is available as either an array or a DataFrame; I will use whichever makes it easier to get the desired result.
The dataset is approximately 7000 rows by 11 columns and contains NaN values which need to stay.
I apologise for pasting such a large chunk of the df, but it is needed to ask the question. Apparently, copying the snippet below and then running newest = pd.read_clipboard(sep='\s\s+') will give you a DataFrame called 'newest' which illustrates my question.
What I want to do is isolate the section between rows 73 and 96. This can be defined as a 'rain event': it begins when there are 6 consecutive zeros in the 'highpoint_sum' column (rows 68-73) before that column takes positive values, and it ends when the 'ic_mean', 'mc_mean', 'md_mean' and 'imd_mean' columns are all zero.
I have no idea if this is possible, or if I have explained it in a way that will satisfy the Stack Overflow police, and I would be happy to discuss/explain further if anyone is so inclined!
Hopefully someone can help :)
I have tried, and failed, to put some of the data here in a way in which it can be exported, so I will now post a very small part of the dataframe which demonstrates the problem. I want to extract the subset between 2021-05-07 01:00:00 and 2021-05-09 03:00:00 (a rough sketch of the slice I am after follows the sample data below).
These blocks will appear throughout the dataframe at non-regular intervals and will be different sizes.
datetime ic_mean mc_mean md_mean imd_mean oak_mean highpoint_sum school_sum open_sum bresser_sum open_mean
2021-05-07 21:00:00 0.0 0.0 0.0 0.0 0 0.0 0.0
2021-05-07 22:00:00 0.0 0.0 0.0 0.0 0 0.0 0.0
2021-05-07 23:00:00 0.0 0.0 0.0 0.0 0 0.0 0.0
2021-05-08 00:00:00 0.0 0.0 0.0 0.0 0 0.0 0.0
2021-05-08 01:00:00 0.0 0.0 0.0 0.0635000149999998 0 0.0 0.0
2021-05-08 02:00:00 0.0 0.0 0.6985001540000007 0.6985001537499997 1.2700002800000014 0.0 0.6350001400000007
2021-05-08 03:00:00 0.95250021125 0.8255001820000003 1.6510003682499992 1.651000366 2.54000057 0.0 1.270000285
2021-05-08 04:00:00 1.7145003815000002 2.667000590749999 2.9845006635000013 2.2225004947499993 3.8100008400000007 0.0 1.9050004200000004
2021-05-08 05:00:00 2.66700059125 3.17500070525 2.9845006594999983 2.6035005772500006 3.556000789999999 0.0 1.7780003949999994
2021-05-08 06:00:00 2.5400005629999995 2.6035005800000004 2.984500662750001 3.1750007055 4.572001019999998 0.0 2.286000509999999
2021-05-08 07:00:00 3.7465008315 3.3655007474999996 3.7465008330000007 3.36550074675 6.350001410000001 0.0 3.1750007050000004
2021-05-08 08:00:00 3.619500802 4.5720010125 4.3180009575 3.746500832499999 6.096001350000002 0.0 3.048000675000001
2021-05-08 09:00:00 2.73050061 3.5560007900000006 2.7940006225 2.34950052 3.8100008500000015 0.0 1.9050004250000008
2021-05-08 10:00:00 0.0 0.0
2021-05-08 11:00:00 1.968500439999998 1.9685004375000004 1.5875003524999993 2.095500465000002 2.032000449999998 0.0 1.016000224999999
2021-05-08 12:00:00 1.143000252500002 1.016000225 0.8890001949999986 1.3335002975000005 1.5240003400000006 0.0 0.7620001700000003
2021-05-08 13:00:00 0.5080001124999987 0.7620001699999985 0.5715001275000011 0.825500182499999 1.0160002200000022 0.0 0.5080001100000011
2021-05-08 14:00:00 0.25400005750000076 0.4445000975000024 0.31750007500000077 0.6350001400000007 0.25400006000000275 0.0 0.12700003000000137
2021-05-08 15:00:00 0.5715001274999993 0.3810000849999984 0.38100007999999796 0.7620001699999985 0.7620001700000003 0.19999999999998863 0.48100008499999447
2021-05-08 16:00:00 0.9525002100000011 1.1430002550000022 0.8255001850000019 1.0795002399999998 1.2700002799999979 0.0 0.6350001399999989
2021-05-08 17:00:00 0.25400005500000056 0.2540000549999988 0.12700002750000117 0.6350001400000007 0.25400006000000275 0.20000000000000284 0.2270000300000028
2021-05-08 18:00:00 1.7780003924999983 1.5875003524999993 1.3970003124999977 1.5240003400000006 2.286000499999993 0.0 1.1430002499999965
2021-05-08 19:00:00 1.7780003975000005 1.7780003949999994 1.3335002950000021 1.4605003249999982 2.032000449999998 0.20000000000000284 1.1160002250000005
2021-05-08 20:00:00 2.1590004800000004 2.4130005350000037 2.413000532499999 1.8415004074999999 3.556000789999999 0.0 1.7780003949999994
2021-05-08 21:00:00 1.7145003775000003 2.2225004899999963 1.6510003674999982 1.4605003224999997 2.286000510000008 0.0 1.143000255000004
2021-05-08 22:00:00 0.8890001999999999 0.8890002000000017 0.6985001550000005 0.8890001975000033 0.7620001699999932 0.0 0.3810000849999966
2021-05-08 23:00:00 0.2540000524999986 0.38100008500000015 0.31750006999999947 0.44450009999999907 0.7620001700000074 0.20000000000000284 0.48100008500000513
2021-05-09 00:00:00 0.0 0.1905000424999983 0.12700003000000137 0.1905000399999981 0 0.3999999999999915 0.19999999999999574
2021-05-09 01:00:00 0.0 0.0 0.0 0.0 0 0.4000000000000057 0.20000000000000284
2021-05-09 02:00:00 0.1270000299999996 0.06350001499999891 0.0 0.06350001500000246 0 0.0 0.0
2021-05-09 03:00:00 0.0 0.0 0.0 0.06350001499999891 0 0.4000000000000057 0.20000000000000284
2021-05-09 04:00:00 0.0 0.0 0.0 0.0 0 0.0 0.0
2021-05-09 05:00:00 0.0 0.0 0.0 0.06350001250000048 0 0.0 0.0
2021-05-09 06:00:00 0.0 0.0 0.0 0.06350001499999891 0 0.0 0.0
2021-05-09 07:00:00 0.0 0.0 0.0 0.0 0 0.0 0.0
2021-05-09 08:00:00 0.0 0.0 0.0 0.06350001250000048 0 0.0 0.0
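To make the target concrete, here is a minimal sketch of the kind of thing I am after, assuming the sample above has been read into a DataFrame called 'newest' (e.g. via pd.read_clipboard) and that the datetime column can be parsed and used as the index:

import pandas as pd

# assuming 'newest' already exists, e.g. from pd.read_clipboard(sep='\s\s+')
newest['datetime'] = pd.to_datetime(newest['datetime'])
newest = newest.set_index('datetime')

# the slice I want for this particular event (boundaries found by eye)
event = newest.loc['2021-05-07 01:00:00':'2021-05-09 03:00:00']

# rough idea of the start condition: True wherever the previous 6 hours of
# 'highpoint_sum' were all zero
six_dry_hours = newest['highpoint_sum'].eq(0).rolling(6).sum().eq(6)

What I cannot work out is how to turn conditions like these into the start/end rows of each event automatically.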
Solution 1:[1]
The following code is how I have solved this question. It produces a dataframe 'eventdf' which holds the start and end index of each event in the original dataframe.
import numpy as np
import pandas as pd

exec(open('int_no_plots.py').read())  # runs int_no_plots.py to produce meandf
meandf2 = meandf.copy()
meandf2.reset_index(inplace=True)
meandf2.drop(meandf2.iloc[:, 12:42], inplace=True, axis=1)

col = meandf2["highpoint_sum"]
delta1 = int(input("dry hours before event?: "))
delta2 = int(input("dry hours after event?: "))  # make this one more than delta1
cl = len(col)  # total number of hourly rows

# event starts: a window of delta1 hours whose first delta1-1 values are all
# zero, followed by rain
rowlist = []
seglist = []
sum_delta = np.zeros((cl, 3))
for i in range(cl):
    segment = col[i: i + delta1]
    segdf = segment.to_frame()
    segdf['index'] = segdf.index
    sum_delta[i][0] = i
    segsum = segment.sum()
    sum_delta[i][1] = segsum
    sum_delta[i][2] = segdf.iloc[0, 1]
    segsum2 = segment.iloc[0:delta1 - 1].sum()
    seglist.append(segment)
    if len(segment) > delta1 - 1:
        if segsum > 0 and segment.iloc[0:delta1 - 1].sum() == 0:
            rowlist.append(segdf.iloc[delta1 - 1, 1])

# event ends: a window of delta2 hours where only the first value is non-zero
# and everything after it is dry
rowlistx = []
seglistx = []
sum_deltax = np.zeros((cl, 3))
for i in range(cl):
    segmentx = col[i: i + delta2]
    segdfx = segmentx.to_frame()
    segdfx['index'] = segdfx.index
    sum_deltax[i][0] = i
    segsumx = segmentx.sum()
    sum_deltax[i][1] = segsumx
    sum_deltax[i][2] = segdfx.iloc[0, 1]
    segsum2x = segmentx.iloc[1:delta2].sum()
    seglistx.append(segmentx)
    if len(segmentx) > delta2 - 1:
        if segsumx > 0 and segmentx.iloc[1:delta2].sum() == 0:
            rowlistx.append(segdfx.iloc[delta2 - 1, 1])  # gives index 8 hours after rain stops, to allow for drip-through

# a final start with no matching end is discarded
if len(rowlist) > len(rowlistx):
    rowlist.pop()

eventdf = pd.DataFrame(rowlist, columns=['start_line'])
eventdf['end_line'] = rowlistx
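Once eventdf exists, the events themselves can be pulled out of meandf2, for example (a minimal sketch, assuming 'start_line' and 'end_line' hold row labels from the reset index):

# slice each event out of meandf2 using the start/end rows stored in eventdf
events = [
    meandf2.loc[start:end]
    for start, end in zip(eventdf['start_line'], eventdf['end_line'])
]
print(f"found {len(events)} events")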
Solution 2:[2]
To demonstrate the idea, here is my input dataframe:
import pandas as pd

df = pd.DataFrame({
    'new_if_6_zero': [0., 2, 0, 0, 0, 0, 0, 0, 3, 2, 4, 5],
    'end_if_zero1':  [3., 0, 4, 5, 4, 3, 5, 6, 6, 1, 0, 2],
    'end_if_zero2':  [3., 0, 4, 5, 4, 3, 5, 6, 6, 0, 0, 1],
})
print(df)
new_if_6_zero end_if_zero1 end_if_zero2
0 0.0 3.0 3.0
1 2.0 0.0 0.0
2 0.0 4.0 4.0
3 0.0 5.0 5.0
4 0.0 4.0 4.0
5 0.0 3.0 3.0
6 0.0 5.0 5.0
7 0.0 6.0 6.0
8 3.0 6.0 6.0
9 2.0 1.0 0.0
10 4.0 0.0 0.0
11 5.0 2.0 1.0
We apply your logic to build columns 'x1' and 'x3', and build 'x2' and 'x4' to label the different segments.
df['x1'] = df['new_if_6_zero']==0
df['x2'] = df['x1'].diff().fillna(True).cumsum()
df['x3'] = (df['end_if_zero1']==0) & (df['end_if_zero2']==0)
df['x4'] = df['x3'].diff().fillna(True).cumsum()
new_if_6_zero end_if_zero1 end_if_zero2 x1 x2 x3 x4
0 0.0 3.0 3.0 True 1 False 1
1 2.0 0.0 0.0 False 2 True 2
2 0.0 4.0 4.0 True 3 False 3
3 0.0 5.0 5.0 True 3 False 3
4 0.0 4.0 4.0 True 3 False 3
5 0.0 3.0 3.0 True 3 False 3
6 0.0 5.0 5.0 True 3 False 3
7 0.0 6.0 6.0 True 3 False 3
8 3.0 6.0 6.0 False 4 False 3
9 2.0 1.0 0.0 False 4 False 3
10 4.0 0.0 0.0 False 4 True 4
11 5.0 2.0 1.0 False 4 False 5
Together, 'x2' and 'x4' form a set of unique segment keys, which we use to check whether any segment satisfies the conditions for a start or an end.
trigger = df.groupby(['x2', 'x4']).agg({'x1': 'sum', 'x3': 'any'}).rename(columns={'x1': 'start_trigger', 'x3': 'end_trigger'})
start_trigger end_trigger
x2 x4
1 1 1 False
2 2 0 True
3 3 6 False
4 3 0 False
4 0 True
5 0 False
A valid trigger is one that satisfies your logic:
valid_trigger = trigger[(trigger['start_trigger'].shift()>=6) & (trigger['end_trigger'].shift(-1))]
We then pull those segments out using the index kept in valid_trigger:
outcome = df.set_index(['x2', 'x4']).loc[valid_trigger.index]
print(outcome)
new_if_6_zero end_if_zero1 end_if_zero2 x1 x3
x2 x4
4 3 3.0 6.0 6.0 False False
3 2.0 1.0 0.0 False False
These index values act as labels that separate one segment from another.
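If you then want each event as its own dataframe, one option (a small sketch building on the outcome frame above) is to group on that MultiIndex:

# split 'outcome' into one dataframe per (x2, x4) segment key
events = {
    key: grp.drop(columns=['x1', 'x3'])
    for key, grp in outcome.groupby(level=['x2', 'x4'])
}
for key, grp in events.items():
    print(key, len(grp), 'rows')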
If you run into problems applying my code to your dataset, please describe the problem and also copy the input dataframe from my answer into your question, modified with only the necessary changes, so that I can reproduce the problem myself.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | matt cooper |
| Solution 2 | Raymond Kwok |
