'Merge 2 pandas dataframes conditional by upper bound (float) and lower bound (float) and merge_asof by date
According to this post, it looks similar to my problem but I tried every solutions in that post and it cannot solve my problem due to different data type (e.g. earlier post has lower bound and upper bound as date but mine are float) and the way to merge (the earlier post use outer merge to get exacy match while my problem need merge_asof with backward direction). Let me explain my problem again. I have a sample dataframe as follows
df = pd.DataFrame({"date": ["2022-04-10", "2022-04-11", "2022-04-12", "2022-04-13"],
"occupancy": [30, 55, 28, 100]})
df["date"] = pd.to_datetime(df["date"])
so my dataframe look like this
index date occupancy
0 2022-04-10 30
1 2022-04-11 55
2 2022-04-12 28
3 2022-04-13 100
There is another dataframe that represents predefined rules as follows.
rules = pd.DataFrame({"lower": [0, 25, 50, 75, 25],
"upper": [25, 50, 75, 100, 50],
"action": [-15, -5, 20, 25, -10],
"filedate": ["2022-04-01", "2022-04-01", "2022-04-01", "2022-04-01", "2022-04-12"]})
rules["filedate"] = pd.to_datetime(rules["filedate"])
The rules look like this.
index lower upper action filedate
0 0 25 -15 2022-04-01
1 25 50 -5 2022-04-01
2 50 75 20 2022-04-01
3 75 100 25 2022-04-01
4 25 50 -10 2022-04-12
I would like to take the action column and add it to the df. Merging conditions are, lower <= occupancy < upper and filedate <= date but the rule is selected based on the nearest filedate that is less than date. The expected output should be the following.
index date occupancy action
0 2022-04-10 30 -5
1 2022-04-11 55 20
2 2022-04-12 28 -10
3 2022-04-13 100 NaN
Explanations of the above output are the following. For example, row index 0 of df, its occupancy lies between lower (25) and upper (50). There are 2 rows in the rules those satisfy such a condition (row index 1 and index row 4). However, action of row index 1 will be selected because the filedate 2022-04-01 is less than the date 2022-04-10 according to the condition filedate <= date. We can think that the filedate is an effective date to execute action.
Another example, row index 2 of df, its occupancy lies between lower (25) and upper (50). There are 2 rows in the rules those satisfy such a condition (row index 1 and index row 4) but the action of row index 4 will be selected because the filedate 2022-04-12 is less than or equal to the date 2022-04-12 according to the condition filedate <= date but the rule is selected based on the nearest filedate that is less than date.
I think my problem need to use merge_asof but I could not figure out how to effectively implement it. May I have your suggestions?
Solution 1:[1]
You could write a custom function to do the filtering and apply it to each row of your DataFrame:
def custom_merge(row):
sample = rules[rules["lower"].le(row.at["occupancy"])&rules["upper"].gt(row.at["occupancy"])&rules["filedate"].le(row.at["date"])]
if sample.shape[0] > 0:
action = sample.loc[sample["filedate"].sub(row["date"]).abs().idxmin()]["action"]
else:
action = np.nan
return action
df["action"] = df.apply(custom_merge, axis=1)
>>> df
date occupancy action
0 2022-04-10 30 -5.0
1 2022-04-11 55 20.0
2 2022-04-12 28 -10.0
3 2022-04-13 100 NaN
Solution 2:[2]
Ok. I can solve my problem now. It is not an efficient way but at least the problem solved. If there is the better way in term of efficiency, please advise.
df = pd.DataFrame({"date": ["2022-04-10", "2022-04-11", "2022-04-12", "2022-04-13"],
"occupancy": [30, 55, 28, 100]})
df["date"] = pd.to_datetime(df["date"])
rules = pd.DataFrame({"lower": [0, 25, 50, 75, 25],
"upper": [25, 50, 75, 100, 50],
"action": [-15, -5, 20, 25, -10],
"filedate": ["2022-04-01", "2022-04-01", "2022-04-01", "2022-04-01", "2022-04-12"]})
rules["filedate"] = pd.to_datetime(rules["filedate"])
actions = []
for i in df.index:
date = df.loc[i, "date"]
occ = df.loc[i, "occupancy"]
x = pd.DataFrame(df.loc[i, :]).T
subrules = rules[rules["filedate"] <= date]
mask = np.logical_and(subrules["lower"] <= occ, subrules["upper"] > occ)
temp = subrules[mask.values]
if temp.empty:
actions.append(np.nan)
else:
actions.append(pd.merge_asof(x, temp, left_on="date", right_on="filedate", direction="backward")["action"].values[0])
df["action"] = actions
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | not_speshal |
| Solution 2 | Ratchainant Thammasudjarit |
