How to merge two dfs in pandas (based on datetime period), and add rows if duplicates
I have the following 2 dfs:
diag
| id | encounter_key | start_of_period | end_of_period |
|---|---|---|---|
| 1 | AAA | 2020-06-12 | 2021-07-07 |
| 1 | BBB | 2021-12-31 | 2022-01-04 |
drug
| id | start_datetime | drug |
|---|---|---|
| 1 | 2020-06-16 | Mel |
| 1 | 2020-06-18 | Mel |
| 1 | 2020-06-18 | Flu |
| 1 | 2022-01-01 | Mel |
I want to combine (?merge/?join/?concatenate) the cols of drug where the start_datetime is within the start and end periods (inclusive) of diag, ending up with more rows in diag like:
| id | encounter_key | start_of_period | end_of_period | drug | start_datetime |
|---|---|---|---|---|---|
| 1 | AAA | 2020-06-12 | 2021-07-07 | Mel | 2020-06-16 |
| 1 | AAA | 2020-06-12 | 2021-07-07 | Mel | 2020-06-18 |
| 1 | AAA | 2020-06-12 | 2021-07-07 | Flu | 2020-06-18 |
| 1 | BBB | 2021-12-31 | 2022-01-04 | Mel | 2022-01-01 |
Hope this makes sense and apologies for not using the correct terms - I'm unsure of them. Thanks in advance.
Solution 1:[1]
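I repeat the first row of diag as many times as there are matching drug rows, append the remaining diag rows, and then concatenate the result with drug column-wise. Note that this hardcodes the repeat count (3) and relies on both dataframes being in matching row order. Perhaps someone will offer a better solution.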
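import numpy as np
import pandas as pd
# Keep every diag row after the first (here, the BBB encounter)
out = diag[1:]
diag = pd.DataFrame(np.repeat(diag.values[:1], 3, axis=0), columns=diag.columns).astype(diag.dtypes)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
diag = pd.concat([diag, out], ignore_index=True)
# Place drug next to diag by row position, then drop the duplicated id column
df = pd.concat([diag, drug], axis=1)
df = df.loc[:, ~df.columns.duplicated()]
df = df.reindex(columns=['id', 'encounter_key', 'start_of_period', 'end_of_period', 'drug', 'start_datetime'])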
Output
id encounter_key start_of_period end_of_period drug start_datetime
0 1 AAA 2020-06-12 2021-07-07 Mel 2020-06-16
1 1 AAA 2020-06-12 2021-07-07 Mel 2020-06-18
2 1 AAA 2020-06-12 2021-07-07 Flu 2020-06-18
3 1 BBB 2021-12-31 2022-01-04 Mel 2022-01-01
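The answer above depends on a fixed repeat count and on row order, so it may not carry over to other data. A more general sketch, which is not part of the original answer, is to merge on id and then keep only the rows whose start_datetime falls inside the period. It assumes the date columns are (or can be converted to) datetime64, and it rebuilds the sample frames from the question so that it runs on its own. The names merged, in_period, and result are just illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question (dates parsed as datetime64)
diag = pd.DataFrame({
    "id": [1, 1],
    "encounter_key": ["AAA", "BBB"],
    "start_of_period": pd.to_datetime(["2020-06-12", "2021-12-31"]),
    "end_of_period": pd.to_datetime(["2021-07-07", "2022-01-04"]),
})
drug = pd.DataFrame({
    "id": [1, 1, 1, 1],
    "start_datetime": pd.to_datetime(
        ["2020-06-16", "2020-06-18", "2020-06-18", "2022-01-01"]
    ),
    "drug": ["Mel", "Mel", "Flu", "Mel"],
})

# Pair every diag row with every drug row that shares the same id
merged = diag.merge(drug, on="id", how="inner")

# Keep only pairs whose start_datetime lies within the period;
# Series.between includes both endpoints by default
in_period = merged["start_datetime"].between(
    merged["start_of_period"], merged["end_of_period"]
)

cols = ["id", "encounter_key", "start_of_period", "end_of_period",
        "drug", "start_datetime"]
result = merged.loc[in_period, cols].reset_index(drop=True)
print(result)
```

Because the merge builds every diag/drug pair per id before filtering, the intermediate frame can get large when an id has many rows on both sides. Also, the inner merge drops diag rows that have no drug inside their period; if those rows need to be kept, they would have to be added back separately.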
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
