How to flatten a dataframe by a column containing ranges
Input dataframe:
df = pd.DataFrame(
    columns=['id', 'antibiotic', 'start_date', 'end_date'],
    data=[['Sophie', 'amoxicillin', 15, 17],
          ['Sophie', 'doxycycline', 19, 21],
          ['Sophie', 'amoxicillin', 20, 22],
          ['Robert', 'cephalexin', 12, 14],
          ['Robert', 'ciprofloxacin', 17, 18],
          ['Robert', 'clindamycin', 18, 18],
          ['Robert', 'cephalexin', 17, 19]])
I would like to expand each date range into one row per day, and join the antibiotic values with '/' when they fall on the same date, like below:
df_flat = pd.DataFrame(
    columns=['id', 'date', 'antibiotic'],
    data=[['Sophie', 15, 'amoxicillin'],
          ['Sophie', 16, 'amoxicillin'],
          ['Sophie', 17, 'amoxicillin'],
          ['Sophie', 18, NaN],
          ['Sophie', 19, 'doxycycline'],
          ['Sophie', 20, 'doxycycline/amoxicillin'],
          ['Sophie', 21, 'doxycycline/amoxicillin'],
          ['Sophie', 22, 'amoxicillin'],
          ['Robert', 12, 'cephalexin'],
          ['Robert', 13, 'cephalexin'],
          ['Robert', 14, 'cephalexin'],
          ['Robert', 15, NaN],
          ['Robert', 16, NaN],
          ['Robert', 17, 'ciprofloxacin/cephalexin'],
          ['Robert', 18, 'ciprofloxacin/clindamycin/cephalexin'],
          ['Robert', 19, 'cephalexin']])
What I'm trying...
# get min/max of dates per id
minmax = df.groupby('id').agg({'start_date': min, 'end_date': max})
# create a multiindex manually from the mins and maxes
multi_index = []
for i, row in minmax.iterrows():
    for d in range(row.start_date, row.end_date + 1):
        multi_index.append((i, d))
# create the output dataframe with this multiindex
df_flat = pd.DataFrame(index=pd.MultiIndex.from_tuples(multi_index),
                       columns=['date', 'antibiotics'])
# and then fill df_flat from the original dataframe by matching df_flat's index
# against the date values of the original df, in a for loop :)
for i, row in df.iterrows():
    for tup in multi_index:
        if tup[1] >= row.start_date and ...
.
.
.
but this seems inefficient and inelegant. I'm sure something smarter can be done.
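For reference, the loop sketched above can be completed along the following lines. This is only a sketch to make the attempted approach concrete (the `records` dict and variable names are mine, not from the question); the answers below achieve the same result far more idiomatically:

```python
import pandas as pd

df = pd.DataFrame(columns=['id', 'antibiotic', 'start_date', 'end_date'],
                  data=[['Sophie', 'amoxicillin', 15, 17],
                        ['Sophie', 'doxycycline', 19, 21],
                        ['Sophie', 'amoxicillin', 20, 22],
                        ['Robert', 'cephalexin', 12, 14],
                        ['Robert', 'ciprofloxacin', 17, 18],
                        ['Robert', 'clindamycin', 18, 18],
                        ['Robert', 'cephalexin', 17, 19]])

# per-id span: one (id, day) tuple for every day between min start and max end
minmax = df.groupby('id').agg({'start_date': 'min', 'end_date': 'max'})
multi_index = [(i, d)
               for i, row in minmax.iterrows()
               for d in range(row['start_date'], row['end_date'] + 1)]

# walk the original rows, '/'-joining antibiotics that share an (id, day) key
records = {}
for _, row in df.iterrows():
    for d in range(row['start_date'], row['end_date'] + 1):
        key = (row['id'], d)
        records[key] = (records[key] + '/' + row['antibiotic']
                        if key in records else row['antibiotic'])

# reindex over the full span so uncovered days become NaN
df_flat = (pd.Series(records)
             .reindex(pd.MultiIndex.from_tuples(multi_index, names=['id', 'date']))
             .rename('antibiotic')
             .reset_index())
```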
Solution 1:[1]
One option is to generate a range for each row, explode to create one row per date, then aggregate per id/date:
(df.assign(date=lambda d: d.apply(lambda r: range(r['start_date'], r['end_date']+1), axis=1))
.explode('date')
.groupby(['id', 'date'], dropna=False)['antibiotic'].agg('/'.join)
.reset_index()
)
output:
id date antibiotic
0 Robert 12 cephalexin
1 Robert 13 cephalexin
2 Robert 14 cephalexin
3 Robert 17 ciprofloxacin/cephalexin
4 Robert 18 ciprofloxacin/clindamycin/cephalexin
5 Robert 19 cephalexin
6 Sophie 15 amoxicillin
7 Sophie 16 amoxicillin
8 Sophie 17 amoxicillin
9 Sophie 19 doxycycline
10 Sophie 20 doxycycline/amoxicillin
11 Sophie 21 doxycycline/amoxicillin
12 Sophie 22 amoxicillin
keeping the NaNs:
(df.assign(date=lambda d: d.apply(lambda r: range(r['start_date'], r['end_date']+1), axis=1))
.explode('date')
.groupby(['id', 'date'], dropna=False)['antibiotic'].agg('/'.join)
.reset_index(level=0)
.groupby('id')['antibiotic']
.apply(lambda g: g.reindex(range(g.index.min(), g.index.max()+1)))
.reset_index()
)
output:
id date antibiotic
0 Robert 12 cephalexin
1 Robert 13 cephalexin
2 Robert 14 cephalexin
3 Robert 15 NaN
4 Robert 16 NaN
5 Robert 17 ciprofloxacin/cephalexin
6 Robert 18 ciprofloxacin/clindamycin/cephalexin
7 Robert 19 cephalexin
8 Sophie 15 amoxicillin
9 Sophie 16 amoxicillin
10 Sophie 17 amoxicillin
11 Sophie 18 NaN
12 Sophie 19 doxycycline
13 Sophie 20 doxycycline/amoxicillin
14 Sophie 21 doxycycline/amoxicillin
15 Sophie 22 amoxicillin
Alternatively, to get all days for all patients:
(df.assign(date=lambda d: d.apply(lambda r: range(r['start_date'], r['end_date']+1), axis=1))
.explode('date')
.groupby(['id', 'date'], dropna=False)['antibiotic'].agg('/'.join)
.unstack('date')
.stack('date', dropna=False).rename('antibiotic')
.reset_index()
)
output:
id date antibiotic
0 Robert 12 cephalexin
1 Robert 13 cephalexin
2 Robert 14 cephalexin
3 Robert 15 NaN
4 Robert 16 NaN
5 Robert 17 ciprofloxacin/cephalexin
6 Robert 18 ciprofloxacin/clindamycin/cephalexin
7 Robert 19 cephalexin
8 Robert 20 NaN
9 Robert 21 NaN
10 Robert 22 NaN
11 Sophie 12 NaN
12 Sophie 13 NaN
13 Sophie 14 NaN
14 Sophie 15 amoxicillin
15 Sophie 16 amoxicillin
16 Sophie 17 amoxicillin
17 Sophie 18 NaN
18 Sophie 19 doxycycline
19 Sophie 20 doxycycline/amoxicillin
20 Sophie 21 doxycycline/amoxicillin
21 Sophie 22 amoxicillin
Solution 2:[2]
First repeat each row as many times as its date range is long with Index.repeat, then build the date column by adding a per-row counter (cumcount) to start_date, aggregate the antibiotics with '/'.join, and finally reindex each id to add the missing dates:
df = df.loc[df.index.repeat(df['end_date'].sub(df['start_date']).add(1))].copy()
df['date'] = df['start_date'].add(df.groupby(level=0).cumcount())
df = (df.groupby(['id','date'], sort=False)['antibiotic'].agg('/'.join)
.reset_index(level=0)
.groupby('id', sort=False)['antibiotic']
.apply(lambda x: x.reindex(range(x.index.min(), x.index.max()+1)))
.reset_index()
)
print(df)
id date antibiotic
0 Sophie 15 amoxicillin
1 Sophie 16 amoxicillin
2 Sophie 17 amoxicillin
3 Sophie 18 NaN
4 Sophie 19 doxycycline
5 Sophie 20 doxycycline/amoxicillin
6 Sophie 21 doxycycline/amoxicillin
7 Sophie 22 amoxicillin
8 Robert 12 cephalexin
9 Robert 13 cephalexin
10 Robert 14 cephalexin
11 Robert 15 NaN
12 Robert 16 NaN
13 Robert 17 ciprofloxacin/cephalexin
14 Robert 18 ciprofloxacin/clindamycin/cephalexin
15 Robert 19 cephalexin
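To see what the repeat/cumcount preparation steps produce before the aggregation, here is a standalone sketch over the question's data (`rep` is my name for the intermediate frame, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame(columns=['id', 'antibiotic', 'start_date', 'end_date'],
                  data=[['Sophie', 'amoxicillin', 15, 17],
                        ['Sophie', 'doxycycline', 19, 21],
                        ['Sophie', 'amoxicillin', 20, 22],
                        ['Robert', 'cephalexin', 12, 14],
                        ['Robert', 'ciprofloxacin', 17, 18],
                        ['Robert', 'clindamycin', 18, 18],
                        ['Robert', 'cephalexin', 17, 19]])

# each row is repeated once per day in its range ...
rep = df.loc[df.index.repeat(df['end_date'].sub(df['start_date']).add(1))].copy()
# ... and cumcount() numbers the repeats 0, 1, 2, ... within each original row,
# so start_date + counter enumerates the individual days
rep['date'] = rep['start_date'].add(rep.groupby(level=0).cumcount())
print(rep[['id', 'antibiotic', 'date']].to_string())
```

The first original row (Sophie, amoxicillin, 15-17) becomes three rows with dates 15, 16, 17, and so on for the others.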
Solution 3:[3]
You could also cross-merge the data against the full range of dates, keep only the rows where the date falls inside each prescription's interval, then group and join:
df = df.merge(
pd.RangeIndex(df['start_date'].min(), df['end_date'].max() + 1).to_series().rename('date'),
how='cross')
df = df[df['date'] <= df['end_date']]
df = df[df['date'] >= df['start_date']]
# df.sort_values(by=['id', 'date'], inplace=True)
final = df.groupby(['id', 'date'])['antibiotic'] \
.agg('/'.join) \
.reset_index()
Producing:
id date antibiotic
0 Robert 12 cephalexin
1 Robert 13 cephalexin
2 Robert 14 cephalexin
3 Robert 17 ciprofloxacin/cephalexin
4 Robert 18 cephalexin/ciprofloxacin/clindamycin
5 Robert 19 cephalexin
6 Sophie 15 amoxicillin
7 Sophie 16 amoxicillin
8 Sophie 17 amoxicillin
9 Sophie 19 doxycycline
10 Sophie 20 doxycycline/amoxicillin
11 Sophie 21 doxycycline/amoxicillin
12 Sophie 22 amoxicillin
Solution 4:[4]
Use:
def comb(row):
    out = ''
    for part in row:
        out += part + '/'
    return out[:-1]

df['temp'] = list(zip(df.start_date, df.end_date))
df.explode('temp').groupby(['id', 'temp'])['antibiotic'].apply(comb)
First combine the two date columns into a tuple column. Then explode on the new column. Then group by id and date using the custom join function. Note that exploding the (start_date, end_date) tuple only creates a row for each range's endpoints, so intermediate days (e.g. Robert's day 13) never appear in the result.
Result
id      temp
Robert  12                               cephalexin
        14                               cephalexin
        17                 ciprofloxacin/cephalexin
        18    ciprofloxacin/clindamycin/clindamycin
        19                               cephalexin
Sophie  15                              amoxicillin
        17                              amoxicillin
        19                              doxycycline
        20                              amoxicillin
        21                              doxycycline
        22                              amoxicillin
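Because the tuple only carries the two endpoints, this approach misses the days in between (Robert's day 13, for instance) and double-counts single-day ranges such as clindamycin's (18, 18). A small variation, shown here as a sketch rather than as part of the original answer, stores the full day range per row instead, so every day of each prescription appears exactly once:

```python
import pandas as pd

df = pd.DataFrame(columns=['id', 'antibiotic', 'start_date', 'end_date'],
                  data=[['Sophie', 'amoxicillin', 15, 17],
                        ['Sophie', 'doxycycline', 19, 21],
                        ['Sophie', 'amoxicillin', 20, 22],
                        ['Robert', 'cephalexin', 12, 14],
                        ['Robert', 'ciprofloxacin', 17, 18],
                        ['Robert', 'clindamycin', 18, 18],
                        ['Robert', 'cephalexin', 17, 19]])

def comb(group):
    # '/'-join every antibiotic active in the (id, date) group
    return '/'.join(group)

# store the whole day range per row, not just the (start, end) endpoints
df['temp'] = [list(range(s, e + 1)) for s, e in zip(df['start_date'], df['end_date'])]
result = df.explode('temp').groupby(['id', 'temp'])['antibiotic'].apply(comb)
```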
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | ifly6 |
| Solution 4 | Pab |
