How to flatten a dataframe by a column containing ranges

Input dataframe:

df = pd.DataFrame(columns=['id', 'antibiotic', 'start_date', 'end_date'],
                  data=[['Sophie', 'amoxicillin', 15, 17],
                        ['Sophie', 'doxycycline', 19, 21],
                        ['Sophie', 'amoxicillin', 20, 22],
                        ['Robert', 'cephalexin', 12, 14],
                        ['Robert', 'ciprofloxacin', 17, 18],
                        ['Robert', 'clindamycin', 18, 18],
                        ['Robert', 'cephalexin', 17, 19]])

I would like to flatten/expand the date ranges into one row per day, and join the antibiotic values with '/' when they overlap on the same date, like below:

df_flat = pd.DataFrame(columns=['id', 'date', 'antibiotic'],
                       data=[['Sophie', 15, 'amoxicillin'],
                             ['Sophie', 16, 'amoxicillin'],
                             ['Sophie', 17, 'amoxicillin'],
                             ['Sophie', 18, NaN],
                             ['Sophie', 19, 'doxycycline'],
                             ['Sophie', 20, 'doxycycline/amoxicillin'],
                             ['Sophie', 21, 'doxycycline/amoxicillin'],
                             ['Sophie', 22, 'amoxicillin'],

                             ['Robert', 12, 'cephalexin'],
                             ['Robert', 13, 'cephalexin'],
                             ['Robert', 14, 'cephalexin'],
                             ['Robert', 15, NaN],
                             ['Robert', 16, NaN],
                             ['Robert', 17, 'ciprofloxacin/cephalexin'],
                             ['Robert', 18, 'ciprofloxacin/clindamycin/cephalexin'],
                             ['Robert', 19, 'cephalexin']])

What I'm trying...

# get min/max of dates per id
minmax = df.groupby('id').agg({'start_date':min,'end_date':max})

#create multiindex manually with mins and maxes
multi_index = []
for i, row in minmax.iterrows():
    for d in range(row.start_date, row.end_date + 1):  # inclusive end date
        tup = (i, d)
        multi_index.append(tup)

# create output dataframe with this multiindex
df_flat = pd.DataFrame(index=pd.MultiIndex.from_tuples(multi_index),\
 columns=['date','antibiotics'])

# And then fill up this df_flat using the original dataframe by matching the index of df_flat 
# with the date values of original df, in a for loop :)

for i, row in df.iterrows():
    for tup in multi_index:
        if tup[1] >= row.start_date and ...
        .
        .
        .

but this seems inefficient and inelegant. I'm sure something smarter can be done.
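For reference, the loop-based approach sketched above can be completed roughly as follows (a minimal sketch of the question's own idea; the vectorized answers below are far more efficient):

```python
import pandas as pd

df = pd.DataFrame(columns=['id', 'antibiotic', 'start_date', 'end_date'],
                  data=[['Sophie', 'amoxicillin', 15, 17],
                        ['Sophie', 'doxycycline', 19, 21],
                        ['Sophie', 'amoxicillin', 20, 22],
                        ['Robert', 'cephalexin', 12, 14],
                        ['Robert', 'ciprofloxacin', 17, 18],
                        ['Robert', 'clindamycin', 18, 18],
                        ['Robert', 'cephalexin', 17, 19]])

# min/max date per id, then one (id, date) pair per day in the span
minmax = df.groupby('id').agg({'start_date': 'min', 'end_date': 'max'})
multi_index = [(i, d)
               for i, row in minmax.iterrows()
               for d in range(row.start_date, row.end_date + 1)]  # +1: inclusive end

# for each (id, date) pair, join all antibiotics active on that day
rows = []
for pid, d in multi_index:
    match = df[(df['id'] == pid)
               & (df['start_date'] <= d)
               & (df['end_date'] >= d)]
    rows.append([pid, d, '/'.join(match['antibiotic']) or float('nan')])

df_flat = pd.DataFrame(rows, columns=['id', 'date', 'antibiotic'])
```

This produces one row per day per patient, with NaN for gap days, but it scans the whole frame once per day, which is exactly the inefficiency the question is trying to avoid.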



Solution 1:[1]

One option is to generate a range for each row, explode to create one row per date, then aggregate per id/date:

(df.assign(date=lambda d: d.apply(lambda r: range(r['start_date'], r['end_date']+1), axis=1))
   .explode('date')
   .groupby(['id', 'date'], dropna=False)['antibiotic'].agg('/'.join)
   .reset_index()
)

output:

        id  date                            antibiotic
0   Robert    12                            cephalexin
1   Robert    13                            cephalexin
2   Robert    14                            cephalexin
3   Robert    17              ciprofloxacin/cephalexin
4   Robert    18  ciprofloxacin/clindamycin/cephalexin
5   Robert    19                            cephalexin
6   Sophie    15                           amoxicillin
7   Sophie    16                           amoxicillin
8   Sophie    17                           amoxicillin
9   Sophie    19                           doxycycline
10  Sophie    20               doxycycline/amoxicillin
11  Sophie    21               doxycycline/amoxicillin
12  Sophie    22                           amoxicillin
keeping the NaNs:
(df.assign(date=lambda d: d.apply(lambda r: range(r['start_date'], r['end_date']+1), axis=1))
   .explode('date')
   .groupby(['id', 'date'], dropna=False)['antibiotic'].agg('/'.join)
   .reset_index(level=0)
   .groupby('id')['antibiotic']
   .apply(lambda g: g.reindex(range(g.index.min(), g.index.max()+1)))
   .reset_index()
)

output:

        id  date                            antibiotic
0   Robert    12                            cephalexin
1   Robert    13                            cephalexin
2   Robert    14                            cephalexin
3   Robert    15                                   NaN
4   Robert    16                                   NaN
5   Robert    17              ciprofloxacin/cephalexin
6   Robert    18  ciprofloxacin/clindamycin/cephalexin
7   Robert    19                            cephalexin
8   Sophie    15                           amoxicillin
9   Sophie    16                           amoxicillin
10  Sophie    17                           amoxicillin
11  Sophie    18                                   NaN
12  Sophie    19                           doxycycline
13  Sophie    20               doxycycline/amoxicillin
14  Sophie    21               doxycycline/amoxicillin
15  Sophie    22                           amoxicillin
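The per-group `reindex` in the step above is what reintroduces the gap days; in isolation (a small illustrative example, not part of the original answer):

```python
import pandas as pd

# a group's antibiotic values indexed by date, with day 16 missing
s = pd.Series(['a', 'b'], index=[15, 17], name='antibiotic')

# reindexing over the full inclusive span inserts NaN for the missing days
full = s.reindex(range(s.index.min(), s.index.max() + 1))
print(full.tolist())  # ['a', nan, 'b']
```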

An alternative that produces all days for all patients:

(df.assign(date=lambda d: d.apply(lambda r: range(r['start_date'], r['end_date']+1), axis=1))
   .explode('date')
   .groupby(['id', 'date'], dropna=False)['antibiotic'].agg('/'.join)
   .unstack('date')
   .stack('date', dropna=False).rename('antibiotic')
   .reset_index()
)

output:

        id  date                            antibiotic
0   Robert    12                            cephalexin
1   Robert    13                            cephalexin
2   Robert    14                            cephalexin
3   Robert    15                                   NaN
4   Robert    16                                   NaN
5   Robert    17              ciprofloxacin/cephalexin
6   Robert    18  ciprofloxacin/clindamycin/cephalexin
7   Robert    19                            cephalexin
8   Robert    20                                   NaN
9   Robert    21                                   NaN
10  Robert    22                                   NaN
11  Sophie    12                                   NaN
12  Sophie    13                                   NaN
13  Sophie    14                                   NaN
14  Sophie    15                           amoxicillin
15  Sophie    16                           amoxicillin
16  Sophie    17                           amoxicillin
17  Sophie    18                                   NaN
18  Sophie    19                           doxycycline
19  Sophie    20               doxycycline/amoxicillin
20  Sophie    21               doxycycline/amoxicillin
21  Sophie    22                           amoxicillin
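One caveat worth noting (my addition, not part of the original answer): `explode` leaves the new `date` column with `object` dtype even though the values are integers, so an explicit cast may be needed before further numeric work:

```python
import pandas as pd

df = pd.DataFrame({'id': ['Sophie'], 'start_date': [15], 'end_date': [17]})
exploded = (df.assign(date=lambda d: d.apply(
                lambda r: range(r['start_date'], r['end_date'] + 1), axis=1))
              .explode('date'))
print(exploded['date'].dtype)  # object: explode does not infer a numeric dtype
exploded['date'] = exploded['date'].astype(int)
```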

Solution 2:[2]

First, repeat each row by the number of days it spans using Index.repeat, then add a per-group counter (cumcount) to the start_date column to get one date per row, aggregate with a '/' join, and finally reindex to add the missing date ranges:

df = df.loc[df.index.repeat(df['end_date'].sub(df['start_date']).add(1))].copy()
df['date'] = df['start_date'].add(df.groupby(level=0).cumcount())

df = (df.groupby(['id','date'], sort=False)['antibiotic'].agg('/'.join)
        .reset_index(level=0)
        .groupby('id', sort=False)['antibiotic']
        .apply(lambda x: x.reindex(range(x.index.min(), x.index.max()+1)))
        .reset_index()
        )
       
print (df)
        id  date                            antibiotic
0   Sophie    15                           amoxicillin
1   Sophie    16                           amoxicillin
2   Sophie    17                           amoxicillin
3   Sophie    18                                   NaN
4   Sophie    19                           doxycycline
5   Sophie    20               doxycycline/amoxicillin
6   Sophie    21               doxycycline/amoxicillin
7   Sophie    22                           amoxicillin
8   Robert    12                            cephalexin
9   Robert    13                            cephalexin
10  Robert    14                            cephalexin
11  Robert    15                                   NaN
12  Robert    16                                   NaN
13  Robert    17              ciprofloxacin/cephalexin
14  Robert    18  ciprofloxacin/clindamycin/cephalexin
15  Robert    19                            cephalexin
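The core trick here, repeating each row by its span length and adding a per-row counter, can be seen in isolation (a small illustrative example, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({'start_date': [15, 19], 'end_date': [17, 20]})

# repeat each row (end - start + 1) times; the index labels repeat too
rep = df.loc[df.index.repeat(df['end_date'].sub(df['start_date']).add(1))]

# groupby(level=0).cumcount() counts 0, 1, 2, ... within each repeated label,
# so adding it to start_date enumerates every day of the span
rep = rep.assign(date=rep['start_date'].add(rep.groupby(level=0).cumcount()))
print(rep['date'].tolist())  # [15, 16, 17, 19, 20]
```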

Solution 3:[3]

You could also cross-merge the data against the full range of dates, subset back to the rows whose date falls inside each interval, and then group by and join:

df = df.merge(
    pd.RangeIndex(df['start_date'].min(), df['end_date'].max() + 1).to_series().rename('date'),
    how='cross')
df = df[df['date'] <= df['end_date']]
df = df[df['date'] >= df['start_date']]
# df.sort_values(by=['id', 'date'], inplace=True)

final = df.groupby(['id', 'date'])['antibiotic'] \
    .agg('/'.join) \
    .reset_index()

Producing:

        id  date                            antibiotic
0   Robert    12                            cephalexin
1   Robert    13                            cephalexin
2   Robert    14                            cephalexin
3   Robert    17              ciprofloxacin/cephalexin
4   Robert    18  cephalexin/ciprofloxacin/clindamycin
5   Robert    19                            cephalexin
6   Sophie    15                           amoxicillin
7   Sophie    16                           amoxicillin
8   Sophie    17                           amoxicillin
9   Sophie    19                           doxycycline
10  Sophie    20               doxycycline/amoxicillin
11  Sophie    21               doxycycline/amoxicillin
12  Sophie    22                           amoxicillin
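Note that the cross join materializes len(df) × number-of-candidate-dates rows before filtering, which can be costly for wide date ranges. A quick check of the intermediate size (illustrative, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b'], 'start_date': [1, 3], 'end_date': [2, 5]})
dates = (pd.RangeIndex(df['start_date'].min(), df['end_date'].max() + 1)
           .to_series().rename('date'))

# cross merge: every row paired with every candidate date
crossed = df.merge(dates, how='cross')
print(len(crossed))  # 2 rows x 5 candidate dates = 10 before filtering

kept = crossed[(crossed['date'] >= crossed['start_date'])
               & (crossed['date'] <= crossed['end_date'])]
print(len(kept))     # 2 + 3 = 5 rows after filtering
```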

Solution 4:[4]

Use:

def comb(row):
    # join the values with '/', dropping the trailing separator
    out = ''
    for part in row:
        out += part
        out += '/'
    return out[:-1]

df['temp'] = list(zip(df.start_date, df.end_date))
df.explode('temp').groupby(['id', 'temp'])['antibiotic'].apply(comb)

First combine the two date columns into a tuple column. Then explode on the new column, and finally group by and aggregate using the custom function.

Result

id      temp
Robert  12                                 cephalexin
        14                                 cephalexin
        17                   ciprofloxacin/cephalexin
        18      ciprofloxacin/clindamycin/clindamycin
        19                                 cephalexin
Sophie  15                                amoxicillin
        17                                amoxicillin
        19                                doxycycline
        20                                amoxicillin
        21                                doxycycline
        22                                amoxicillin
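A caveat (my addition, not part of the original answer): exploding a `(start, end)` tuple yields only the two endpoint rows, not every day in between, so this approach misses the intermediate dates. To cover the full span, the column must hold the whole range before exploding:

```python
import pandas as pd

df = pd.DataFrame(columns=['id', 'antibiotic', 'start_date', 'end_date'],
                  data=[['Sophie', 'amoxicillin', 15, 17],
                        ['Sophie', 'doxycycline', 19, 21]])

# store the full inclusive range of days, not just the two endpoints
df['temp'] = [list(range(s, e + 1)) for s, e in zip(df.start_date, df.end_date)]
out = df.explode('temp').groupby(['id', 'temp'])['antibiotic'].agg('/'.join)
print(out)
```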

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution  Source
Solution 1
Solution 2
Solution 3  ifly6
Solution 4  Pab