'How to populate date in a dataframe using pandas in python

I have a dataframe with two columns, Case and Date. Here Date is actually the starting date. I want to populate it as a time series, saying add three (month_num) more dates to each case and removing the original ones.

original dataframe:

   Case       Date
0     1 2010-01-01
1     2 2011-04-01
2     3 2012-08-01

after populating dates:

   Case        Date
0     1  2010-02-01
1     1  2010-03-01
2     1  2010-04-01
3     2  2011-05-01
4     2  2011-06-01
5     2  2011-07-01
6     3  2012-09-01
7     3  2012-10-01
8     3  2012-11-01

I tried to declare an empty dataframe with the same column names and data type, and used for loop to loop over Case and month_num, and add rows into the new dataframe.

import pandas as pd

data = [[1, '2010-01-01'],  [2, '2011-04-01'], [3, '2012-08-01']]
 
df = pd.DataFrame(data, columns = ['Case', 'Date'])

df.Date = pd.to_datetime(df.Date)

df_new = pd.DataFrame(columns=df.columns)
df_new['Case'] = pd.to_numeric(df_new['Case'])
df_new['Date'] = pd.to_datetime(df_new['Date'])

month_num = 3

for c in df.Case:
    for m in range(1, month_num+1):
        temp = df.loc[df['Case']==c]
        temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
        df_new = pd.concat([df_new, temp]) 

df_new.reset_index(inplace=True, drop=True)

My code can work, however, when the original dataframe and month_num become large, it took huge time to run. Are there any better ways to do what I need? Thanks a alot!!



Solution 1:[1]

Given your input data this is what worked on my notebook:

df2=pd.DataFrame()

df2['Date']=df['Date'].apply(lambda x: pd.date_range(start=x, periods=3,freq='M')).explode()

df3=pd.merge_asof(df2,df,on='Date')
df3['Date']=df3['Date']+ pd.DateOffset(days=1)
df3[['Case','Date']]
  • We create a df2 to which we populate 'Date' with the needed dates coming from the original df
  • Then df3 resulting of a merge_asof between df2 and df (to populate the 'Case' column)
  • Finally , we offset the resulting column off 1 day

Solution 2:[2]

Your performance issue is probably related to the use of pd.concat inside the inner for loop. This answer explains why.

As the answer suggests, you may want to use an external list to collect all the dataframes you create in the for loop, and then concatenate once the list.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Daniel Weigel
Solution 2 mattiatantardini