'Fill in missing dates for a pandas dataframe with multiple series

I have a dataframe that contains multiple time series, like so:

Date	Item	Category
2021-01-01	gouda	cheese
2021-01-02	gouda	cheese
2021-01-04	gouda	cheese
2021-01-05	gouda	cheese
2021-02-01	lettuce	produce
2021-02-02	lettuce	produce
2021-02-03	lettuce	produce
2021-02-05	lettuce	produce

I'd like to add rows for the missing dates (ex. 2021-01-03 for gouda, 2021-02-04 for lettuce). Note that these series do not necessarily start and end on the same date.

What is the best way to do this in pandas? I'd also like fill the new rows with the values in the "item" and "category" column for that series.

python pandas

Solution 1:^[1]

Group by Item and Category, then generate a time series from the min to the max date:

result = (
    df.groupby(["Item", "Category"])["Date"]
    .apply(lambda s: pd.date_range(s.min(), s.max()))
    .explode()
    .reset_index()
)

Solution 2:^[2]

You can do resample

df['Date'] = pd.to_datetime(df['Date'])
df['Y-m'] = df['Date'].dt.strftime('%y-%m')
out = df.groupby('Y-m').apply(lambda x : x.set_index('Date').resample('D').ffill()).reset_index(level=1)

Solution 3:^[3]

This is far from optimal, but it is how I would do in order to ensure all categories and items are within the min and max periods, and all ranges are filled:

aux = []
for x in df['Item'].unique():
  _ = pd.DataFrame({'Date':pd.date_range(df[df['Item']==x]['Date'].min(),df[df['Item']==x]['Date'].max(),freq='d')})
  _['Item'] = x
  _['Category'] = df[df['Item']==x]['Category'].values[0]
  aux.append(_)
output = pd.concat(aux)

Consider this sample dataset:

df = pd.DataFrame({'Date':['2021-01-01','2021-01-02','2021-01-04','2021-01-05','2021-01-01','2021-01-02','2021-01-04','2021-01-05'],
                   'Item':['gouda','gouda','gouda','gouda','lettuce','lettuce','lettuce','lettuce'],
                   'Category':['cheese','cheese','cheese','cheese','produce','produce','produce','produce']})
df['Date'] = pd.to_datetime(df['Date'],infer_datetime_format=True)

Outputs:

        Date     Item Category
0 2021-01-01    gouda   cheese
1 2021-01-02    gouda   cheese
2 2021-01-03    gouda   cheese
3 2021-01-04    gouda   cheese
4 2021-01-05    gouda   cheese
0 2021-01-01  lettuce  produce
1 2021-01-02  lettuce  produce
2 2021-01-03  lettuce  produce
3 2021-01-04  lettuce  produce
4 2021-01-05  lettuce  produce

Solution 4:^[4]

One option is with the complete function from pyjanitor to explicitly generate missing rows:

# pip install pyjanitor
import pandas as pd
import janitor

df.complete(
    {'Date': lambda date: pd.date_range(date.min(), date.max())}, 
    by = ['Item', 'Category'], 
    sort = True)
 
        Date     Item Category
0 2021-01-01    gouda   cheese
1 2021-01-02    gouda   cheese
2 2021-01-03    gouda   cheese
3 2021-01-04    gouda   cheese
4 2021-01-05    gouda   cheese
5 2021-02-01  lettuce  produce
6 2021-02-02  lettuce  produce
7 2021-02-03  lettuce  produce
8 2021-02-04  lettuce  produce
9 2021-02-05  lettuce  produce

The dictionary helps introduce values into the dataframe. The key of the dataframe should be an existing column; the lambda function refers to the Date column

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2	BENY
Solution 3	Celius Stingher
Solution 4	sammywemmy

'Fill in missing dates for a pandas dataframe with multiple series

Solution 1:[1]

Solution 2:[2]

Solution 3:[3]

Solution 4:[4]

Sources

Related Questions

Solution 1:^[1]

Solution 2:^[2]

Solution 3:^[3]

Solution 4:^[4]