'Fill in missing dates for a pandas dataframe with multiple series
I have a dataframe that contains multiple time series, like so:
| Date | Item | Category |
|---|---|---|
| 2021-01-01 | gouda | cheese |
| 2021-01-02 | gouda | cheese |
| 2021-01-04 | gouda | cheese |
| 2021-01-05 | gouda | cheese |
| 2021-02-01 | lettuce | produce |
| 2021-02-02 | lettuce | produce |
| 2021-02-03 | lettuce | produce |
| 2021-02-05 | lettuce | produce |
I'd like to add rows for the missing dates (ex. 2021-01-03 for gouda, 2021-02-04 for lettuce). Note that these series do not necessarily start and end on the same date.
What is the best way to do this in pandas? I'd also like fill the new rows with the values in the "item" and "category" column for that series.
Solution 1:[1]
Group by Item and Category, then generate a time series from the min to the max date:
result = (
df.groupby(["Item", "Category"])["Date"]
.apply(lambda s: pd.date_range(s.min(), s.max()))
.explode()
.reset_index()
)
Solution 2:[2]
You can do resample
df['Date'] = pd.to_datetime(df['Date'])
df['Y-m'] = df['Date'].dt.strftime('%y-%m')
out = df.groupby('Y-m').apply(lambda x : x.set_index('Date').resample('D').ffill()).reset_index(level=1)
Solution 3:[3]
This is far from optimal, but it is how I would do in order to ensure all categories and items are within the min and max periods, and all ranges are filled:
aux = []
for x in df['Item'].unique():
_ = pd.DataFrame({'Date':pd.date_range(df[df['Item']==x]['Date'].min(),df[df['Item']==x]['Date'].max(),freq='d')})
_['Item'] = x
_['Category'] = df[df['Item']==x]['Category'].values[0]
aux.append(_)
output = pd.concat(aux)
Consider this sample dataset:
df = pd.DataFrame({'Date':['2021-01-01','2021-01-02','2021-01-04','2021-01-05','2021-01-01','2021-01-02','2021-01-04','2021-01-05'],
'Item':['gouda','gouda','gouda','gouda','lettuce','lettuce','lettuce','lettuce'],
'Category':['cheese','cheese','cheese','cheese','produce','produce','produce','produce']})
df['Date'] = pd.to_datetime(df['Date'],infer_datetime_format=True)
Outputs:
Date Item Category
0 2021-01-01 gouda cheese
1 2021-01-02 gouda cheese
2 2021-01-03 gouda cheese
3 2021-01-04 gouda cheese
4 2021-01-05 gouda cheese
0 2021-01-01 lettuce produce
1 2021-01-02 lettuce produce
2 2021-01-03 lettuce produce
3 2021-01-04 lettuce produce
4 2021-01-05 lettuce produce
Solution 4:[4]
One option is with the complete function from pyjanitor to explicitly generate missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
df.complete(
{'Date': lambda date: pd.date_range(date.min(), date.max())},
by = ['Item', 'Category'],
sort = True)
Date Item Category
0 2021-01-01 gouda cheese
1 2021-01-02 gouda cheese
2 2021-01-03 gouda cheese
3 2021-01-04 gouda cheese
4 2021-01-05 gouda cheese
5 2021-02-01 lettuce produce
6 2021-02-02 lettuce produce
7 2021-02-03 lettuce produce
8 2021-02-04 lettuce produce
9 2021-02-05 lettuce produce
The dictionary helps introduce values into the dataframe. The key of the dataframe should be an existing column; the lambda function refers to the Date column
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | BENY |
| Solution 3 | Celius Stingher |
| Solution 4 | sammywemmy |
