Filling missing dates on a DataFrame across different groups
Let's say I have the following DataFrame in pandas:
| date | customer | attended |
|---|---|---|
| 2022-01-01 | John | True |
| 2022-01-02 | John | True |
| 2022-01-04 | John | True |
| 2022-01-05 | Mark | True |
What transformations could I apply to fill in the gaps in the dates at a given frequency (daily in this case, but it could be any other) so that it ends up like this:
| date | customer | attended |
|---|---|---|
| 2022-01-01 | John | True |
| 2022-01-01 | Mark | False |
| 2022-01-02 | John | True |
| 2022-01-02 | Mark | False |
| 2022-01-03 | John | False |
| 2022-01-03 | Mark | False |
| 2022-01-04 | John | True |
| 2022-01-04 | Mark | False |
| 2022-01-05 | John | False |
| 2022-01-05 | Mark | True |
For a single customer this is easy: filter on that customer and do an outer join with another DataFrame that holds all the dates, which fills the missing rows with NaNs. What I can't work out is how to do it for every value in the customer column at once, which is what I need. I'd appreciate an approach that is as computationally efficient as possible and doesn't involve convoluted iteration over the dataset.
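To make the single-customer approach concrete, here is a minimal sketch. It rebuilds the sample table from the question; the names `all_dates`, `john` and `john_filled` are made up for illustration and are not from the original post.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05"]
        ),
        "customer": ["John", "John", "John", "Mark"],
        "attended": [True, True, True, True],
    }
)

# Every date in the overall range at the chosen frequency (daily here)
all_dates = pd.DataFrame(
    {"date": pd.date_range(df["date"].min(), df["date"].max(), freq="D")}
)

# Single-customer version: filter, then outer-join against the full date list.
# Dates with no row for John come back with NaN in customer and attended.
john = df[df["customer"] == "John"]
john_filled = all_dates.merge(john, on="date", how="outer")
```

This handles one customer only; repeating it for each customer is the loop the question is trying to avoid.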
Solution 1:[1]
Here's one way (df your dataframe):
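import pandas as pd
df.date = pd.to_datetime(df.date)  # Just in case the column holds strings
customers = df.customer.unique()
# Build every (date, customer) combination over the full date range
df_fill = pd.DataFrame(
    (
        [day, customer]
        for day in pd.date_range(df.date.min(), df.date.max(), freq="D")
        for customer in customers
    ),
    columns=["date", "customer"]
)
# Left join keeps every combination; unmatched rows get NaN in attended
df = df_fill.merge(df, on=["date", "customer"], how="left")
df.attended = df.attended.fillna(False)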
Output:
date customer attended
0 2022-01-01 John True
1 2022-01-01 Mark False
2 2022-01-02 John True
3 2022-01-02 Mark False
4 2022-01-03 John False
5 2022-01-03 Mark False
6 2022-01-04 John True
7 2022-01-04 Mark False
8 2022-01-05 John False
9 2022-01-05 Mark True
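A related variant, not part of the original answer, builds the same date-by-customer grid as a `MultiIndex` and reindexes onto it. The sketch below assumes each (date, customer) pair appears at most once, since `reindex` does not accept duplicate index entries; the helper name `fill_missing_dates` and its `freq` parameter are made up for illustration.

```python
import pandas as pd


def fill_missing_dates(df, freq="D"):
    """Return one row per (date, customer) pair, with attended=False for gaps."""
    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])
    # Cartesian product of all dates in the range and all customers
    full_index = pd.MultiIndex.from_product(
        [
            pd.date_range(df["date"].min(), df["date"].max(), freq=freq),
            df["customer"].unique(),
        ],
        names=["date", "customer"],
    )
    return (
        df.set_index(["date", "customer"])
        .reindex(full_index, fill_value=False)  # missing pairs become False
        .reset_index()
    )


# Sample data from the question
df = pd.DataFrame(
    {
        "date": ["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05"],
        "customer": ["John", "John", "John", "Mark"],
        "attended": [True, True, True, True],
    }
)
result = fill_missing_dates(df, freq="D")
```

One caveat if `freq` is changed: rows whose dates do not fall on the generated grid are dropped by the reindex, so the dates would likely need to be aligned to that frequency first.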
Solution 2:[2]
One option is the complete function from pyjanitor, which exposes the implicitly missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
# generate the full range of dates
new_dates = {"date": pd.date_range(df.date.min(), df.date.max(), freq="D")}
df.complete(new_dates, "customer", sort=True).fillna({"attended": False})
date customer attended
0 2022-01-01 John True
1 2022-01-01 Mark False
2 2022-01-02 John True
3 2022-01-02 Mark False
4 2022-01-03 John False
5 2022-01-03 Mark False
6 2022-01-04 John True
7 2022-01-04 Mark False
8 2022-01-05 John False
9 2022-01-05 Mark True
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Timus |
| Solution 2 | sammywemmy |
