Pandas fill in missing monthly dates in DataFrame, fill up one specific column with zeros
I am having trouble filling in missing dates in a pandas DataFrame. The DataFrame is structured as follows:
Amount Code Type Date
0 34.97 J36J 74343 2016-01-01
1 16.32 J36J 74343 2016-04-01
2 10.30 J36J 69927 2015-12-01
3 10.45 J36J 69927 2016-07-01
4 5.63 J36J 69927 2017-03-01
5 15.79 J36J 69927 2018-09-01
6 15.00 J36J 69927 2019-06-01
7 6.44 J36J 69926 2016-03-01
8 6.47 J36J 69926 2017-03-01
9 15.00 J36J 69926 2018-07-01
10 15.00 J36J 69926 2019-06-01
- Amount: the amount
- Code: the product code, which is the same throughout the entire DataFrame
- Type: the product type; there are many different ones
- Date: dates spanning December 2015 to September 2020
My goal is to have a monthly entry for every Type covering this timespan, meaning every Type should have 58 entries. The 'artificially' created monthly entries should have an Amount of 0. So my expected output would be (for one Type and one year only, as an example):
Amount Code Type Date
0 34.97 J36J 74343 2016-01-01
1 0 J36J 74343 2016-02-01
2 0 J36J 74343 2016-03-01
3 16.32 J36J 74343 2016-04-01
4 0 J36J 74343 2016-05-01
5 0 J36J 74343 2016-06-01
6 0 J36J 74343 2016-07-01
7 0 J36J 74343 2016-08-01
8 0 J36J 74343 2016-09-01
9 0 J36J 74343 2016-10-01
10 0 J36J 74343 2016-11-01
11 0 J36J 74343 2016-12-01
Fortunately, somebody had already asked the same question (Pandas fill in missing dates in DataFrame with multiple columns).
I adapted the quite helpful answer to my case:
df.Date=pd.to_datetime(df.Date)
s=pd.date_range(df.Date.min(),df.Date.max(),freq='MS')
df=df.set_index(['Code','Type','Date']).\
Amount.unstack().reindex(columns=s,fill_value=0).stack().reset_index()
df
This mostly worked, but when I checked the resulting DataFrame afterwards, some of the dates were missing:
398 74343 J36J 2016-01-01 34.97
399 74343 J36J 2016-02-01 0.00
400 74343 J36J 2016-04-01 16.32
401 74343 J36J 2016-05-01 0.00
402 74343 J36J 2016-06-01 0.00
403 74343 J36J 2016-08-01 0.00
404 74343 J36J 2016-10-01 0.00
405 74343 J36J 2016-11-01 0.00
406 74343 J36J 2016-12-01 0.00
Does anyone know what the reason for this could be? I assume it may be the frequency ('MS') I chose, but I cannot think of another one that would fit (https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html). Or do I have to set the date range manually? Obviously, not all dates are present in my initial DataFrame.
Any help on that matter is appreciated.
BR
Solution 1:[1]
One option is to use the complete function from pyjanitor to expose the implicitly missing rows; afterwards you can fill with fillna:
# pip install pyjanitor
import pandas as pd
import janitor
# create a dictionary of dates, spanning from December 2015 to September 2020
new_dates = {"Date": pd.date_range(start = "2015-12-01", end = "2020-09-01", freq="MS")}
# apply with `complete` for each group of `Type` and `Code`,
# and fill the null values in `Amount` with 0:
df.complete(new_dates, by=["Type", "Code"], sort=True).fillna({'Amount': 0})
Amount Code Type Date
0 0.0 J36J 9927 2015-12-01
1 0.0 J36J 9927 2016-01-01
2 0.0 J36J 9927 2016-02-01
3 0.0 J36J 9927 2016-03-01
4 0.0 J36J 9927 2016-04-01
.. ... ... ... ...
227 0.0 J36J 74343 2020-05-01
228 0.0 J36J 74343 2020-06-01
229 0.0 J36J 74343 2020-07-01
230 0.0 J36J 74343 2020-08-01
231 0.0 J36J 74343 2020-09-01
[232 rows x 4 columns]
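If you would rather stay in plain pandas, here is a minimal sketch of two alternatives. As far as I can tell, the dates go missing in the question's attempt because `reindex(columns=s, fill_value=0)` only fills columns that are newly added; dates that already exist for some other Type stay NaN after `unstack()`, and `stack()` has historically dropped NaN rows by default. The sample frame below is a small subset of the question's data, and the names `months`, `filled_a` and `filled_b` are made up for illustration. Both variants assume each (Code, Type, Date) combination occurs at most once.

```python
import pandas as pd

# Small subset of the question's data, for illustration only.
df = pd.DataFrame({
    "Amount": [34.97, 16.32, 10.30, 10.45, 6.44],
    "Code": ["J36J"] * 5,
    "Type": [74343, 74343, 69927, 69927, 69926],
    "Date": ["2016-01-01", "2016-04-01", "2015-12-01", "2016-07-01", "2016-03-01"],
})
df["Date"] = pd.to_datetime(df["Date"])

# The full range stated in the question: 58 month starts.
months = pd.date_range("2015-12-01", "2020-09-01", freq="MS", name="Date")

# Variant A: keep the unstack/stack approach, but fill the gaps at
# unstack time so no NaN is left for stack() to drop.
wide = df.set_index(["Code", "Type", "Date"])["Amount"].unstack(fill_value=0)
wide = wide.reindex(columns=months, fill_value=0)
filled_a = wide.stack().rename("Amount").reset_index()

# Variant B: build every (Code, Type, Date) combination up front and
# reindex against it. This assumes a single Code, as in the question;
# with several Codes, from_product would also create Code/Type pairs
# that never occur in the data.
full_index = pd.MultiIndex.from_product(
    [df["Code"].unique(), df["Type"].unique(), months],
    names=["Code", "Type", "Date"],
)
filled_b = (
    df.set_index(["Code", "Type", "Date"])["Amount"]
    .reindex(full_index, fill_value=0)
    .reset_index()
)
```

Either way, every Type should end up with 58 rows; `filled_b.groupby("Type").size()` is a quick way to check that.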
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | sammywemmy |
