'How to add all missing combination in a dataframe

I have a dataframe of around 15000 rows, something like as below,

dataframe = 
date        |  value  |  count  |  dayofWeek
01/01/2021  |  250    |   1     |    0
01/01/2021  |  235    |   2     |    0
01/01/2021  |  244    |   3     |    0
..
..
01/01/2021  |  255    |  25     |    0
02/01/2021  |  225    |   1     |    1
02/01/2021  |  230    |   2     |    1
..
..
02/01/2021  |  275    |   20    |    1
03/01/2021  |  265    |   1     |    2
..
..

The dataframe has some missing data for the count.As for each date, there should be 25 count, if the count number is not present, we want to add the data with value-0 , and dayofWeek as per the date. There will be always atleast 10 count present for a specific date.

I don't know where to start as I am new to pandas. Can someone help.



Solution 1:[1]

Create the complementary dataframe them merge it to your original dataframe

N = 5  # In your case 25

# First find the missing count for each date
cnt = df['date'].value_counts().loc[lambda x: x < N] \
               .apply(lambda x: range(x+1, N+1)).explode()

# Create the dataframe
dfc = pd.DataFrame({'date': cnt.index, 'value': 0, 'count': cnt.values, 
                    'dayofWeek': pd.to_datetime(cnt.index, dayfirst=True).dayofweek})

# Concatenate them:
out = pd.concat([df, dfc]).sort_values(['date', 'count']).reset_index(drop=True)

Output:

>>> out
          date  value count  dayofWeek
0   01/01/2021    250     1          4
1   01/01/2021    235     2          4
2   01/01/2021    244     3          4
3   01/01/2021    244     4          4
4   01/01/2021    244     5          4
5   02/01/2021    225     1          5
6   02/01/2021    230     2          5
7   02/01/2021    275     3          5
8   02/01/2021      0     4          5  # <- Added
9   02/01/2021      0     5          5  # <- Added
10  03/01/2021    265     1          6
11  03/01/2021      0     2          6  # <- Added
12  03/01/2021      0     3          6  # <- Added
13  03/01/2021      0     4          6  # <- Added
14  03/01/2021      0     5          6  # <- Added

Setup MRE

data = {'date': ['01/01/2021']*5 + ['02/01/2021']*3 + ['03/01/2021'],
        'value': [250, 235, 244, 244, 244, 225, 230, 275, 265],
        'count': [1, 2, 3, 4, 5, 1, 2, 3, 1],
        'dayofWeek': [4, 4, 4, 4, 4, 5, 5, 5, 6]}
df = pd.DataFrame(data)
print(df)

# Output
         date  value  count  dayofWeek
0  01/01/2021    250      1          4
1  01/01/2021    235      2          4
2  01/01/2021    244      3          4
3  01/01/2021    244      4          4
4  01/01/2021    244      5          4  # <- Full, do nothing
5  02/01/2021    225      1          5
6  02/01/2021    230      2          5
7  02/01/2021    275      3          5  # <- Missing 4 and 5
8  03/01/2021    265      1          6  # <- Missing from 2 to 5

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Corralien