'How to add all missing combination in a dataframe
I have a dataframe of around 15000 rows, something like as below,
dataframe =
date | value | count | dayofWeek
01/01/2021 | 250 | 1 | 0
01/01/2021 | 235 | 2 | 0
01/01/2021 | 244 | 3 | 0
..
..
01/01/2021 | 255 | 25 | 0
02/01/2021 | 225 | 1 | 1
02/01/2021 | 230 | 2 | 1
..
..
02/01/2021 | 275 | 20 | 1
03/01/2021 | 265 | 1 | 2
..
..
The dataframe has some missing data for the count.As for each date, there should be 25 count, if the count number is not present, we want to add the data with value-0 , and dayofWeek as per the date. There will be always atleast 10 count present for a specific date.
I don't know where to start as I am new to pandas. Can someone help.
Solution 1:[1]
Create the complementary dataframe them merge it to your original dataframe
N = 5 # In your case 25
# First find the missing count for each date
cnt = df['date'].value_counts().loc[lambda x: x < N] \
.apply(lambda x: range(x+1, N+1)).explode()
# Create the dataframe
dfc = pd.DataFrame({'date': cnt.index, 'value': 0, 'count': cnt.values,
'dayofWeek': pd.to_datetime(cnt.index, dayfirst=True).dayofweek})
# Concatenate them:
out = pd.concat([df, dfc]).sort_values(['date', 'count']).reset_index(drop=True)
Output:
>>> out
date value count dayofWeek
0 01/01/2021 250 1 4
1 01/01/2021 235 2 4
2 01/01/2021 244 3 4
3 01/01/2021 244 4 4
4 01/01/2021 244 5 4
5 02/01/2021 225 1 5
6 02/01/2021 230 2 5
7 02/01/2021 275 3 5
8 02/01/2021 0 4 5 # <- Added
9 02/01/2021 0 5 5 # <- Added
10 03/01/2021 265 1 6
11 03/01/2021 0 2 6 # <- Added
12 03/01/2021 0 3 6 # <- Added
13 03/01/2021 0 4 6 # <- Added
14 03/01/2021 0 5 6 # <- Added
Setup MRE
data = {'date': ['01/01/2021']*5 + ['02/01/2021']*3 + ['03/01/2021'],
'value': [250, 235, 244, 244, 244, 225, 230, 275, 265],
'count': [1, 2, 3, 4, 5, 1, 2, 3, 1],
'dayofWeek': [4, 4, 4, 4, 4, 5, 5, 5, 6]}
df = pd.DataFrame(data)
print(df)
# Output
date value count dayofWeek
0 01/01/2021 250 1 4
1 01/01/2021 235 2 4
2 01/01/2021 244 3 4
3 01/01/2021 244 4 4
4 01/01/2021 244 5 4 # <- Full, do nothing
5 02/01/2021 225 1 5
6 02/01/2021 230 2 5
7 02/01/2021 275 3 5 # <- Missing 4 and 5
8 03/01/2021 265 1 6 # <- Missing from 2 to 5
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Corralien |
