Adding values for missing data combinations in Pandas
I've got a pandas data frame containing something like the following:
person_id status year count
0 'pass' 1980 4
0 'fail' 1982 1
1 'pass' 1981 2
If I know that all possible values for each field are:
all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]
I'd like to populate the original data frame with count=0 for missing data combinations (of person_id, status, and year), i.e. I'd like the new data frame to contain:
person_id status year count
0 'pass' 1980 4
0 'pass' 1981 0
0 'pass' 1982 0
0 'fail' 1980 0
0 'fail' 1981 0
0 'fail' 1982 1
1 'pass' 1980 0
1 'pass' 1981 2
1 'pass' 1982 0
1 'fail' 1980 0
1 'fail' 1981 0
1 'fail' 1982 0
2 'pass' 1980 0
2 'pass' 1981 0
2 'pass' 1982 0
2 'fail' 1980 0
2 'fail' 1981 0
2 'fail' 1982 0
Is there an efficient way to achieve this in pandas?
Solution 1:[1]
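Create a MultiIndex with MultiIndex.from_product(), then chain set_index(), reindex(), and reset_index(). Reindexing against the full index adds the missing combinations, and fill_value=0 sets their count to 0.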
import pandas as pd
import io
all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]
df = pd.read_csv(io.StringIO("""person_id status year count
0 pass 1980 4
0 fail 1982 1
1 pass 1981 2"""), sep=r"\s+")
names = ["person_id", "status", "year"]
mind = pd.MultiIndex.from_product(
    [all_person_ids, all_statuses, all_years], names=names)
df.set_index(names).reindex(mind, fill_value=0).reset_index()
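If the original frame can contain the same (person_id, status, year) combination more than once, reindex() generally will not accept the non-unique index. A minimal sketch of one way around that, assuming duplicate rows should simply be summed; the helper name fill_missing_combinations and its levels argument are made up for illustration and are not part of pandas:

```python
import pandas as pd


def fill_missing_combinations(df, levels, value_col="count", fill_value=0):
    """Return one row per combination of the values listed in `levels`.

    `levels` maps a column name to the list of all its allowed values.
    This helper is hypothetical, not a pandas API.
    """
    names = list(levels)
    full_index = pd.MultiIndex.from_product(list(levels.values()), names=names)
    # Assumption: rows sharing the same key should be summed, which also
    # guarantees a unique index before reindexing.
    summed = df.groupby(names)[value_col].sum()
    return summed.reindex(full_index, fill_value=fill_value).reset_index()


df = pd.DataFrame({
    "person_id": [0, 0, 1],
    "status": ["pass", "fail", "pass"],
    "year": [1980, 1982, 1981],
    "count": [4, 1, 2],
})
levels = {
    "person_id": [0, 1, 2],
    "status": ["pass", "fail"],
    "year": [1980, 1981, 1982],
}
result = fill_missing_combinations(df, levels)
print(result)
```

Because fill_value=0 is passed to reindex() directly, no NaN is ever introduced and count stays an integer column.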
Solution 2:[2]
You can use itertools.product to generate every combination, build a DataFrame from the result, left-merge it with your original df, and call fillna to set the missing count values to 0:
In [77]:
import itertools
import pandas as pd
all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]
combined = [all_person_ids, all_statuses, all_years]
df1 = pd.DataFrame(columns = ['person_id', 'status', 'year'], data=list(itertools.product(*combined)))
df1
Out[77]:
person_id status year
0 0 pass 1980
1 0 pass 1981
2 0 pass 1982
3 0 fail 1980
4 0 fail 1981
5 0 fail 1982
6 1 pass 1980
7 1 pass 1981
8 1 pass 1982
9 1 fail 1980
10 1 fail 1981
11 1 fail 1982
12 2 pass 1980
13 2 pass 1981
14 2 pass 1982
15 2 fail 1980
16 2 fail 1981
17 2 fail 1982
In [82]:
df1 = df1.merge(df, how='left').fillna(0)
df1
Out[82]:
person_id status year count
0 0 pass 1980 4
1 0 pass 1981 0
2 0 pass 1982 0
3 0 fail 1980 0
4 0 fail 1981 0
5 0 fail 1982 1
6 1 pass 1980 0
7 1 pass 1981 2
8 1 pass 1982 0
9 1 fail 1980 0
10 1 fail 1981 0
11 1 fail 1982 0
12 2 pass 1980 0
13 2 pass 1981 0
14 2 pass 1982 0
15 2 fail 1980 0
16 2 fail 1981 0
17 2 fail 1982 0
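One thing to watch with the merge approach: rows without a match come back as NaN, which turns count into a float column, so depending on your pandas version you may see 4.0 rather than 4 after fillna(0). A small sketch, assuming you want integers back and prefer to name the join keys explicitly; the variable names grid and merged are only for illustration:

```python
import itertools

import pandas as pd

df = pd.DataFrame({
    "person_id": [0, 0, 1],
    "status": ["pass", "fail", "pass"],
    "year": [1980, 1982, 1981],
    "count": [4, 1, 2],
})

keys = ["person_id", "status", "year"]
# Every combination of the known values, in product order.
grid = pd.DataFrame(
    list(itertools.product([0, 1, 2], ["pass", "fail"], [1980, 1981, 1982])),
    columns=keys,
)

# Left merge keeps every row of grid; unmatched rows get NaN in count.
merged = grid.merge(df, on=keys, how="left")
# Fill the gaps, then cast back to an integer dtype.
merged["count"] = merged["count"].fillna(0).astype("int64")
print(merged)
```

Passing on=keys is not strictly needed here, since merge defaults to the shared column names, but it keeps the join from changing silently if df later gains another column that also exists in the grid.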
Solution 3:[3]
You can use pyjanitor's complete method.
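It accepts plain column names as well as {name: values} dictionaries that give the exhaustive list of values wanted for that column. Here person_id needs the dictionary form because 2 never appears in df, while every status and year already occurs at least once: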
import janitor
df.complete({'person_id': [0,1,2]}, 'status', 'year').fillna(0, downcast='infer')
output:
person_id status year count
0 0 'fail' 1980 0
1 0 'fail' 1981 0
2 0 'fail' 1982 1
3 0 'pass' 1980 4
4 0 'pass' 1981 0
5 0 'pass' 1982 0
6 1 'fail' 1980 0
7 1 'fail' 1981 0
8 1 'fail' 1982 0
9 1 'pass' 1980 0
10 1 'pass' 1981 2
11 1 'pass' 1982 0
12 2 'fail' 1980 0
13 2 'fail' 1981 0
14 2 'fail' 1982 0
15 2 'pass' 1980 0
16 2 'pass' 1981 0
17 2 'pass' 1982 0
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | EdChum |
| Solution 3 | mozway |
