Fill NaN in selected columns based on a dictionary whose keys are column names and whose values are contents of another column in Python
Given the dataframe df1 as follows:
         id  products  black metal  non-ferrous metals  precious metal
0  M0066350    copper          NaN                 NaN             NaN
1  M0066352  aluminum          NaN                 NaN             NaN
2  M0066353      gold          NaN                 NaN             NaN
3  M0066354    silver          NaN                 NaN             NaN
4  S0200837   soybean          NaN                 NaN             NaN
5  S0212350     Apple          NaN                 NaN             NaN
6  S0212351  iron ore          NaN                 NaN             NaN
7  S0212352      coke          NaN                 NaN             NaN
8  S0212353    others          1.0                 NaN             1.0
I want to fill the columns cols = ['black metal', 'non-ferrous metals', 'precious metal'] with 1s based on customized_dict:
customized_dict = {
    'black metal': ['iron ore', 'coke'],
    'non-ferrous metals': ['copper', 'aluminum'],
    'precious metal': ['gold', 'silver']
}
Please note that the keys are column names of df1 and the values come from the products column of df1.
So my question is: how can I get the following output?
         id  products  black metal  non-ferrous metals  precious metal
0  M0066350    copper          NaN                 1.0             NaN
1  M0066352  aluminum          NaN                 1.0             NaN
2  M0066353      gold          NaN                 NaN             1.0
3  M0066354    silver          NaN                 NaN             1.0
4  S0200837   soybean          NaN                 NaN             NaN
5  S0212350     Apple          NaN                 NaN             NaN
6  S0212351  iron ore          1.0                 NaN             NaN
7  S0212352      coke          1.0                 NaN             NaN
8  S0212353    others          1.0                 NaN             1.0
EDIT: new data with duplicates in products column.
         id  products  black metal  non-ferrous metals  precious metal
0  S0212350     Apple          NaN                 NaN             NaN
1  M0066352  aluminum          NaN                 1.0             NaN
2  S0212352      coke          1.0                 NaN             NaN
3  S0212354      coke          1.0                 NaN             NaN
4  M0066350    copper          NaN                 1.0             NaN
5  M0066353      gold          NaN                 NaN             1.0
6  S0212351  iron ore          1.0                 NaN             NaN
7  S0212353    others          1.0                 NaN             1.0
8  M0066354    silver          NaN                 NaN             1.0
9  S0200837   soybean          NaN                 NaN             NaN
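The answers below assume that df1 and customized_dict already exist. As a minimal sketch for experimenting with them, the first sample frame from the question could be rebuilt like this, assuming the empty cells are float NaN values (the question does not show the dtypes):

```python
import numpy as np
import pandas as pd

# Sample frame from the question; blank cells are assumed to be float NaN.
df1 = pd.DataFrame({
    "id": ["M0066350", "M0066352", "M0066353", "M0066354", "S0200837",
           "S0212350", "S0212351", "S0212352", "S0212353"],
    "products": ["copper", "aluminum", "gold", "silver", "soybean",
                 "Apple", "iron ore", "coke", "others"],
    "black metal": [np.nan] * 8 + [1.0],
    "non-ferrous metals": [np.nan] * 9,
    "precious metal": [np.nan] * 8 + [1.0],
})

# Keys are column names of df1, values are entries of the products column.
customized_dict = {
    "black metal": ["iron ore", "coke"],
    "non-ferrous metals": ["copper", "aluminum"],
    "precious metal": ["gold", "silver"],
}
cols = list(customized_dict)

print(df1)
```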
Solution 1:[1]
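Using a simple loop over the columns (via apply) and update:
customized_dict = {
    'black metal': ['iron ore', 'coke'],
    'non-ferrous metals': ['copper', 'aluminum'],
    'precious metal': ['gold', 'silver']
}
# for each category column, keep only the rows whose product is listed for it,
# fill their NaN with 1, then write the result back into df1 in place
df1.update(
    df1.iloc[:, 2:].apply(
        lambda c: c[df1['products'].isin(customized_dict[c.name])].fillna(1)
    )
)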
output:
         id  products  black metal  non-ferrous metals  precious metal
0  M0066350    copper          NaN                 1.0             NaN
1  M0066352  aluminum          NaN                 1.0             NaN
2  M0066353      gold          NaN                 NaN             1.0
3  M0066354    silver          NaN                 NaN             1.0
4  S0200837   soybean          NaN                 NaN             NaN
5  S0212350     Apple          NaN                 NaN             NaN
6  S0212351  iron ore          1.0                 NaN             NaN
7  S0212352      coke          1.0                 NaN             NaN
8  S0212353    others          1.0                 NaN             1.0
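The same idea can also be written as an explicit loop with a boolean mask, which some readers may find easier to follow. This is only a sketch, not the answerer's code. The input frame is a hypothetical one built from the ids and products in the question's EDIT (two "coke" rows), with the category columns assumed to be empty except for the "others" row:

```python
import numpy as np
import pandas as pd

customized_dict = {
    "black metal": ["iron ore", "coke"],
    "non-ferrous metals": ["copper", "aluminum"],
    "precious metal": ["gold", "silver"],
}

# Hypothetical input: products from the question's EDIT, including a duplicate.
df1 = pd.DataFrame({
    "id": ["S0212350", "M0066352", "S0212352", "S0212354", "M0066350",
           "M0066353", "S0212351", "S0212353", "M0066354", "S0200837"],
    "products": ["Apple", "aluminum", "coke", "coke", "copper",
                 "gold", "iron ore", "others", "silver", "soybean"],
})
for col in customized_dict:
    df1[col] = np.nan
# The "others" row already carries values, as in the question.
df1.loc[df1["products"].eq("others"), ["black metal", "precious metal"]] = 1.0

for col, products in customized_dict.items():
    # Rows whose product belongs to this column and whose cell is still NaN.
    mask = df1["products"].isin(products) & df1[col].isna()
    df1.loc[mask, col] = 1.0

print(df1)
```

Because the mask is computed row by row from the products column, repeated products are handled without any special treatment, and cells that already hold a value are left untouched.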
Solution 2:[2]
Use:
import pandas as pd

# list comprehension of (product, column) pairs for a MultiIndex Series of 1s
L = [(x, k) for k, v in customized_dict.items() for x in v]
# reshape into a DataFrame: products as index, categories as columns
df2 = pd.Series(1, index=pd.MultiIndex.from_tuples(L)).unstack()
# fill missing values, aligning on the products column converted to index
df = (df1.set_index('products')
         .combine_first(df2)
         .rename_axis('products')
         .reset_index()
         .reindex(df1.columns, axis=1))
print(df)
         id  products  black metal  non-ferrous metals  precious metal
0  M0066350    copper          NaN                 1.0             NaN
1  M0066352  aluminum          NaN                 1.0             NaN
2  M0066353      gold          NaN                 NaN             1.0
3  M0066354    silver          NaN                 NaN             1.0
4  S0200837   soybean          NaN                 NaN             NaN
5  S0212350     Apple          NaN                 NaN             NaN
6  S0212351  iron ore          1.0                 NaN             NaN
7  S0212352      coke          1.0                 NaN             NaN
8  S0212353    others          1.0                 NaN             1.0
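To see what the reshape step in this answer produces, it may help to build the intermediate frame on its own. The sketch below repeats only that step; the name pairs is used here instead of L purely for readability:

```python
import pandas as pd

customized_dict = {
    "black metal": ["iron ore", "coke"],
    "non-ferrous metals": ["copper", "aluminum"],
    "precious metal": ["gold", "silver"],
}

# One (product, column) pair per dictionary entry.
pairs = [(product, col)
         for col, products in customized_dict.items()
         for product in products]

# A Series of 1s indexed by (product, column); unstack moves the column
# names into the columns, leaving NaN where a pair does not exist.
df2 = pd.Series(1, index=pd.MultiIndex.from_tuples(pairs)).unstack()

print(df2)
```

The result has one row per product named in the dictionary and one column per key, which is the shape combine_first needs once products is the index of df1. Because combine_first aligns the two frames on that index, the row order of the result can differ from df1, and frames with duplicate products deserve a quick check before relying on it.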
Solution 3:[3]
Create a reversed mapping (product to column name), use crosstab to build a frame holding 1 where a row's product belongs to a column, then pass that frame to fillna:
reversed_dict = {v: k for k, l in customized_dict.items() for v in l}
df1 = df1.fillna(pd.crosstab(df1.index, df1['products'].map(reversed_dict), values=1, aggfunc='mean'))
print(df1)
# Output
         id  products  black metal  non-ferrous metals  precious metal
0  M0066350    copper          NaN                 1.0             NaN
1  M0066352  aluminum          NaN                 1.0             NaN
2  M0066353      gold          NaN                 NaN             1.0
3  M0066354    silver          NaN                 NaN             1.0
4  S0200837   soybean          NaN                 NaN             NaN
5  S0212350     Apple          NaN                 NaN             NaN
6  S0212351  iron ore          1.0                 NaN             NaN
7  S0212352      coke          1.0                 NaN             NaN
8  S0212353    others          1.0                 NaN             1.0
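A variation on the reversed-mapping idea, shown here only as a sketch and not as part of the original answer, swaps crosstab for pd.get_dummies to build the fill frame. The sample frame is rebuilt from the question, assuming the empty cells are float NaN:

```python
import numpy as np
import pandas as pd

customized_dict = {
    "black metal": ["iron ore", "coke"],
    "non-ferrous metals": ["copper", "aluminum"],
    "precious metal": ["gold", "silver"],
}
cols = list(customized_dict)

df1 = pd.DataFrame({
    "id": ["M0066350", "M0066352", "M0066353", "M0066354", "S0200837",
           "S0212350", "S0212351", "S0212352", "S0212353"],
    "products": ["copper", "aluminum", "gold", "silver", "soybean",
                 "Apple", "iron ore", "coke", "others"],
    "black metal": [np.nan] * 8 + [1.0],
    "non-ferrous metals": [np.nan] * 9,
    "precious metal": [np.nan] * 8 + [1.0],
})

# product -> column name
reversed_dict = {product: col
                 for col, products in customized_dict.items()
                 for product in products}

# Column each row's product belongs to; NaN for products not in the dict.
category = df1["products"].map(reversed_dict)

# One indicator column per category, forced to the same columns as cols.
hits = pd.get_dummies(category).reindex(columns=cols, fill_value=0).astype(bool)

# 1.0 where a product matches the column, NaN elsewhere.
fill = hits.astype(float).where(hits)

# fillna only touches cells that are still NaN, so existing 1.0 values stay.
df1[cols] = df1[cols].fillna(fill)

print(df1)
```

Since the indicator frame keeps the row index of df1, repeated products do not need special handling.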
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | mozway |
| Solution 2 | |
| Solution 3 | ah bon |
