'Dataframe same key with multiple values and One Hot Encoding (Python, Pandas)?
I have the following toy dataframe example:
import pandas as pd
df = pd.DataFrame({'id': [0, 0, 0], 'key':['role', 'role', 'role'], 'val': ['admin', 'local_usr', 'fin_dep_ds']})
As you may see the same key has multiple values.
When I pivot the table, the agg function I am using is ",".join() but the final goal is to apply one hot encoding on these values:
pd.DataFrame(pd.pivot_table(df, \
values='val', \
index='id', \
columns='key', \
aggfunc=','.join).to_records())
Final goal:
id admin local_usr fin_dep_ds
0 1 1 1
Please advise how can I do it? Any best practice to tackle this situation?
Solution 1:[1]
IIUC, .pivot_table() with aggfunc="size" produces your result:
x = df.pivot_table(index="id", columns="val", aggfunc="size").reset_index()
x.columns.name = None
print(x)
Prints:
id admin fin_dep_ds local_usr
0 0 1 1 1
Solution 2:[2]
Another option is to use pandas.crosstab
res = (
pd.crosstab(index=df.id, columns=df.val)
.reset_index()
.rename_axis(columns=None)
)
Output:
>>> res
id admin fin_dep_ds local_usr
0 0 1 1 1
Setup:
import pandas as pd
df = pd.DataFrame({
'id': [0, 0, 0],
'key':['role', 'role', 'role'],
'val': ['admin', 'local_usr', 'fin_dep_ds']
})
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Andrej Kesely |
| Solution 2 | Rodalm |
