Pandas DataFrame: create columns by grouping cells by values
I have the following problem. My DataFrame looks like this (the real one has 100,000 entries):
col_1 col_2 col_3
green yellow red
yellow green purple
green yellow red
yellow brown green
red yellow purple
red green yellow
What I want, though, is for all the greens to be in one column, all the reds in another, all the yellows in another, and so on. So it should look like this:
col_1  col_2   col_3  col_4   col_5
green  yellow  red
green  yellow         purple
green  yellow  red
green  yellow                 brown
       yellow  red    purple
green  yellow  red
How do I do this? Thanks in advance.
Solution 1:[1]
Here is one approach with pandas.get_dummies or str.get_dummies:
# credit https://stackoverflow.com/a/71143503
df2 = df.apply('|'.join, axis=1).str.get_dummies()
out = df2*df2.columns
or
df2 = (
    df.apply(lambda c: pd.get_dummies(c).stack())
      .max(axis=1)
      .unstack()
      .astype(int)
)
out = df2*df2.columns
output:
   brown  green  purple  red  yellow
0         green          red  yellow
1         green  purple       yellow
2         green          red  yellow
3  brown  green               yellow
4                purple  red  yellow
5         green          red  yellow
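For anyone who wants to run the first variant end to end, here is a minimal, self-contained sketch. It rebuilds the sample frame from the question and swaps the df2*df2.columns multiplication for np.where, which is a substitution on my part rather than the answer's own method. The name indicators is purely illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({
    "col_1": ["green", "yellow", "green", "yellow", "red", "red"],
    "col_2": ["yellow", "green", "yellow", "brown", "yellow", "green"],
    "col_3": ["red", "purple", "red", "green", "purple", "yellow"],
})

# Join each row into one string such as "green|yellow|red", then split it
# back into one 0/1 indicator column per distinct value.
indicators = df.apply("|".join, axis=1).str.get_dummies()

# Substitution (assumption, not from the answer): np.where instead of
# multiplying integers by strings. Each flagged cell gets its column label,
# every other cell gets an empty string.
out = pd.DataFrame(
    np.where(indicators.to_numpy(dtype=bool), indicators.columns.to_numpy(), ""),
    index=indicators.index,
    columns=indicators.columns,
)
print(out)
```

The np.where form only relies on broadcasting a row of labels against a boolean mask, so it does not depend on how a given pandas version handles integer-times-string arithmetic.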
alternative output:
df2 = df.apply('|'.join, axis=1).str.get_dummies()
out = df2*df2.columns
out.columns = [f'col_{i}' for i,_ in enumerate(out, start=1)]
output:
   col_1  col_2  col_3   col_4  col_5
0         green          red    yellow
1         green  purple         yellow
2         green          red    yellow
3  brown  green                 yellow
4                purple  red    yellow
5         green          red    yellow
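Note that get_dummies sorts the values alphabetically, while the layout in the question runs green, yellow, red, purple, brown. On the sample data that matches the order in which values first appear when reading row by row. This is my inference, the question does not state it. Under that assumption, a rough sketch that reproduces the requested layout could look like this (order, values and present are illustrative names):

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({
    "col_1": ["green", "yellow", "green", "yellow", "red", "red"],
    "col_2": ["yellow", "green", "yellow", "brown", "yellow", "green"],
    "col_3": ["red", "purple", "red", "green", "purple", "yellow"],
})

values = df.to_numpy()

# Assumption: target column order = order of first appearance, row by row.
order = pd.unique(values.ravel())

# One boolean column per distinct value: True where the row contains it.
present = np.column_stack([(values == v).any(axis=1) for v in order])

# Put the value itself in every flagged cell, an empty string elsewhere.
out = pd.DataFrame(
    np.where(present, order, ""),
    index=df.index,
    columns=[f"col_{i}" for i in range(1, len(order) + 1)],
)
print(out)
```

This loops once per distinct value, so it should stay cheap as long as the number of distinct values is small compared with the number of rows.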
Solution 2:[2]
Here's one approach: convert the DataFrame to one-hot encoded columns with get_dummies, sum the indicator columns that belong to the same value, and use np.where to fill each cell with its column name. Finally, fix the column names:
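import numpy as np
import pandas as pd

# one indicator column per (column, value) pair, e.g. col_1_green
s = pd.get_dummies(df)
# keep only the value part of each label
s.columns = [c.split('_')[-1] for c in s.columns]
# merge the indicator columns that now share a label
s = s.groupby(level=0, axis=1).sum()
out = (s.apply(lambda c: np.where(c, c.name, ''))
       .rename(columns=dict(zip(s.columns, ['col5','col1','col4','col3','col2'])))
       .sort_index(axis=1))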
The same code using chained methods:
out = (pd.get_dummies(df.set_axis(['0']*3, axis=1))
       .pipe(lambda x: x.set_axis([c.split('_')[1] for c in x], axis=1))
       .groupby(level=0, axis=1).sum()
       .apply(lambda c: np.where(c, c.name, ''))
       .set_axis(['col5','col1','col4','col3','col2'], axis=1)
       .sort_index(axis=1)
      )
Output:
   col1   col2    col3  col4    col5
0  green  yellow  red
1  green  yellow        purple
2  green  yellow  red
3  green  yellow                brown
4         yellow  red   purple
5  green  yellow  red
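One caveat: as far as I know, passing axis=1 to DataFrame.groupby has been deprecated since pandas 2.1, with grouping on the transposed frame as the suggested replacement. A minimal sketch of the same idea with that change follows. The transpose and the pd.Series wrapper are my adjustments, not part of the answer, and the final renaming to col1..col5 is left out to keep it short.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({
    "col_1": ["green", "yellow", "green", "yellow", "red", "red"],
    "col_2": ["yellow", "green", "yellow", "brown", "yellow", "green"],
    "col_3": ["red", "purple", "red", "green", "purple", "yellow"],
})

# One indicator column per (column, value) pair, e.g. "col_1_green".
s = pd.get_dummies(df)

# Keep only the value part of each label, so labels repeat across columns.
s.columns = [c.split("_")[-1] for c in s.columns]

# Adjustment (assumption): group the duplicated labels on the transposed
# frame instead of calling groupby(..., axis=1), then transpose back.
s = s.T.groupby(level=0).sum().T

# Fill each flagged cell with its column label. Returning a Series keeps
# the original row index intact.
out = s.apply(lambda c: pd.Series(np.where(c > 0, c.name, ""), index=c.index))
print(out)
```

On a frame with 100,000 rows the two transposes copy the data, so it may be worth timing this against the other variants before settling on one.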
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
