'Pandas Dataframe create columns by grouping cells by values

I have the following problem. My DataFrame looks like this (only with 100.000 entries):

col_1      col_2        col_3
green      yellow       red
yellow     green        purple
green      yellow       red
yellow     brown        green
red        yellow       purple
red        green        yellow 

What I want though, is that all the greens are in one column, all the reds and all the yellows, etc. So it should look like this:

col_1      col_2        col_3      col_4      col_5 
green      yellow       red
green      yellow                  purple
green      yellow       red
green      yellow                             brown
           yellow       red        purple
green      yellow       red

How do I do this? Thanks in advance.



Solution 1:[1]

Here is one approach with pandas.get_dummies or str.get_dummies:

# credit https://stackoverflow.com/a/71143503
df2 = df.apply('|'.join, axis=1).str.get_dummies()
out = df2*df2.columns

or

df2 = (
 df.apply(lambda c: pd.get_dummies(c).stack())
   .max(1)
   .unstack()
   .astype(int)
)
out = df2*df2.columns

output:

   brown  green  purple  red  yellow
0         green          red  yellow
1         green  purple       yellow
2         green          red  yellow
3  brown  green               yellow
4                purple  red  yellow
5         green          red  yellow

alternative output:

df2 = df.apply('|'.join, axis=1).str.get_dummies()

out = df2*df2.columns

out.columns = [f'col_{i}' for i,_ in enumerate(out, start=1)]

output:

   col_1  col_2   col_3 col_4   col_5
0         green           red  yellow
1         green  purple        yellow
2         green           red  yellow
3  brown  green                yellow
4                purple   red  yellow
5         green           red  yellow

Solution 2:[2]

Here's one approach: With get_dummies convert it to one-hot encoded columns; sum across the columns and use np.where to populate the DataFrame with column names. Finally, fix the column names:

s = pd.get_dummies(df)
s.columns = [c.split('_')[-1] for c in s.columns]
s = s.groupby(level=0, axis=1).sum()
out = (s.apply(lambda c: np.where(c, c.name, ''))
       .rename(columns=dict(zip(s.columns, ['col5','col1','col4','col3','col2'])))
       .sort_index(axis=1))

The same code using chained methods:

out = (pd.get_dummies(df.set_axis(['0']*3, axis=1))
       .pipe(lambda x: x.set_axis([c.split('_')[1] for c in x], axis=1))
       .groupby(level=0, axis=1).sum()
       .apply(lambda c: np.where(c, c.name, ''))
       .set_axis(['col5','col1','col4','col3','col2'], axis=1)
       .sort_index(axis=1)
      )

Output:

    col1    col2 col3    col4   col5
0  green  yellow  red               
1  green  yellow       purple       
2  green  yellow  red               
3  green  yellow               brown
4         yellow  red  purple       
5  green  yellow  red               

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2