How to groupby a column but keep all rows as columns

I have a dataframe that resulted from a join operation. The join had multiple matches per id, producing multiple rows, and I want those matching rows spread out into columns. Here is an example:

import pandas as pd
a = pd.DataFrame([[111,2,3]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111,'999','some data'],
                  [111,'999888','some more data']],
                  columns=['id', 'B', 'C'])
c = pd.merge(a, b, on='id')

I get:

    id      var1    var2    B       C
0   111     2       3       999     some data
1   111     2       3       999888  some more data

but really I want:

    id  var1    var2    B   C           B       C
0   111 2       3       999 some data   999888  some more data

I was thinking pivot was what I wanted but it makes the value the columns, not what I want. How can I achieve this and what is this operation called?

EDIT: To clarify, I don't care about the column names, could be b1 and b2 etc.

EDIT2: Many of the solutions do not work when there are more matches (and more ids). Here is another example:

a = pd.DataFrame([[111,2,3], [222,3,4]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111,'999','some data'],
                  [111,'999888','some more data'],
                  [111,'999888777','and some more data'],
                  [222,'111222','some extra data'],
                  [222,'222333','and more extra data']],
                  columns=['id', 'B', 'C'])
c = pd.merge(a, b, on='id')
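In reshaping terms this is a long-to-wide pivot (sometimes called "widening"). A plain `pivot` fails only because it needs a unique column key per row; a per-id counter supplies one. A minimal sketch of that idea (the helper column `n` is made up for illustration):

```python
import pandas as pd

b = pd.DataFrame([[111, '999', 'some data'],
                  [111, '999888', 'some more data']],
                 columns=['id', 'B', 'C'])

# pivot needs something unique per row to spread into columns;
# numbering the matches within each id (0, 1, ...) provides it
b['n'] = b.groupby('id').cumcount()
wide = b.pivot(index='id', columns='n', values=['B', 'C'])
print(wide)
```

The resulting columns are a MultiIndex of (value, match number), e.g. `('B', 0)`, `('C', 1)`.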


Solution 1:[1]

With duplicate column names (allowing multiple result sets per id):

import pandas as pd
a = pd.DataFrame([[111,2,3], [222,3,4]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111,'999','some data'],
                  [111,'999888','some more data'],
                  [111,'999888777','and some more data'],
                  [222,'111222','some extra data'],
                  [222,'222333','and more extra data']],
                  columns=['id', 'B', 'C'])
c = pd.merge(a, b, on='id')
print(c)

resultColNames = ['B', 'C']
nResCols = len(resultColNames)
# preserve column order (a plain set difference would not)
otherColNames = [col for col in c.columns if col not in resultColNames]
# the widest group determines how many result-column sets we need
maxResultSets = c.groupby('id').size().max()
deDup = False  # set to True for unique names like B_0, C_0, B_1, ...
intToResultCol = {i: resultColNames[i % nResCols] + ('_' + str(i // nResCols) if deDup else '')
                  for i in range(maxResultSets * nResCols)}

# for each group, flatten its B/C values into a single wide row
rhs = pd.concat([
            v[resultColNames].unstack().to_frame().swaplevel().sort_index()
                .droplevel(1).reset_index(drop=True).T
            for _, v in c.groupby(otherColNames)
        ]).reset_index(drop=True).rename(columns=intToResultCol)

c = pd.concat([a, rhs], axis=1)
print(c)

Input:

    id  var1  var2          B                    C
0  111     2     3        999            some data
1  111     2     3     999888       some more data
2  111     2     3  999888777   and some more data
3  222     3     4     111222      some extra data
4  222     3     4     222333  and more extra data

Output:

    id  var1  var2       B                C       B                    C          B                   C
0  111     2     3     999        some data  999888       some more data  999888777  and some more data
1  222     3     4  111222  some extra data  222333  and more extra data        NaN                 NaN

Without duplicate column names:

Change deDup boolean in code above to True:

deDup = True

Output:

    id  var1  var2     B_0              C_0     B_1                  C_1        B_2                 C_2
0  111     2     3     999        some data  999888       some more data  999888777  and some more data
1  222     3     4  111222  some extra data  222333  and more extra data        NaN                 NaN

Old solution (prior to question update by OP):

A solution in brief:

c = pd.concat([
    c.groupby('id').nth(0).drop(columns=['B','C']).reset_index(),
    c[['B','C']].unstack().to_frame().swaplevel().sort_index().T
    ], axis=1)
c = c.rename(columns={col:col[1] for col in c.columns if isinstance(col, tuple)})

Output:

    id  var1  var2    B          C       B               C
0  111     2     3  999  some data  999888  some more data

Avoiding duplicate names:

Changing the final line renaming columns will ensure column names are not duplicated:

c = c.rename(columns={col:f'{col[1]}_{col[0]}' for col in c.columns if isinstance(col, tuple)})

Result:

    id  var1  var2  B_0        C_0     B_1             C_1
0  111     2     3  999  some data  999888  some more data

Using structured column names for each match:

To avoid duplicate column names altogether, we can instead give each column in a result set a structured name: a tuple of the result-set number (0, 1, ...) and the original column name (B or C):

import pandas as pd
a = pd.DataFrame([[111,2,3]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111,'999','some data'], [111,'999888','some more data']], columns=['id', 'B', 'C'])
c = pd.merge(a, b, on='id')
print(c)

x = c.groupby('id').nth(0).drop(columns=['B','C']).reset_index()
y = c[['B','C']].unstack().to_frame().swaplevel().sort_index().T
c = pd.concat([x, y], axis=1)
print(c)

Output:

    id  var1  var2 (0, B)     (0, C)  (1, B)          (1, C)
0  111     2     3    999  some data  999888  some more data

Solution 2:[2]

IIUC, you can first reshape your dataframe b so that it holds the duplicated columns, then join it to a:

b2 = (b
  .assign(col=b.groupby('id').cumcount())
  .pivot(index='id', columns='col')
  .sort_index(level='col', axis=1, sort_remaining=False)
  .droplevel('col', axis=1)
)

#        B          C       B               C
# id                                         
# 111  999  some data  999888  some more data


c = a.join(b2, on='id')

#     id  var1  var2    B          C       B               C
# 0  111     2     3  999  some data  999888  some more data

with non-duplicated column names:

b2 = (b.assign(col=b.groupby('id').cumcount().add(1))
  .pivot(index='id', columns='col')
  .sort_index(level='col', axis=1, sort_remaining=False)
  .pipe(lambda d: d.set_axis(d.columns.map(lambda x: '_'.join(map(str,x))),
                             axis=1))
)

#      B_1        C_1     B_2             C_2
# id                                         
# 111  999  some data  999888  some more data


c = a.join(b2, on='id')

#     id  var1  var2  B_1        C_1     B_2             C_2
# 0  111     2     3  999  some data  999888  some more data
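The same recipe also covers the larger example from EDIT2: the shorter group is simply NaN-padded. A self-contained check (a sketch recreating the EDIT2 frames):

```python
import pandas as pd

a = pd.DataFrame([[111, 2, 3], [222, 3, 4]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111, '999', 'some data'],
                  [111, '999888', 'some more data'],
                  [111, '999888777', 'and some more data'],
                  [222, '111222', 'some extra data'],
                  [222, '222333', 'and more extra data']],
                 columns=['id', 'B', 'C'])

# number the matches per id, pivot them into a second column level,
# then flatten the (column, match) pairs into names like B_1, C_1, ...
b2 = (b.assign(col=b.groupby('id').cumcount().add(1))
        .pivot(index='id', columns='col')
        .sort_index(level='col', axis=1, sort_remaining=False)
        .pipe(lambda d: d.set_axis(d.columns.map(lambda x: '_'.join(map(str, x))),
                                   axis=1)))
c = a.join(b2, on='id')
print(c)
```

Here id 222 has only two matches, so its B_3 and C_3 cells come out as NaN.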

Solution 3:[3]

Let us stack and unstack to reshape the dataframe:

k = ['id', 'var1', 'var2']
c = c.set_index([*k, c.groupby(k).cumcount().add(1)]).stack().unstack([-1, -2])
c.columns = c.columns.map('{0[0]}_{0[1]}'.format)

Result

print(c)

               B_1        C_1     B_2             C_2
id  var1 var2                                        
111 2    3     999  some data  999888  some more data
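Note that the keys end up in the row index here. If a flat frame like the other solutions is wanted, a final `reset_index()` moves them back into ordinary columns; a self-contained sketch using the original example:

```python
import pandas as pd

a = pd.DataFrame([[111, 2, 3]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111, '999', 'some data'],
                  [111, '999888', 'some more data']],
                 columns=['id', 'B', 'C'])
c = pd.merge(a, b, on='id')

k = ['id', 'var1', 'var2']
c = c.set_index([*k, c.groupby(k).cumcount().add(1)]).stack().unstack([-1, -2])
c.columns = c.columns.map('{0[0]}_{0[1]}'.format)
# move id/var1/var2 back out of the index into ordinary columns
c = c.reset_index()
print(c)
```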

Solution 4:[4]

Use:

res = c.loc[:0, ['id', 'var1', 'var2']]
temp = c[['B', 'C']].to_numpy().flatten()
# reshape to one row so the values line up with the new columns
res[[f'new{i}' for i in range(len(temp))]] = temp.reshape(1, -1)

Output:

    id  var1  var2 new0       new1    new2            new3
0  111     2     3  999  some data  999888  some more data
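This approach only keeps the first row (note the `loc[:0]`), so it breaks down with several ids. A hedged sketch extending the same flatten-the-values idea per id via `groupby` (the helper `flatten_group` and the `new0…` names are arbitrary, as in the original):

```python
import pandas as pd

a = pd.DataFrame([[111, 2, 3], [222, 3, 4]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111, '999', 'some data'],
                  [111, '999888', 'some more data'],
                  [111, '999888777', 'and some more data'],
                  [222, '111222', 'some extra data'],
                  [222, '222333', 'and more extra data']],
                 columns=['id', 'B', 'C'])
c = pd.merge(a, b, on='id')

def flatten_group(g):
    # keep the key columns once, then lay the B/C values out in one row
    vals = g[['B', 'C']].to_numpy().flatten()
    out = g.iloc[:1][['id', 'var1', 'var2']].copy()
    for i, v in enumerate(vals):
        out[f'new{i}'] = v
    return out

res = pd.concat([flatten_group(g) for _, g in c.groupby('id', sort=False)],
                ignore_index=True)
print(res)
```

Groups with fewer matches simply get NaN in the trailing `new…` columns.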

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution | Source
Solution 1 | (unattributed)
Solution 2 | (unattributed)
Solution 3 | Shubham Sharma
Solution 4 | (unattributed)