How to groupby a column but keep all rows as columns
I have a dataframe that resulted from a join operation with multiple matches, producing multiple rows per id. I want the matched rows to be spread out into columns. Here is an example:
import pandas as pd

a = pd.DataFrame([[111, 2, 3]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111, '999', 'some data'],
                  [111, '999888', 'some more data']],
                 columns=['id', 'B', 'C'])
c = pd.merge(a, b, on='id')
I get:
id var1 var2 B C
0 111 2 3 999 some data
1 111 2 3 999888 some more data
but really I want:
id var1 var2 B C B C
0 111 2 3 999 some data 999888 some more data
I was thinking pivot was what I wanted, but it turns the values into the column names, which is not what I want. How can I achieve this, and what is this operation called?
EDIT: To clarify, I don't care about the column names; they could be b1, b2, etc.
EDIT2: Many of the solutions do not work if there are more matches. Here is another example:
a = pd.DataFrame([[111, 2, 3], [222, 3, 4]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111, '999', 'some data'],
                  [111, '999888', 'some more data'],
                  [111, '999888777', 'and some more data'],
                  [222, '111222', 'some extra data'],
                  [222, '222333', 'and more extra data']],
                 columns=['id', 'B', 'C'])
c = pd.merge(a, b, on='id')
Solution 1:[1]
With duplicate column names:
A solution allowing duplicate columns for multiple results per id:
import pandas as pd

a = pd.DataFrame([[111, 2, 3], [222, 3, 4]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111, '999', 'some data'],
                  [111, '999888', 'some more data'],
                  [111, '999888777', 'and some more data'],
                  [222, '111222', 'some extra data'],
                  [222, '222333', 'and more extra data']],
                 columns=['id', 'B', 'C'])
c = pd.merge(a, b, on='id')
print(c)

resultColNames = ['B', 'C']
nResCols = len(resultColNames)
otherColNames = list(set(c.columns) - set(resultColNames))  # id, var1, var2
maxResultSets = c.groupby('id').size().max()  # widest group sets the column count
# deDup=False keeps duplicate names (B, C, B, C, ...);
# set it to True to get suffixed names (B_0, C_0, B_1, C_1, ...).
deDup = False
intToResultCol = {i: resultColNames[i % nResCols] + ('_' + str(i // nResCols) if deDup else '')
                  for i in range(maxResultSets * nResCols)}
# Flatten each group's B/C values into one wide row, then stack the rows.
# (groupby returns the groups sorted by the key columns, matching a's row order here.)
rhs = pd.concat([
    v[resultColNames].unstack().to_frame().swaplevel().sort_index()
     .droplevel(1).reset_index(drop=True).T
    for _, v in c.groupby(otherColNames)
]).reset_index(drop=True).rename(columns=intToResultCol)
c = pd.concat([a, rhs], axis=1)
print(c)
Input:
id var1 var2 B C
0 111 2 3 999 some data
1 111 2 3 999888 some more data
2 111 2 3 999888777 and some more data
3 222 3 4 111222 some extra data
4 222 3 4 222333 and more extra data
Output:
id var1 var2 B C B C B C
0 111 2 3 999 some data 999888 some more data 999888777 and some more data
1 222 3 4 111222 some extra data 222333 and more extra data NaN NaN
Without duplicate column names:
Change the deDup boolean in the code above to True:
deDup = True
Output:
id var1 var2 B_0 C_0 B_1 C_1 B_2 C_2
0 111 2 3 999 some data 999888 some more data 999888777 and some more data
1 222 3 4 111222 some extra data 222333 and more extra data NaN NaN
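To see what the dense chain inside the concat does, here is a minimal sketch on a single hand-built group (the frame v below stands in for one group produced by c.groupby(otherColNames)):

import pandas as pd

v = pd.DataFrame({'B': ['999', '999888'],
                  'C': ['some data', 'some more data']})
s = v[['B', 'C']].unstack()     # MultiIndex Series: (column, match) -> value
wide = (s.to_frame()            # one-column frame
         .swaplevel()           # index becomes (match, column)
         .sort_index()          # order pairs by match: (0,B), (0,C), (1,B), (1,C)
         .droplevel(1)          # drop the column-name level
         .reset_index(drop=True)
         .T)                    # transpose into a single wide row
print(wide)
#      0          1       2               3
# 0  999  some data  999888  some more data

The rename(columns=intToResultCol) step then maps these integer positions back to B, C, B, C (or B_0, C_0, B_1, C_1 when deDup is True).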
Old solution (prior to question update by OP):
A solution in brief:
c = pd.concat([
    c.groupby('id').nth(0).drop(columns=['B', 'C']).reset_index(),
    c[['B', 'C']].unstack().to_frame().swaplevel().sort_index().T
], axis=1)
c = c.rename(columns={col: col[1] for col in c.columns if isinstance(col, tuple)})
Output:
id var1 var2 B C B C
0 111 2 3 999 some data 999888 some more data
Avoiding duplicate names:
Changing the final line, which renames the columns, ensures the column names are not duplicated:
c = c.rename(columns={col:f'{col[1]}_{col[0]}' for col in c.columns if isinstance(col, tuple)})
Result:
id var1 var2 B_0 C_0 B_1 C_1
0 111 2 3 999 some data 999888 some more data
Using structured column names for each match:
To get a result without duplicate column names, instead assigning the columns in each match's result set a structured name, in the form of a tuple of the result-set number (0, 1, ...) and the column name (B, C), we can do this:
import pandas as pd

a = pd.DataFrame([[111, 2, 3]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111, '999', 'some data'],
                  [111, '999888', 'some more data']],
                 columns=['id', 'B', 'C'])
c = pd.merge(a, b, on='id')
print(c)

x = c.groupby('id').nth(0).drop(columns=['B', 'C']).reset_index()
y = c[['B', 'C']].unstack().to_frame().swaplevel().sort_index().T
c = pd.concat([x, y], axis=1)
print(c)
Output:
id var1 var2 (0, B) (0, C) (1, B) (1, C)
0 111 2 3 999 some data 999888 some more data
Solution 2:[2]
IIUC, you can first reshape your dataframe b to force having duplicated columns, then join to a:
b2 = (b
      .assign(col=b.groupby('id').cumcount())
      .pivot(index='id', columns='col')
      .sort_index(level='col', axis=1, sort_remaining=False)
      .droplevel('col', axis=1)
      )
# B C B C
# id
# 111 999 some data 999888 some more data
c = a.join(b2, on='id')
# id var1 var2 B C B C
# 0 111 2 3 999 some data 999888 some more data
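The key ingredient is cumcount, which numbers the matches within each id (0, 1, ...); pivot then spreads each match number into its own pair of columns:

print(b.groupby('id').cumcount())
# 0    0
# 1    1
# dtype: int64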
With non-duplicated column names:
b2 = (b.assign(col=b.groupby('id').cumcount().add(1))
       .pivot(index='id', columns='col')
       .sort_index(level='col', axis=1, sort_remaining=False)
       .pipe(lambda d: d.set_axis(d.columns.map(lambda x: '_'.join(map(str, x))),
                                  axis=1))
      )
# B_1 C_1 B_2 C_2
# id
# 111 999 some data 999888 some more data
c = a.join(b2, on='id')
# id var1 var2 B_1 C_1 B_2 C_2
# 0 111 2 3 999 some data 999888 some more data
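This approach also handles the EDIT2 data: ids with fewer matches are simply padded with NaN by the join. A quick check, reusing the frames from EDIT2:

import pandas as pd

a = pd.DataFrame([[111, 2, 3], [222, 3, 4]], columns=['id', 'var1', 'var2'])
b = pd.DataFrame([[111, '999', 'some data'],
                  [111, '999888', 'some more data'],
                  [111, '999888777', 'and some more data'],
                  [222, '111222', 'some extra data'],
                  [222, '222333', 'and more extra data']],
                 columns=['id', 'B', 'C'])

b2 = (b.assign(col=b.groupby('id').cumcount())
       .pivot(index='id', columns='col')
       .sort_index(level='col', axis=1, sort_remaining=False)
       .droplevel('col', axis=1))
c = a.join(b2, on='id')
# id 222 has only two matches, so its third B/C pair is NaN:
#     id  var1  var2       B                C       B                    C          B                   C
# 0  111     2     3     999        some data  999888       some more data  999888777  and some more data
# 1  222     3     4  111222  some extra data  222333  and more extra data        NaN                 NaN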
Solution 3:[3]
Let us stack and unstack to reshape the dataframe:
# Number each id's matches (1, 2, ...), push everything into the index,
# then unstack the B/C level and the match counter into wide columns.
k = ['id', 'var1', 'var2']
c = c.set_index([*k, c.groupby(k).cumcount().add(1)]).stack().unstack([-1, -2])
c.columns = c.columns.map('{0[0]}_{0[1]}'.format)  # flatten ('B', 1) -> 'B_1'
Result:
print(c)
B_1 C_1 B_2 C_2
id var1 var2
111 2 3 999 some data 999888 some more data
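Note that the result keeps id, var1, and var2 in the row index; if you want them back as ordinary columns, a plain reset_index suffices:

c = c.reset_index()
#     id  var1  var2  B_1        C_1     B_2             C_2
# 0  111     2     3  999  some data  999888  some more data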
Solution 4:[4]
Use:
res = c.loc[:0, ['id', 'var1', 'var2']]            # first row's key columns
temp = c[['B', 'C']].values.flatten()              # all match values, flattened row-wise
res[[f'new{i}' for i in range(len(temp))]] = temp  # one new column per value
Output (for the original single-id example):
    id  var1  var2 new0       new1    new2            new3
0  111     2     3  999  some data  999888  some more data
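As written, this flattens every row of c at once, so it only fits the original single-id example; with the EDIT2 data the ids would be blended into one row. A sketch of the same flatten idea applied per id (the widen helper is mine, purely illustrative):

import pandas as pd

def widen(g):
    # Flatten this id's B/C values into one wide row.
    vals = g[['B', 'C']].to_numpy().flatten()
    out = g[['id', 'var1', 'var2']].iloc[:1].copy()
    out[[f'new{i}' for i in range(len(vals))]] = vals.reshape(1, -1)
    return out

res = pd.concat([widen(g) for _, g in c.groupby('id')], ignore_index=True)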
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |
Solution 2 |
Solution 3 | Shubham Sharma
Solution 4 |