'How to add a column in pandas dataframe based on other columns for large dataset?
I have a CSV file that contains 1,000,000 rows and columns like following.
col1 col2 col3...col20
0 1 0 ... 10
0 1 0 ... 20
1 0 0 ... 30
0 1 0 ... 40
0 0 1 ... 50
................
I want to add a new column called col1_col2_col3 like following.
col1 col2 col3 col1_col2_col3 ...col20
0 1 0 2 ... 10
0 1 0 2 ... 20
1 0 0 1 ... 30
0 1 0 2 ... 40
0 0 1 3 ... 50
.................
I have loaded the data file in a pandas data frame. Then tried following.
for idx, row in df.iterrows():
if (df.loc[idx, 'col1'] == 1):
df.loc[idx, 'col1_col2_col3'] = 1
elif (df.loc[idx, 'col2'] == 1)
df.loc[idx, 'col1_col2_col3'] = 2
elif (df.loc[idx, 'col3'] == 1)
df.loc[idx, 'col1_col2_col3'] = 3
The above solution wroks. However, my code is taking very long time to run. Is there any way to create col1_col2_col3 fast?
Solution 1:[1]
Here's one way using multiplication. The idea is to multiply each column by 1, 2 or 3 depending on which column it is, then keep the nonzero values:
df['col1_col2_col3'] = df[['col1','col2','col3']].mul([1,2,3]).mask(lambda x: x==0).bfill(axis=1)['col1'].astype(int)
N.B. It assumes that each row can have only one nonzero value in columns ['col1_col2_col3'].
Output:
col1 col2 col3 ... col20 col1_col2_col3
0 0 1 0 ... 10 2
1 0 1 0 ... 20 2
2 1 0 0 ... 30 1
3 0 1 0 ... 40 2
4 0 0 1 ... 50 3
Solution 2:[2]
You can use Numpy's argmax
df.assign(
col1_col2_col3=
df[['col1', 'col2', 'col3']].to_numpy().argmax(axis=1) + 1
)
col1 col2 col3 col20 col1_col2_col3
0 0 1 0 10 2
1 0 1 0 20 2
2 1 0 0 30 1
3 0 1 0 40 2
4 0 0 1 50 3
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | piRSquared |
