Concatenate all columns in a pandas dataframe
I have multiple pandas DataFrames which may have different numbers of columns; the number of columns typically varies from 50 to 100. I need to create a final column that is simply all the columns concatenated. Basically, the string in the first row of that column should be the sum (concatenation) of the strings in the first row of all the columns. I wrote the loop below, but I feel there might be a better, more efficient way to do this. Any ideas on how to do this?
num_columns = df.columns.shape[0]
col_names = df.columns.values.tolist()
df.loc[:, 'merged'] = ""
for each_col_ind in range(num_columns):
    print('Concatenating', col_names[each_col_ind])
    df.loc[:, 'merged'] = df.loc[:, 'merged'] + df[col_names[each_col_ind]]
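For context, here is a minimal runnable version of that loop on made-up data (the column names and values below are purely illustrative), showing what the 'merged' column should contain:
import pandas as pd
# Made-up sample; the real frames have 50-100 string columns.
df = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['1', '2'], 'C': ['x', 'y']})
df['merged'] = ''
for col in df.columns.drop('merged'):
    df['merged'] = df['merged'] + df[col]
print(df['merged'].tolist())  # ['foo1x', 'bar2y']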
Solution 1:[1]
A solution with sum; but the output is float, so converting to int and then to str is necessary:
df['new'] = df.sum(axis=1).astype(int).astype(str)
Another solution applies ''.join row-wise, but it is the slowest:
df['new'] = df.apply(''.join, axis=1)
Last, a very fast NumPy solution: convert to a NumPy array and then sum:
df['new'] = df.values.sum(axis=1)
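As a quick, hedged illustration of the NumPy variant on all-string columns (the frame below matches the timing setup that follows and is only for demonstration):
import pandas as pd
# Illustrative frame of string columns.
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
# Row-wise sum of an object (string) array concatenates the strings.
df['new'] = df.values.sum(axis=1)
print(df['new'].tolist())  # ['147', '258', '369']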
Timings:
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
df = pd.concat([df]*10000).reset_index(drop=True)
#print (df)
#[30000 rows x 3 columns]
cols = list('ABC')
#not_a_robot solution
In [259]: %timeit df['concat'] = pd.Series(df[cols].fillna('').values.tolist()).str.join('')
100 loops, best of 3: 17.4 ms per loop
In [260]: %timeit df['new'] = df[cols].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 386 ms per loop
In [261]: %timeit df['new1'] = df[cols].values.sum(axis=1)
100 loops, best of 3: 6.5 ms per loop
In [262]: %timeit df['new2'] = df[cols].astype(str).sum(axis=1).astype(int).astype(str)
10 loops, best of 3: 68.6 ms per loop
EDIT: If the dtypes of some columns are not object (i.e. not already strings), cast them with DataFrame.astype:
df['new'] = df.astype(str).values.sum(axis=1)
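A short sketch of that cast on made-up mixed-dtype data (the column names and values here are assumptions for illustration, not from the original answer):
import pandas as pd
# Made-up frame with a numeric column mixed in.
df = pd.DataFrame({'host': ['a', 'b'], 'port': [80, 8080], 'proto': ['tcp', 'udp']})
# astype(str) makes every cell a string, so the row-wise sum concatenates.
df['new'] = df.astype(str).values.sum(axis=1)
print(df['new'].tolist())  # ['a80tcp', 'b8080udp']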
Solution 2:[2]
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')
Gives us:
df
Out[6]:
A B C concat
0 1 4 7 147
1 2 5 8 258
2 3 6 9 369
To select a given set of columns:
df['concat'] = pd.Series(df[['A', 'B']].fillna('').values.tolist()).str.join('')
df
Out[8]:
A B C concat
0 1 4 7 14
1 2 5 8 25
2 3 6 9 36
However, I've noticed that approach can sometimes result in NaNs being populated where they shouldn't, so here's another way:
>>> from functools import reduce
>>> cols = ['A', 'B', 'C']
>>> df['concat'] = df[cols].apply(lambda x: reduce(lambda a, b: a + b, x), axis=1)
>>> df
A B C concat
0 1 4 7 147
1 2 5 8 258
2 3 6 9 369
Although it should be noted that this approach is a lot slower:
$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[df[["a", "b"]].apply(lambda x: reduce(lambda a, b: a + b, x)) for _ in range(10)]'
10 loops, best of 3: 451 msec per loop
Versus
$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[pd.Series(df[["a", "b"]].fillna("").values.tolist()).str.join(" ") for _ in range(10)]'
10 loops, best of 3: 98.5 msec per loop
Solution 3:[3]
I don't have enough reputation to comment, so I'm building my answer off of blacksite's response.
For clarity, LunchBox commented that it failed for Python 3.7.0. It also failed for me on Python 3.6.3. Here is the original answer by blacksite:
df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')
Here is my modification for Python 3.6.3:
df['concat'] = pd.Series(df.fillna('').values.tolist()).map(lambda x: ''.join(map(str,x)))
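A small, hedged check of that modification on made-up data with a numeric column (the column names are assumptions): plain ''.join raises a TypeError on non-string cells, while mapping str over each row's values works:
import pandas as pd
# Made-up frame: one numeric column, one string column.
df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
# ''.join alone would fail on the integer cells; map(str, ...) converts them first.
df['concat'] = pd.Series(df.fillna('').values.tolist()).map(lambda x: ''.join(map(str, x)))
print(df['concat'].tolist())  # ['1x', '2y']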
Solution 4:[4]
The solutions given above that use numpy arrays have worked great for me.
However, one thing to be careful about is the indexing when you get the numpy.ndarray from df.values, since the axis labels are removed from df.values.
So to take one of the solutions offered above (the one that I use most often) as an example:
df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')
This portion:
df.fillna('').values
does not preserve the indices of the original DataFrame. Not a problem when the DataFrame has the common 0, 1, 2, ... row indexing scheme, but this solution will not work when the DataFrame is indexed in any other way. You can fix this by adding an index= argument to pd.Series():
df['concat'] = pd.Series(df.fillna('').values.tolist(),
index=df.index).str.join('')
I always add the index= argument just to be safe, even when I'm sure the DataFrame is row-indexed as 0, 1, 2, ...
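As an illustration of that pitfall (the row labels here are made up, not from the original answer): with a non-default index, the unlabelled Series mis-aligns on assignment, while passing index=df.index keeps the rows matched:
import pandas as pd
# Made-up frame with a non-default (string) index.
df = pd.DataFrame({'A': ['1', '2'], 'B': ['4', '5']}, index=['r1', 'r2'])
# Without index=, the Series carries a fresh 0, 1 RangeIndex, which does not
# match 'r1', 'r2', so assignment aligns on nothing and produces NaN.
without_index = pd.Series(df.fillna('').values.tolist()).str.join('')
# With index=df.index the rows stay matched.
with_index = pd.Series(df.fillna('').values.tolist(), index=df.index).str.join('')
df['bad'] = without_index    # NaN, NaN
df['good'] = with_index      # '14', '25'
print(df)
#     A  B  bad good
# r1  1  4  NaN   14
# r2  2  5  NaN   25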
Solution 5:[5]
This lambda-based approach offers some flexibility in which columns are chosen and which separator is used:
Setup:
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Concatenate All Columns - no separator:
cols = ['A', 'B', 'C']
df['combined'] = df[cols].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
A B C combined
0 1 4 7 147
1 2 5 8 258
2 3 6 9 369
Concatenate Two Columns A and C with '_' separator:
cols = ['A', 'C']
df['combined'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
A B C combined
0 1 4 7 1_7
1 2 5 8 2_8
2 3 6 9 3_9
Solution 6:[6]
As a solution to @Gary Dorman's question in the comments, I wanted a delimiter in place so that, when you look at the overall column, you can see how it is broken out.
You could use:
df_tmp=df.astype(str) + ','
df_tmp.sum(axis=1).str.rstrip(',')
before:
1.2.3.480tcp
6.6.6.680udp
7.7.7.78080tcp
8.8.8.88080tcp
9.9.9.98080tcp
after:
1.2.3.4,80,tcp
6.6.6.6,80,udp
7.7.7.7,8080,tcp
8.8.8.8,8080,tcp
9.9.9.9,8080,tcp
which looks better (more like a CSV :). This additional separator step is about 30% slower on my machine.
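A self-contained sketch of that separator trick on the small A/B/C frame used in the earlier answers (wrapping it in a helper function is my own framing, not part of the original answer):
import pandas as pd
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
def concat_with_sep(frame, sep=','):
    # Append the separator to every cell, sum row-wise, then strip the
    # trailing separator left by the last column.
    tmp = frame.astype(str) + sep
    return tmp.sum(axis=1).str.rstrip(sep)
df['combined'] = concat_with_sep(df)
print(df['combined'].tolist())  # ['1,4,7', '2,5,8', '3,6,9']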
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | bodily11 |
| Solution 4 | nick_montpetit |
| Solution 5 | Grant Shannon |
| Solution 6 | Osadhi Virochana |
