'GroupBy columns on column header prefix
I have a dataframe with column names that start with a set list of prefixes. I want to get the sum of the values in the dataframe grouped by columns that start with the same prefix.
df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]],
columns=['abc', 'abd', 'wxy', 'wxz'])
prefixes = ['ab','wx']
df
abc abd wxy wxz
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
The only way I could figure out how to do it was to loop through the prefix list, get the columns from the dataframe that startwith that string, and then sum the results.
results = []
for p in prefixes:
results.append([p, df.loc[:, df.columns.str.startswith(p)].values.sum()])
results = pd.DataFrame(results,)
results.set_index(keys=[0], drop=True).T
ab wx
1 12 28
I hoped there was a more elegant way to do it, perhaps with groupby(), but I couldn't figure it out.
Solution 1:[1]
Using groupby after slice your columns
df.groupby(df.columns.str[:-1],axis=1).sum().sum().to_frame().T
Out[317]:
ab wx
0 12 28
Update
l=sum([[x]*df.columns.str.startswith(x).sum() for x in prefixes],[])
df.groupby(l,axis=1).sum().sum().to_frame().T
Out[329]:
ab wx
0 12 28
Solution 2:[2]
I tried the grouper method suggested by @cs95 (on different data), but it missed 4 of the prefixes, where columns beginning with each of those 4 prefixes exist in the dataframe, so 4 columns were missing in the outputted dataframe.
I got different results when I then tried:
df.groupby(df.columns.str[:2], axis=1).sum().reset_index()
This second way includes the missing 4 prefixes (which are derived from df, here hardcoded as the first 2 characters of the column names), but also gives different summation results to the grouper method.
(This other method doesn't include the prefix list that OP specifically asked about)
Solution 3:[3]
Try this, this not use groupby:
pd.concat([df.iloc[:,df.columns.str.startswith(prefixes[i])].sum().to_frame().rename(columns = {0:prefixes[i]}).sum() for i in range(len(prefixes))])
@FullMetalScientist could you please kindly validate if this works in your new data frame?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | FullMetalScientist |
| Solution 3 | Multivac |
