'GroupBy columns on column header prefix

I have a dataframe with column names that start with a set list of prefixes. I want to get the sum of the values in the dataframe grouped by columns that start with the same prefix.

df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]],
              columns=['abc', 'abd', 'wxy', 'wxz'])
prefixes = ['ab','wx']
df
    abc abd wxy wxz
0   1   2   3   4
1   1   2   3   4
2   1   2   3   4
3   1   2   3   4

The only way I could figure out how to do it was to loop through the prefix list, get the columns from the dataframe that startwith that string, and then sum the results.

results = []
for p in prefixes:
  results.append([p, df.loc[:, df.columns.str.startswith(p)].values.sum()])
results = pd.DataFrame(results,)
results.set_index(keys=[0], drop=True).T

    ab  wx
1   12  28

I hoped there was a more elegant way to do it, perhaps with groupby(), but I couldn't figure it out.



Solution 1:[1]

Using groupby after slice your columns

df.groupby(df.columns.str[:-1],axis=1).sum().sum().to_frame().T
Out[317]: 
   ab  wx
0  12  28

Update

l=sum([[x]*df.columns.str.startswith(x).sum() for x in prefixes],[])
df.groupby(l,axis=1).sum().sum().to_frame().T
Out[329]: 
   ab  wx
0  12  28

Solution 2:[2]

I tried the grouper method suggested by @cs95 (on different data), but it missed 4 of the prefixes, where columns beginning with each of those 4 prefixes exist in the dataframe, so 4 columns were missing in the outputted dataframe.

I got different results when I then tried:

df.groupby(df.columns.str[:2], axis=1).sum().reset_index()

This second way includes the missing 4 prefixes (which are derived from df, here hardcoded as the first 2 characters of the column names), but also gives different summation results to the grouper method.

(This other method doesn't include the prefix list that OP specifically asked about)

Solution 3:[3]

Try this, this not use groupby:

pd.concat([df.iloc[:,df.columns.str.startswith(prefixes[i])].sum().to_frame().rename(columns = {0:prefixes[i]}).sum() for i in range(len(prefixes))])

@FullMetalScientist could you please kindly validate if this works in your new data frame?

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 FullMetalScientist
Solution 3 Multivac