'Pandas sum over rows using conditional substring of other columns

I want to calculate the sum of several columns based on the column name,

import pandas as pd
import numpy as np

dates = pd.date_range(start='2021-01-03',end='2021-02-02',freq='D')
df = pd.DataFrame(data={'date': dates,
                        'rabbit1_41a': abs(4+0.1*np.random.randn(len(dates))),
                        'rabbit4_100b': abs(5.2+0.5*np.random.randn(len(dates))),
                        'kitten11_445a': abs(0.5+0.1*np.random.randn(len(dates))),
                        'kitten11_72c': abs(0.8+0.5*np.random.randn(len(dates))),
                        'hare2_1000': abs(7+np.random.randn(len(dates))),
                        'hare1_58': abs(8+0.8*np.random.randn(len(dates))),
                        'hare1_26': abs(7.6+0.2*np.random.randn(len(dates))),
                        'hare3_25': abs(9.1+0.3*np.random.randn(len(dates))),
                        }
                  )

# new table, contains sum of rabbit, kitten, hare
df0 = pd.DataFrame(data={'date': dates})

species = ['rabbit', 'kitten', 'hare']

for ii in species:
    for jj in df.columns:
        # calculate sum of rabbit, kitten, hare
##        df0[ii] = df.loc[df[jj][0:int(ii.rindex(ii[-1])+1)]==ii].sum(axis=1)
        df0[ii] = np.select([df[jj].str.contains(ii)]).sum(axis=1)
print(df0.head())

The raw data in df contains daily measurements and I have a new table df0 that covers the same time period. I want to calculate the sum of each column, like

df0['rabbit'] = df['rabbit1_41a'] + df['rabbit4_100b']     
df0['kitten'] = df['kitten11_445a'] + df['kitten11_72c']
df0['hare'] = df['hare2_1000'] + df['hare1_58'] + df['hare1_26'] + df['hare3_25']

How is it done with string slices? This post is useful, but I'm not sure how to adapt it to take the sum of columns.

Solution 1:^[1]

If the first 4 characters of the column identify its group uniquely, you can simply do

df1 = df.groupby([c[:4] for c in df.columns], axis=1).sum()

to get something like

    date    hare        kitt        rabb
0   0.0     30.276673   1.076560    9.665169
1   0.0     31.774017   1.445791    10.263471
2   0.0     32.620976   1.627564    8.708358
...

which is not quite there yet but close. To beat it into the right shape you could rename the columns and join with df['date']] as the date in df1 got messed up:

species = ['rabbit', 'kitten', 'hare']
sd = {s[:4]:s for s in species}
df[['date']].join(df1.drop(columns = 'date')).rename(columns = sd)

output

    date         hare       kitten      rabbit
0   2021-01-03  30.276673   1.076560    9.665169
1   2021-01-04  31.774017   1.445791    10.263471
2   2021-01-05  32.620976   1.627564    8.708358
...

Solution 2

Here we do exact matching assuming species is given as in your question. Then we groupby on that

spnames = [next(s for s in species if s in colname) for colname in df.columns[1:]]
df.set_index('date').groupby(spnames, axis=1).sum()

output as required; for reference

print(spnames)
['rabbit', 'rabbit', 'kitten', 'kitten', 'hare', 'hare', 'hare', 'hare']

Solution 2:^[2]

Follow the following steps

Rename columns stripping the underscore and trailing alphanumeric characters

Step 1 above allows you to melt the columns into elements defined in list

Convert the dataframe from wide to long

Groupby dates and sum. This in my opinion should be the final stage. No need to join to df0

species = ['rabbit', 'kitten', 'hare']

#Rename df columns to allow use of pd.wide_to_long

df.columns =df.columns.str.replace('\_\w+$','', regex=True)

new=(df0.set_index('date')#Set date as index
      .join(
          pd.wide_to_long(df, species, i="date", j="suffix").droplevel(level=1)#Melt columns into each of the element in the list and sum
          .groupby('date').sum()#Do summation by date
      )
          
      )

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2	wwnde

'Pandas sum over rows using conditional substring of other columns

Solution 1:[1]

Solution 2

Solution 2:[2]

Sources

Related Questions

Solution 1:^[1]

Solution 2:^[2]