'Pandas sum over rows using conditional substring of other columns
I want to calculate the sum of several columns based on the column name,
import pandas as pd
import numpy as np
dates = pd.date_range(start='2021-01-03',end='2021-02-02',freq='D')
df = pd.DataFrame(data={'date': dates,
'rabbit1_41a': abs(4+0.1*np.random.randn(len(dates))),
'rabbit4_100b': abs(5.2+0.5*np.random.randn(len(dates))),
'kitten11_445a': abs(0.5+0.1*np.random.randn(len(dates))),
'kitten11_72c': abs(0.8+0.5*np.random.randn(len(dates))),
'hare2_1000': abs(7+np.random.randn(len(dates))),
'hare1_58': abs(8+0.8*np.random.randn(len(dates))),
'hare1_26': abs(7.6+0.2*np.random.randn(len(dates))),
'hare3_25': abs(9.1+0.3*np.random.randn(len(dates))),
}
)
# new table, contains sum of rabbit, kitten, hare
df0 = pd.DataFrame(data={'date': dates})
species = ['rabbit', 'kitten', 'hare']
for ii in species:
for jj in df.columns:
# calculate sum of rabbit, kitten, hare
## df0[ii] = df.loc[df[jj][0:int(ii.rindex(ii[-1])+1)]==ii].sum(axis=1)
df0[ii] = np.select([df[jj].str.contains(ii)]).sum(axis=1)
print(df0.head())
The raw data in df contains daily measurements and I have a new table df0 that covers the same time period. I want to calculate the sum of each column, like
df0['rabbit'] = df['rabbit1_41a'] + df['rabbit4_100b']
df0['kitten'] = df['kitten11_445a'] + df['kitten11_72c']
df0['hare'] = df['hare2_1000'] + df['hare1_58'] + df['hare1_26'] + df['hare3_25']
How is it done with string slices? This post is useful, but I'm not sure how to adapt it to take the sum of columns.
Solution 1:[1]
If the first 4 characters of the column identify its group uniquely, you can simply do
df1 = df.groupby([c[:4] for c in df.columns], axis=1).sum()
to get something like
date hare kitt rabb
0 0.0 30.276673 1.076560 9.665169
1 0.0 31.774017 1.445791 10.263471
2 0.0 32.620976 1.627564 8.708358
...
which is not quite there yet but close. To beat it into the right shape you could rename the columns and join with df['date']] as the date in df1 got messed up:
species = ['rabbit', 'kitten', 'hare']
sd = {s[:4]:s for s in species}
df[['date']].join(df1.drop(columns = 'date')).rename(columns = sd)
output
date hare kitten rabbit
0 2021-01-03 30.276673 1.076560 9.665169
1 2021-01-04 31.774017 1.445791 10.263471
2 2021-01-05 32.620976 1.627564 8.708358
...
Solution 2
Here we do exact matching assuming species is given as in your question. Then we groupby on that
spnames = [next(s for s in species if s in colname) for colname in df.columns[1:]]
df.set_index('date').groupby(spnames, axis=1).sum()
output as required; for reference
print(spnames)
['rabbit', 'rabbit', 'kitten', 'kitten', 'hare', 'hare', 'hare', 'hare']
Solution 2:[2]
Follow the following steps
Rename columns stripping the underscore and trailing alphanumeric characters
Step 1 above allows you to melt the columns into elements defined in list
Convert the dataframe from wide to long
Groupby dates and sum. This in my opinion should be the final stage. No need to join to df0
species = ['rabbit', 'kitten', 'hare']
#Rename df columns to allow use of pd.wide_to_long
df.columns =df.columns.str.replace('\_\w+$','', regex=True)
new=(df0.set_index('date')#Set date as index
.join(
pd.wide_to_long(df, species, i="date", j="suffix").droplevel(level=1)#Melt columns into each of the element in the list and sum
.groupby('date').sum()#Do summation by date
)
)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | wwnde |
