'Add sum columns back to pandas dataframe through chain style
I am trying to add a list of column sums back to my pandas dataframe through assign(). But I am not so sure how to do it when there's more than one column. What's the best way to do it or any other way to do it in a chain style given I have had other steps before?
data2.assign(data2[rate_name].abs() / data2.groupby(level = 'date')[rate_name].transform('sum'))
rate_water rate_fire rate_wood
id date
apple 2019-01-01 -0.500000 -0.500000 0.000000
orange 2019-01-01 -0.636364 -0.963636 3.000000
melon 2019-01-01 -0.333333 5.666667 27.888889
apple 2020-01-01 -0.100000 7.900000 76.000000
orange 2020-01-01 0.363636 -0.963636 26.500000
melon 2020-01-01 0.166667 6.166667 27.235043
apple 2021-01-01 0.328571 26.261702 84.220779
orange 2021-01-01 0.363636 28.036364 28.683673
melon 2021-01-01 0.611111 39.944444 27.679487
Reproducible:
from pandas import Timestamp
data2 = pd.DataFrame.from_dict({'rate_water': {('apple', Timestamp('2019-01-01 00:00:00')): -0.5, ('orange', Timestamp('2019-01-01 00:00:00')): -0.6363636363636364, ('melon', Timestamp('2019-01-01 00:00:00')): -0.33333333333333337, ('apple', Timestamp('2020-01-01 00:00:00')): -0.10000000000000009, ('orange', Timestamp('2020-01-01 00:00:00')): 0.36363636363636365, ('melon', Timestamp('2020-01-01 00:00:00')): 0.16666666666666663, ('apple', Timestamp('2021-01-01 00:00:00')): 0.3285714285714285, ('orange', Timestamp('2021-01-01 00:00:00')): 0.36363636363636365, ('melon', Timestamp('2021-01-01 00:00:00')): 0.611111111111111}, 'rate_fire': {('apple', Timestamp('2019-01-01 00:00:00')): -0.5, ('orange', Timestamp('2019-01-01 00:00:00')): -0.9636363636363636, ('melon', Timestamp('2019-01-01 00:00:00')): 5.666666666666667, ('apple', Timestamp('2020-01-01 00:00:00')): 7.9, ('orange', Timestamp('2020-01-01 00:00:00')): -0.9636363636363636, ('melon', Timestamp('2020-01-01 00:00:00')): 6.166666666666667, ('apple', Timestamp('2021-01-01 00:00:00')): 26.261702127659575, ('orange', Timestamp('2021-01-01 00:00:00')): 28.036363636363635, ('melon', Timestamp('2021-01-01 00:00:00')): 39.94444444444444}, 'rate_wood': {('apple', Timestamp('2019-01-01 00:00:00')): 0.0, ('orange', Timestamp('2019-01-01 00:00:00')): 3.0, ('melon', Timestamp('2019-01-01 00:00:00')): 27.88888888888889, ('apple', Timestamp('2020-01-01 00:00:00')): 76.0, ('orange', Timestamp('2020-01-01 00:00:00')): 26.5, ('melon', Timestamp('2020-01-01 00:00:00')): 27.235042735042736, ('apple', Timestamp('2021-01-01 00:00:00')): 84.22077922077922, ('orange', Timestamp('2021-01-01 00:00:00')): 28.683673469387756, ('melon', Timestamp('2021-01-01 00:00:00')): 27.67948717948718}})
rate_water rate_fire rate_wood sum_water sum_fire sum_wood
id date
apple 2019-01-01 -0.500000 -0.500000 0.000000 -1.469697 4.20303 30.888889
orange 2019-01-01 -0.636364 -0.963636 3.000000 -1.469697 4.20303 30.888889
melon 2019-01-01 -0.333333 5.666667 27.888889 -1.469697 4.20303 30.888889
apple 2020-01-01 -0.100000 7.900000 76.000000 0.430303 13.10303 129.735043
orange 2020-01-01 0.363636 -0.963636 26.500000 0.430303 13.10303 129.735043
melon 2020-01-01 0.166667 6.166667 27.235043 0.430303 13.10303 129.735043
apple 2021-01-01 0.328571 26.261702 84.220779 1.303319 94.24251 140.583940
orange 2021-01-01 0.363636 28.036364 28.683673 1.303319 94.24251 140.583940
melon 2021-01-01 0.611111 39.944444 27.679487 1.303319 94.24251 140.583940
Solution 1:[1]
Use dictionary comprehension for dict of Series and add it to dataFrame with unpack **:
from pandas import Timestamp
data2 = pd.DataFrame.from_dict({'rate_water': {('apple', Timestamp('2019-01-01 00:00:00')): -0.5, ('orange', Timestamp('2019-01-01 00:00:00')): -0.6363636363636364, ('melon', Timestamp('2019-01-01 00:00:00')): -0.33333333333333337, ('apple', Timestamp('2020-01-01 00:00:00')): -0.10000000000000009, ('orange', Timestamp('2020-01-01 00:00:00')): 0.36363636363636365, ('melon', Timestamp('2020-01-01 00:00:00')): 0.16666666666666663, ('apple', Timestamp('2021-01-01 00:00:00')): 0.3285714285714285, ('orange', Timestamp('2021-01-01 00:00:00')): 0.36363636363636365, ('melon', Timestamp('2021-01-01 00:00:00')): 0.611111111111111}, 'rate_fire': {('apple', Timestamp('2019-01-01 00:00:00')): -0.5, ('orange', Timestamp('2019-01-01 00:00:00')): -0.9636363636363636, ('melon', Timestamp('2019-01-01 00:00:00')): 5.666666666666667, ('apple', Timestamp('2020-01-01 00:00:00')): 7.9, ('orange', Timestamp('2020-01-01 00:00:00')): -0.9636363636363636, ('melon', Timestamp('2020-01-01 00:00:00')): 6.166666666666667, ('apple', Timestamp('2021-01-01 00:00:00')): 26.261702127659575, ('orange', Timestamp('2021-01-01 00:00:00')): 28.036363636363635, ('melon', Timestamp('2021-01-01 00:00:00')): 39.94444444444444}, 'rate_wood': {('apple', Timestamp('2019-01-01 00:00:00')): 0.0, ('orange', Timestamp('2019-01-01 00:00:00')): 3.0, ('melon', Timestamp('2019-01-01 00:00:00')): 27.88888888888889, ('apple', Timestamp('2020-01-01 00:00:00')): 76.0, ('orange', Timestamp('2020-01-01 00:00:00')): 26.5, ('melon', Timestamp('2020-01-01 00:00:00')): 27.235042735042736, ('apple', Timestamp('2021-01-01 00:00:00')): 84.22077922077922, ('orange', Timestamp('2021-01-01 00:00:00')): 28.683673469387756, ('melon', Timestamp('2021-01-01 00:00:00')): 27.67948717948718}})
data2.index.names=['id','date']
cols = ['rate_water','rate_fire','rate_wood']
data2 = data2.assign(**{rate_name.replace('rate','sum'):
data2[rate_name].abs() / data2.groupby(level = 'date')[rate_name].transform('sum')
for rate_name in cols})
print (data2)
rate_water rate_fire rate_wood sum_water sum_fire \
id date
apple 2019-01-01 -0.500000 -0.500000 0.000000 -0.340206 0.118962
orange 2019-01-01 -0.636364 -0.963636 3.000000 -0.432990 0.229272
melon 2019-01-01 -0.333333 5.666667 27.888889 -0.226804 1.348234
apple 2020-01-01 -0.100000 7.900000 76.000000 0.232394 0.602914
orange 2020-01-01 0.363636 -0.963636 26.500000 0.845070 0.073543
melon 2020-01-01 0.166667 6.166667 27.235043 0.387324 0.470629
apple 2021-01-01 0.328571 26.261702 84.220779 0.252104 0.278661
orange 2021-01-01 0.363636 28.036364 28.683673 0.279008 0.297492
melon 2021-01-01 0.611111 39.944444 27.679487 0.468888 0.423847
sum_wood
id date
apple 2019-01-01 0.000000
orange 2019-01-01 0.097122
melon 2019-01-01 0.902878
apple 2020-01-01 0.585809
orange 2020-01-01 0.204262
melon 2020-01-01 0.209928
apple 2021-01-01 0.599078
orange 2021-01-01 0.204032
melon 2021-01-01 0.196889
Another way is processing all columns together:
cols = ['rate_water','rate_fire','rate_wood']
data2 = data2.join(data2[cols].abs().div(data2.groupby(level = 'date')[cols].transform('sum') )
.rename(columns=lambda x: x.replace('rate','sum')))
cols = ['rate_water','rate_fire','rate_wood']
data2 = data2.assign(**data2[cols].abs().div(data2.groupby(level = 'date')[cols].transform('sum') )
.rename(columns=lambda x: x.replace('rate','sum')))
print (data2)
rate_water rate_fire rate_wood sum_water sum_fire \
id date
apple 2019-01-01 -0.500000 -0.500000 0.000000 -0.340206 0.118962
orange 2019-01-01 -0.636364 -0.963636 3.000000 -0.432990 0.229272
melon 2019-01-01 -0.333333 5.666667 27.888889 -0.226804 1.348234
apple 2020-01-01 -0.100000 7.900000 76.000000 0.232394 0.602914
orange 2020-01-01 0.363636 -0.963636 26.500000 0.845070 0.073543
melon 2020-01-01 0.166667 6.166667 27.235043 0.387324 0.470629
apple 2021-01-01 0.328571 26.261702 84.220779 0.252104 0.278661
orange 2021-01-01 0.363636 28.036364 28.683673 0.279008 0.297492
melon 2021-01-01 0.611111 39.944444 27.679487 0.468888 0.423847
sum_wood
id date
apple 2019-01-01 0.000000
orange 2019-01-01 0.097122
melon 2019-01-01 0.902878
apple 2020-01-01 0.585809
orange 2020-01-01 0.204262
melon 2020-01-01 0.209928
apple 2021-01-01 0.599078
orange 2021-01-01 0.204032
melon 2021-01-01 0.196889
Solution 2:[2]
One option is with the assign method and unpacking:
data2.assign(**data2
.abs()
.div(
data2.groupby('date')
.transform('sum'))
.rename(columns = lambda df: df.removeprefix('rate_'))
.add_prefix('sum_'))
rate_water rate_fire rate_wood sum_water sum_fire sum_wood
id date
apple 2019-01-01 -0.500000 -0.500000 0.000000 -0.340206 0.118962 0.000000
orange 2019-01-01 -0.636364 -0.963636 3.000000 -0.432990 0.229272 0.097122
melon 2019-01-01 -0.333333 5.666667 27.888889 -0.226804 1.348234 0.902878
apple 2020-01-01 -0.100000 7.900000 76.000000 0.232394 0.602914 0.585809
orange 2020-01-01 0.363636 -0.963636 26.500000 0.845070 0.073543 0.204262
melon 2020-01-01 0.166667 6.166667 27.235043 0.387324 0.470629 0.209928
apple 2021-01-01 0.328571 26.261702 84.220779 0.252104 0.278661 0.599078
orange 2021-01-01 0.363636 28.036364 28.683673 0.279008 0.297492 0.204032
melon 2021-01-01 0.611111 39.944444 27.679487 0.468888 0.423847 0.196889
Another option would be to concatenate along axis=1:
pd.concat([data2,
data2.abs()
.div(groupby('date')
.transform('sum'))
.rename(columns = lambda df: df.removeprefix('rate_'))
.add_prefix('sum_')],
axis = 1)
Out[103]:
rate_water rate_fire rate_wood sum_water sum_fire sum_wood
id date
apple 2019-01-01 -0.500000 -0.500000 0.000000 -0.340206 0.118962 0.000000
orange 2019-01-01 -0.636364 -0.963636 3.000000 -0.432990 0.229272 0.097122
melon 2019-01-01 -0.333333 5.666667 27.888889 -0.226804 1.348234 0.902878
apple 2020-01-01 -0.100000 7.900000 76.000000 0.232394 0.602914 0.585809
orange 2020-01-01 0.363636 -0.963636 26.500000 0.845070 0.073543 0.204262
melon 2020-01-01 0.166667 6.166667 27.235043 0.387324 0.470629 0.209928
apple 2021-01-01 0.328571 26.261702 84.220779 0.252104 0.278661 0.599078
orange 2021-01-01 0.363636 28.036364 28.683673 0.279008 0.297492 0.204032
melon 2021-01-01 0.611111 39.944444 27.679487 0.468888 0.423847 0.196889
I feel though that it would probably be cleaner to create temporary variables instead, especially if this is going to be in production code
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 |
