'Divide element by sum of groupby in dask without setting index for every column
I have code implemented in pandas, but am having trouble converting to dask because I need to use set_index(), what is the best work around? Using dask because I need to scale this to much larger dataframes.
I am looking to return a dataframe where each element is divided by the column-wise sum of a group. Example dataframe that looks like this
df = [
[1,4,2,1],
[4,4,0,-1],
[2,3,1,6],
[-2,1,0,-1],
[6,-3,-2,-1],
[1,0,5,5],
]
df = pd.DataFrame(df)
lab_id = ['a','b','a','b','a','c']
df['lab_id'] = lab_id
df
0 1 2 3 lab_id
0 1 4 2 1 a
1 4 4 0 -1 b
2 2 3 1 6 a
3 -2 1 0 -1 b
4 6 -3 -2 -1 a
5 1 0 5 5 c
Currently in pandas I do a groupby by sum to return a dataframe:
sum_df = df.groupby('lab_id').sum()
sum_df
0 1 2 3
lab_id
a 9 4 1 6
b 2 5 0 -2
c 1 0 5 5
And then I set the index of the original data frame and divide by the sum dataframe:
df.set_index('lab_id')/sum_df
0 1 2 3
lab_id
a 0.111111 1.00 2.0 0.166667
a 0.222222 0.75 1.0 1.000000
a 0.666667 -0.75 -2.0 -0.166667
b 2.000000 0.80 NaN 0.500000
b -1.000000 0.20 NaN 0.500000
c 1.000000 NaN 1.0 1.000000
The main problem is that I am having a huge issue setting index in dask, which explicitly mentions to avoid using set_index() and reset_index() methods. I simply can't find a way around doing so!
I have tried many arcane ways to set index outside of dask such as creating a new dataframe with the index already set and a row of dummy data and iteratively assigning the columns from the old dataframe (this is some of the worst code i've written).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
