Memory fills up when computing/persisting a Dask DataFrame with 67 million rows

I hit this issue while analyzing multiple dataframes, each with around 67 million rows. I can compute() or export to_csv() any individual dataframe. I use a for loop to create 50 dataframes and append them all to a list (I know a for loop is not the best way to use Dask; I am still figuring it out). I then concatenate the list of 50 dataframes into one dataframe with 50 columns and take its mean. However, compute() fails on the final dataframe-mean even though it is the same size as each individual dataframe, on which compute() works perfectly. The script gets killed by the OS once it takes up 95% of the available memory, and I need that memory free afterwards to plot a graph. I tried repartitioning the dataframe, but it still gets killed by the OS.

I am not sure what is happening, or whether there is a workaround. Thank you!

import dask.dataframe as dd
import dask.array as da
Results = []

for i in range(50):
    # Create a 1-D dask array with 67,108,864 elements in a single chunk
    fft_input = da.random.randint(low=-20, high=100, size=(67108864,)).rechunk(chunks=(67108864,))
    # Convert the array to a dataframe and store it in a list
    data = dd.from_dask_array(fft_input)
    Results.append(data)

# The list contains 50 separate dataframes; concat them into 1 dataframe with 50 columns
allResult = dd.concat(Results, axis=1)
# Take the row-wise mean of the concatenated dataframe
dfff = allResult.mean(axis=1)
print(dfff)
dfff = dfff.persist()

Here is the output of the above:

Dask Series Structure:
npartitions=1
0           float64
67108863        ...
dtype: float64
Dask Name: dataframe-mean, 402 tasks

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

Environment:

Dask version: 2022.2.1
Python version: 3.8
Operating System: Linux Ubuntu 18.04
Memory: 39.2 GB
OS type: 64 bit
Install method (conda, pip, source): pip


Solution 1:[1]

I would make a function and use the yield keyword instead of return to turn the function into a generator. You can then iterate over the generator and handle one dataframe at a time, instead of holding all 50 (and their task graphs) in memory at once.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source

Solution 1: BPLP