Python garbage collection when rewriting variables

I am trying to run some very simple code like this in Python 3.9:

import pandas as pd

for i in some_list:
    large_data = pd.read_csv('rawdata_%s.csv' % i)
    procdata = somefunction(large_data)
    procdata.to_csv('file_%s.csv' % i)

The list has ~2,000 elements, and each raw data file can be ~200 MB. The processing produces a very small file to save (<1 MB).

I am running the code on a cluster and allocate 8 GB of memory for the task. I assumed that because I keep rebinding the same variables on every iteration, the code would be memory efficient, but sometimes I exceed the limit and get the following error:

slurmstepd: error: Job 5118871 exceeded memory limit (8323668 > 8192000), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 5118871 ON cac074 CANCELLED AT 2022-04-08T16:10:58 ***

What am I doing wrong? Isn't Python supposed to handle garbage collection by itself? Thanks!
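For reference, one variant I could try is explicitly dropping both references at the end of each iteration and calling gc.collect(). The idea (which I have not confirmed fixes it) is that on the line large_data = pd.read_csv(...), the new DataFrame is fully built before the old one is released, so the peak briefly holds two large frames at once. A minimal runnable sketch, with stand-ins for some_list and somefunction and tiny generated input files:

```python
import gc
import pandas as pd

# Hypothetical stand-ins for the question's some_list / somefunction.
some_list = [0, 1]

def somefunction(df):
    # Placeholder processing: a small summary frame.
    return df.describe()

# Tiny demo inputs so the loop below is runnable.
for i in some_list:
    pd.DataFrame({'x': range(5)}).to_csv('rawdata_%s.csv' % i, index=False)

for i in some_list:
    large_data = pd.read_csv('rawdata_%s.csv' % i)
    procdata = somefunction(large_data)
    procdata.to_csv('file_%s.csv' % i)
    # Drop both references before the next iteration, so the old
    # DataFrame can be freed before the next read_csv builds a new one;
    # otherwise two large frames can briefly coexist at the peak.
    del large_data, procdata
    gc.collect()
```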



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
