How to load and preprocess a dataset in chunks?

I have a large DataFrame, and I would like to apply a set of functions to one of its columns using a pipeline and progress_apply().

Here is my code snippet.

df = # a DataFrame object with multiple columns where df.columns[-1] == 'text'
from tqdm.auto import tqdm
tqdm.pandas()

pipeline = # list of pre-defined methods
def prepare(text, pipeline):
    """
    Clean up the text input and remove stop words.
    """
    return # list of clean tokens

# MemoryError! when reaching 50% of cleaning progress
df = df['text'].progress_apply(prepare, pipeline=pipeline)  

I am trying to work around the MemoryError while still using progress_apply(), by loading the data in chunks. I have no idea how to do this with progress_apply(). I tried the following:

for i in range(0, df.shape[0], 47):
    df = df['text'][i:i+47].progress_apply(prepare, pipeline=pipeline)

What I have tried doesn't keep the results of the previous chunks: each iteration overwrites df with the output of the current chunk only.
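For illustration, here is a minimal sketch of accumulating the chunk results in a list and concatenating them at the end, instead of overwriting df inside the loop. The `pipeline` and `prepare` definitions below are placeholder stand-ins, since the real ones are not shown in the question:

```python
import pandas as pd
from tqdm.auto import tqdm

tqdm.pandas()

# Placeholder stand-ins for the question's objects:
pipeline = [str.lower, str.strip]  # hypothetical list of cleaning methods

def prepare(text, pipeline):
    """Apply each cleaning step in `pipeline` to `text`, then tokenize."""
    for step in pipeline:
        text = step(text)
    return text.split()  # list of clean tokens

df = pd.DataFrame({"text": ["Hello World ", " Foo Bar", "Baz qux "] * 5})

chunk_size = 47
results = []  # accumulate each processed chunk instead of overwriting df
for i in range(0, df.shape[0], chunk_size):
    chunk = df["text"].iloc[i:i + chunk_size]
    results.append(chunk.progress_apply(prepare, pipeline=pipeline))

# One Series with the same index as the original 'text' column
cleaned = pd.concat(results)
```

Note that this only avoids overwriting the results; if the MemoryError comes from holding all tokens in memory at once, each chunk would instead need to be written to disk (e.g. appended to a file) before moving on to the next one.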



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
