How to load and preprocess a dataset by chunks?
I have a large DataFrame, and I would like to apply a set of functions to one of its columns using a pipeline and progress_apply().
Here is my code snippet.
from tqdm.auto import tqdm
tqdm.pandas()

df = ...        # a DataFrame with multiple columns where df.columns[-1] == 'text'
pipeline = ...  # list of pre-defined methods

def prepare(text, pipeline):
    """
    Clean up the text input and remove its stop words.
    """
    return ...  # list of clean tokens

# MemoryError! when reaching 50% of the cleaning progress
df = df['text'].progress_apply(prepare, pipeline=pipeline)
I am trying to avoid this MemoryError while still using progress_apply(), by loading and processing the data in chunks. I have no idea how to do this with progress_apply().
I tried the following:
for i in range(0, df.shape[0], 47):
    df = df['text'][i:i+47].progress_apply(prepare, pipeline=pipeline)
What I have tried doesn't save the results from the previous ranges: df is overwritten on every iteration, so only the last chunk survives.
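One way to keep the per-chunk results is to collect each chunk's output in a list and concatenate at the end instead of reassigning df inside the loop. The sketch below is a minimal, self-contained illustration of that pattern; the pipeline and prepare() here are hypothetical stand-ins for the question's pre-defined methods, and the chunk size is arbitrary.

```python
import pandas as pd
from tqdm.auto import tqdm

tqdm.pandas()

# Hypothetical stand-ins for the question's pipeline and prepare().
pipeline = [str.lower, str.strip]

def prepare(text, pipeline):
    """Apply each cleanup step in the pipeline, then tokenize."""
    for step in pipeline:
        text = step(text)
    return text.split()  # list of clean tokens

df = pd.DataFrame({"text": ["  Hello World  ", "Foo Bar  ", "  Baz Qux"]})

chunk_size = 2  # use something like 47 or larger on real data
results = []
for start in range(0, df.shape[0], chunk_size):
    chunk = df["text"].iloc[start:start + chunk_size]
    # Process one chunk at a time; append rather than overwrite.
    results.append(chunk.progress_apply(prepare, pipeline=pipeline))

tokens = pd.concat(results)
```

Note that if the concatenated result itself is too large for memory, each chunk's output could instead be written to disk (e.g. appended to a file) before moving on to the next chunk.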
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow