sklearn TfidfVectorizer giving MemoryError

I get a MemoryError: Unable to allocate 61.4 GiB for an array with shape (50000, 164921) and data type float64:

from sklearn.feature_extraction.text import TfidfVectorizer

# remove_stopwords is the asker's custom analyzer callable, defined elsewhere
tfidf = TfidfVectorizer(analyzer=remove_stopwords)

X = tfidf.fit_transform(df['lemmatize'])
print(X.shape)

Output: (50000, 164921)

Now, here comes the memory error. Densifying that matrix means allocating 50,000 × 164,921 float64 cells, i.e. roughly 8.2 billion values at 8 bytes each, which is about 61.4 GiB:

df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names())

MemoryError: Unable to allocate 61.4 GiB for an array with shape (50000, 164921) and data type float64
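For reference, the size of the dense array can be predicted from the shape before calling toarray(). A small sketch, assuming X is the fitted sparse matrix from above:

rows, cols = X.shape
print(rows * cols * 8 / 1024**3)  # float64 is 8 bytes per cell -> ~61.4 GiB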



Solution 1:[1]

You are running low on memory, but there is a chance of getting it done by changing the dtype from float64 to uint8, which stores one byte per cell instead of eight. Please try this and let me know if it still throws the same error.

import numpy as np

# Cast the sparse matrix to uint8 before densifying: 1 byte per cell instead of 8
df = pd.DataFrame(X.astype(np.uint8).toarray())
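Note that tf-idf weights are fractions below 1 under the default l2 normalization, so the uint8 cast truncates most of them to zero; if the actual weights matter, pandas can wrap the sparse matrix directly instead of densifying it. A minimal sketch, assuming pandas >= 1.0 and a scikit-learn version that provides get_feature_names_out:

import pandas as pd

# Keep the scipy CSR matrix sparse inside the DataFrame; only the
# non-zero tf-idf weights are stored, so no 61.4 GiB allocation happens.
df = pd.DataFrame.sparse.from_spmatrix(X, columns=tfidf.get_feature_names_out())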

Solution 2:[2]

We can limit how much data is processed at once, for instance by windowing the input to the first 1000 records:

# Creating a document-term matrix
# I'll use the word matrix as a different view on the data
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer='word')
# Only the first 1000 records are fed into the vectorizer because of limited RAM
data = cv.fit_transform(df_grouped['lemmatized_h'][:1000])

df_dtm = pd.DataFrame(data.toarray(), columns=cv.get_feature_names_out())
# Optionally apply Solution 1's dtype trick here as well:
# df_dtm = pd.DataFrame(data.astype(np.uint8).toarray(), columns=cv.get_feature_names_out())
df_dtm.index = df_grouped.index[:1000]
df_dtm.head(3)
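If the remaining records are needed as well, the same windowing idea extends to a loop over fixed-size chunks. A sketch under the assumption that the vocabulary is fitted once on the full column and each chunk's frame is persisted or aggregated before the next one is built:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer='word')
cv.fit(df_grouped['lemmatized_h'])  # fit builds only the vocabulary, which is cheap

chunk_size = 1000
for start in range(0, len(df_grouped), chunk_size):
    chunk = df_grouped['lemmatized_h'].iloc[start:start + chunk_size]
    data = cv.transform(chunk)  # sparse counts for this window only
    df_dtm = pd.DataFrame(data.toarray(), columns=cv.get_feature_names_out())
    df_dtm.index = df_grouped.index[start:start + chunk_size]
    # ... write df_dtm to disk or aggregate it here before the next chunk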

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Amar Singh
[2] Solution 2: Teddy