Unable to convert sparse matrix to dense matrix due to lack of memory
I am building a machine learning model for text classification on a large dataset of ~200,000 records. I am using TfidfVectorizer, which outputs a sparse matrix; however, GaussianNB only accepts a dense matrix. When I try to convert the sparse matrix to a dense one, I get this error:
MemoryError: Unable to allocate 104. GiB for an array with shape (200853, 69207) and data type float64
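For reference, the figure in the error message can be reproduced from the shape and dtype it reports. This is only a back-of-the-envelope check, assuming 8 bytes per float64 value and 2**30 bytes per GiB:

```python
# Shape and dtype taken directly from the MemoryError message above.
rows, cols = 200853, 69207
bytes_per_float64 = 8

# A dense array stores every cell, including the zeros.
dense_bytes = rows * cols * bytes_per_float64
dense_gib = dense_bytes / 2**30

print(f"{dense_bytes:,} bytes = {dense_gib:.1f} GiB")
# prints "111,203,468,568 bytes = 103.6 GiB"
```

NumPy rounds this to "104. GiB" in the message. The sparse matrix avoids the cost because it stores only the non-zero entries.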
The code:
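import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

# Load dataset from Headlines.xlsx and keep only the needed columns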
print("Loading dataset...")
dataset = pd.read_excel("Headlines.xlsx")
dataset = dataset.loc[:, ['category', 'headline', 'authors', 'short_description']]
print("Dataset loaded.")
# Preprocess text
print("Processing headlines...")
dataset['headline'] = dataset['headline'].apply(lambda x: text_preprocessing(str(x)))
print("Headlines processed.")
print("Processing descriptions...")
dataset['short_description'] = dataset['short_description'].apply(lambda x: text_preprocessing(str(x)))
print("Descriptions processed.")
# Encode authors
label_encoder = preprocessing.LabelEncoder()
print("Encoding authors...")
authors_encoded = label_encoder.fit_transform(dataset['authors'])
print("Authors encoded.")
# Headline and description vectorisation
tfidf_vectoriser = TfidfVectorizer()
print("Vectorising headlines...")
headline_vectors = tfidf_vectoriser.fit_transform(np.array(dataset['headline'])).toarray()
print("Headlines vectorised.")
print("Vectorising descriptions...")
short_description_vectors = tfidf_vectoriser.fit_transform(np.array(dataset['short_description'])).toarray()
print("Descriptions vectorised.")
The error is produced at the line: headline_vectors = tfidf_vectoriser.fit_transform(np.array(dataset['headline'])).toarray()
Is there a way to fix this, or a better way of doing it altogether?
Thanks, Alfie
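One possible direction, offered as a minimal sketch and not a definitive fix: drop the .toarray() calls and keep everything sparse. The sketch swaps GaussianNB for MultinomialNB, which is a different model but accepts sparse input, and combines the two feature sets with scipy.sparse.hstack. The toy lists are hypothetical stand-ins for the real dataset columns, and using a separate vectoriser per column is an assumption about the intent.

```python
from scipy.sparse import hstack, issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for dataset['headline'],
# dataset['short_description'] and dataset['category'].
headlines = ["stocks rally on earnings", "team wins cup final",
             "markets slide on rate fears", "striker signs new deal"]
descriptions = ["shares rose after strong results", "a late goal decided the match",
                "investors worry about interest rates", "the club confirmed the contract"]
categories = ["BUSINESS", "SPORTS", "BUSINESS", "SPORTS"]

# One vectoriser per text column, so each keeps its own vocabulary.
headline_vectoriser = TfidfVectorizer()
description_vectoriser = TfidfVectorizer()

# No .toarray(): both results stay as SciPy sparse matrices.
headline_vectors = headline_vectoriser.fit_transform(headlines)
description_vectors = description_vectoriser.fit_transform(descriptions)

# Stack the two feature blocks side by side, still sparse.
features = hstack([headline_vectors, description_vectors]).tocsr()

# MultinomialNB (swapped in for GaussianNB) can be fitted on sparse input.
model = MultinomialNB()
model.fit(features, categories)
predictions = model.predict(features)
print(issparse(features), features.shape[0], len(predictions))
```

Whether MultinomialNB suits the actual data would need to be checked on a held-out split. The point here is only that no dense array is ever allocated.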
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
