Unable to convert sparse matrix to dense matrix due to lack of memory

I am building a machine learning model for text classification on a large dataset of ~200,000 records. I am using TfidfVectorizer, which outputs a sparse matrix, but GaussianNB only accepts a dense matrix. When I try to convert the sparse matrix to a dense one, I get this error:

MemoryError: Unable to allocate 104. GiB for an array with shape (200853, 69207) and data type float64
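
For context, the size in the message appears to match a fully dense array of that shape. A quick back-of-the-envelope check, assuming 8 bytes per float64 element and 1 GiB = 2**30 bytes (the variable names are only for illustration):

```python
# Shape reported in the MemoryError above.
rows, cols = 200853, 69207

# Assumption: float64 uses 8 bytes per element.
bytes_per_float64 = 8

# Memory needed if every cell, including the zeros, is stored.
dense_bytes = rows * cols * bytes_per_float64
dense_gib = dense_bytes / 2**30

print(f"Dense array would need about {dense_gib:.1f} GiB")
```

So the allocation that fails is simply the full rows x cols grid. A sparse matrix avoids it by storing only the non-zero entries.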

The code:

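import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

# Loads dataset from Headlines.xlsx and removes unnecessary columns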
print("Loading dataset...")
dataset = pd.read_excel("Headlines.xlsx")
dataset = dataset.loc[:, ['category', 'headline', 'authors', 'short_description']]
print("Dataset loaded.")

# Preprocess text
print("Processing headlines...")
dataset['headline'] = dataset['headline'].apply(lambda x: text_preprocessing(str(x)))
print("Headlines processed.")
print("Processing descriptions...")
dataset['short_description'] = dataset['short_description'].apply(lambda x: text_preprocessing(str(x)))
print("Descriptions processed.")

# Encode authors
label_encoder = preprocessing.LabelEncoder()
print("Encoding authors...")
authors_encoded = label_encoder.fit_transform(dataset['authors'])
print("Authors encoded.")

# Headline and description vectorisation
tfidf_vectoriser = TfidfVectorizer()
print("Vectorising headlines...")
headline_vectors = tfidf_vectoriser.fit_transform(np.array(dataset['headline'])).toarray()
print("Headlines vectorised.")
print("Vectorising descriptions...")
short_description_vectors = tfidf_vectoriser.fit_transform(np.array(dataset['short_description'])).toarray()
print("Descriptions vectorised.")

The error is produced at the line headline_vectors = tfidf_vectoriser.fit_transform(np.array(dataset['headline'])).toarray()

Is there a way to fix this, or a better way of doing it altogether?

Thanks, Alfie
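
One possible direction, given as a sketch under stated assumptions and not a definitive fix: skip `.toarray()` and keep the TF-IDF output sparse. This means swapping GaussianNB for an estimator that accepts sparse input. The sketch uses MultinomialNB, a different model from the one in the question, so results may differ. The toy lists stand in for the real `Headlines.xlsx` columns and are made up for illustration. A separate vectoriser is used per text column so that the second `fit_transform` does not overwrite the vocabulary learned by the first.

```python
from scipy.sparse import hstack, issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the real dataset columns.
headlines = [
    "team wins championship final",
    "new phone released this week",
    "striker scores late goal",
    "laptop sales rise sharply",
]
descriptions = [
    "the home side lifted the trophy",
    "the device ships with a faster chip",
    "a dramatic finish to the match",
    "demand for computers keeps growing",
]
categories = ["sport", "tech", "sport", "tech"]

# One vectoriser per column, so each keeps its own vocabulary.
headline_vectoriser = TfidfVectorizer()
description_vectoriser = TfidfVectorizer()

# No .toarray(): both results stay as SciPy sparse matrices.
headline_vectors = headline_vectoriser.fit_transform(headlines)
description_vectors = description_vectoriser.fit_transform(descriptions)

# Combine the two feature blocks column-wise, still sparse.
features = hstack([headline_vectors, description_vectors]).tocsr()

# MultinomialNB (swapped in for GaussianNB) accepts sparse input.
model = MultinomialNB()
model.fit(features, categories)
predictions = model.predict(features)
print(features.shape, predictions)
```

If GaussianNB is a hard requirement, the feature count would probably need to be reduced first, for example with the `max_features` parameter of TfidfVectorizer, so that a dense array fits in memory.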



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
