numpy.core._exceptions.MemoryError: Unable to allocate 75.2 GiB for an array with shape (1, 10086709500) and data type float64
I am training an IsolationForest for anomaly detection in Amazon SageMaker with the following code:
```python
from sklearn.ensemble import IsolationForest

def build_model(df, no_of_estimators, maximum_samples, error_percent, maximum_features):
    # columns_to_encode is defined elsewhere in the script
    encoder = MultiColumnLabelEncoder(columns=columns_to_encode)
    df = df.dropna()
    df = encoder.fit_transform(df)
    print(df.head())
    print(df.tail())
    print(df.shape)
    # note: maximum_features is not used here; max_features is set to the column count
    isf = IsolationForest(n_estimators=no_of_estimators, max_samples=maximum_samples,
                          contamination=error_percent, max_features=df.shape[1],
                          bootstrap=False, n_jobs=6, verbose=2, random_state=42)
    isf.fit(df)
    df['outlier'] = isf.predict(df)
    df['score'] = isf.decision_function(df.drop(['outlier'], axis=1))
    df = encoder.inverse_transform(df)
    return isf, df
```
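`MultiColumnLabelEncoder` is not part of scikit-learn; it is presumably a custom wrapper along these lines (a hypothetical sketch; the actual class in the script may differ):

```python
from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder:
    """Hypothetical sketch: label-encode the given columns of a DataFrame."""
    def __init__(self, columns):
        self.columns = columns
        self.encoders = {}

    def fit_transform(self, df):
        df = df.copy()
        for col in self.columns:
            enc = LabelEncoder()
            df[col] = enc.fit_transform(df[col])
            self.encoders[col] = enc
        return df

    def inverse_transform(self, df):
        df = df.copy()
        for col in self.columns:
            df[col] = self.encoders[col].inverse_transform(df[col])
        return df
```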
But I am getting the following error:
```
Traceback (most recent call last):
  File "train_script_new.py", line 127, in <module>
    isf, prediction_df = build_model(dataset, no_of_estimators, maximum_samples,
                                     error_percent, maximum_features)
  File "train_script_new.py", line 102, in build_model
    isf.fit(df)
  File "/miniconda3/lib/python3.7/site-packages/sklearn/ensemble/iforest.py", line 274, in fit
    self._threshold_ = np.percentile(self.decision_function(X),
  File "/miniconda3/lib/python3.7/site-packages/sklearn/ensemble/iforest.py", line 345, in decision_function
    return self.score_samples(X) - self.offset_
  File "/miniconda3/lib/python3.7/site-packages/sklearn/ensemble/iforest.py", line 403, in score_samples
    depths += _average_path_length(n_samples_leaf)
  File "/miniconda3/lib/python3.7/site-packages/sklearn/ensemble/iforest.py", line 448, in _average_path_length
    average_path_length = np.zeros(n_samples_leaf.shape)
numpy.core._exceptions.MemoryError: Unable to allocate 75.2 GiB for an array with shape (1, 10086709500) and data type float64
```
The shape of the dataset I am passing is (20226508, 5), but the error reports (1, 10086709500). Is this because something happening internally in the algorithm creates this array?
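For scale, the requested allocation matches the error message exactly, and dividing the element count by the 500 estimators lands close to the post-`dropna()` row count, which suggests (my reading of the traceback, not confirmed against the source) that this version of `score_samples` materializes roughly one float64 per sample per tree:

```python
n_elements = 10_086_709_500          # element count from the error message
print(n_elements * 8 / 1024**3)      # 75.15... GiB -> the "75.2 GiB" (float64 = 8 bytes)
print(n_elements / 500)              # 20_173_419.0 -> close to 20_226_508 rows minus NaN rows
```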
These are the last three lines of the logs printed by the model:
```
Building estimator 500 of 500 for this parallel run (total 500)...
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 6.0min remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 6.0min finished
```
How can I get past this error? I have tried training the model on a larger instance, which didn't work; currently I am training on an ml.m4.16xlarge instance. Please help.
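For what it's worth, the traceback shows that `fit()` itself scores the whole training set (the `np.percentile` call) to derive a threshold when `contamination` is a float. One commonly suggested direction, sketched below under the assumption of a reasonably recent scikit-learn (where `contamination='auto'` is available), is to skip that step and score in manageable batches. Upgrading scikit-learn is worth trying first anyway, since the 0.20-era `iforest.py` in the traceback predates memory-usage improvements to IsolationForest scoring in later releases. The chunk size below is an arbitrary illustration, not a tuned value:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# contamination='auto' avoids the full-dataset scoring inside fit()
# (the np.percentile call in the traceback above)
isf = IsolationForest(n_estimators=no_of_estimators, max_samples=maximum_samples,
                      contamination='auto', bootstrap=False, n_jobs=6,
                      verbose=2, random_state=42)
isf.fit(df)

chunk = 1_000_000  # arbitrary; tune to the instance's memory
scores = np.concatenate([isf.decision_function(df.iloc[i:i + chunk])
                         for i in range(0, len(df), chunk)])
df['score'] = scores
df['outlier'] = np.where(scores < 0, -1, 1)  # mirrors predict() when contamination='auto'
```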
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow