numpy.core._exceptions.MemoryError: Unable to allocate 75.2 GiB for an array with shape (1, 10086709500) and data type float64

I am training IsolationForest for anomaly detection in Amazon SageMaker using the following code:

from sklearn.ensemble import IsolationForest

def build_model(df, no_of_estimators, maximum_samples, error_percent, maximum_features):
    # MultiColumnLabelEncoder is a custom helper that label-encodes the
    # categorical columns listed in columns_to_encode.
    encoder = MultiColumnLabelEncoder(columns=columns_to_encode)
    df = df.dropna()
    df = encoder.fit_transform(df)
    print(df.head())
    print(df.tail())
    print(df.shape)
    isf = IsolationForest(n_estimators=no_of_estimators, max_samples=maximum_samples,
                          contamination=error_percent, max_features=df.shape[1],
                          bootstrap=False, n_jobs=6, verbose=2, random_state=42)
    isf.fit(df)
    df['outlier'] = isf.predict(df)
    df['score'] = isf.decision_function(df.drop(['outlier'], axis=1))
    df = encoder.inverse_transform(df)
    return isf, df

But I am getting the following error:

Traceback (most recent call last):
  File "train_script_new.py", line 127, in <module>
    isf, prediction_df = build_model(dataset, no_of_estimators, maximum_samples, error_percent, maximum_features)
  File "train_script_new.py", line 102, in build_model
    isf.fit(df)
  File "/miniconda3/lib/python3.7/site-packages/sklearn/ensemble/iforest.py", line 274, in fit
    self._threshold_ = np.percentile(self.decision_function(X),
  File "/miniconda3/lib/python3.7/site-packages/sklearn/ensemble/iforest.py", line 345, in decision_function
    return self.score_samples(X) - self.offset_
  File "/miniconda3/lib/python3.7/site-packages/sklearn/ensemble/iforest.py", line 403, in score_samples
    depths += _average_path_length(n_samples_leaf)
  File "/miniconda3/lib/python3.7/site-packages/sklearn/ensemble/iforest.py", line 448, in _average_path_length
    average_path_length = np.zeros(n_samples_leaf.shape)
numpy.core._exceptions.MemoryError: Unable to allocate 75.2 GiB for an array with shape (1, 10086709500) and data type float64

The shape of the dataset I am passing is df.shape = (20226508, 5), but the error reports an array of shape (1, 10086709500). Is something happening internally in the algorithm that creates this array?
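For what it's worth, the huge number looks like it could be roughly the number of rows times the number of trees. This is just my own back-of-the-envelope check (assuming dropna() removes around 53k rows), not something I have confirmed in the sklearn source:

# Rough sanity check (my own guess, not verified against the sklearn source):
# the flat size of the failing array divided by n_estimators is close to
# the number of rows left after dropna().
failing_size = 10_086_709_500      # from the MemoryError
n_estimators = 500                 # from the training logs
rows_before_dropna = 20_226_508    # df.shape[0] that I pass in

print(failing_size / n_estimators)                         # 20173419.0
print(rows_before_dropna - failing_size // n_estimators)   # 53089, possibly the rows dropped by dropna()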

These are the last three lines of the logs printed by the model:

 Building estimator 500 of 500 for this parallel run (total 500)...
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  6.0min remaining: 0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  6.0min finished

How can I get around this error? I have tried training the model on a larger instance, which didn't work; currently I am training on an ml.m4.16xlarge instance. Please help.
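One workaround I am considering (just a sketch on my side, I have not verified that it keeps the results I need) is to fit on a random sample of rows and then score the full frame in smaller chunks, so that no single call has to build an array over all 20M rows at once:

# Sketch of a possible workaround (my own idea, not a confirmed fix):
# fit the forest on a sample of the encoded frame, then run
# predict/decision_function over the full frame chunk by chunk.
import numpy as np

def fit_and_score_in_chunks(isf, df, fit_sample_size=2_000_000, chunk_size=1_000_000):
    # Fit on a random subset so the scoring done inside fit() only
    # sees fit_sample_size rows instead of the full frame.
    sample = df.sample(n=min(fit_sample_size, len(df)), random_state=42)
    isf.fit(sample)

    # Score the full frame in chunks to keep memory bounded.
    outliers, scores = [], []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        outliers.append(isf.predict(chunk))
        scores.append(isf.decision_function(chunk))
    return np.concatenate(outliers), np.concatenate(scores)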



Sources

Source: Stack Overflow, licensed under CC BY-SA 3.0 in accordance with Stack Overflow's attribution requirements.
