RFECV machine learning feature selection taking far too long (Python)

I am relatively new to SKLearn and have a question about Feature Selection.

I am trying to build an SVM model. My data has around 30 features and about 10k data points, and I'm currently trying to eliminate as many useless features as I can. I have first dropped features that are highly correlated with other features, and now want to use RFECV to optimize the remaining ones.

To begin with, I found this code on the sklearn website. I have a couple of issues with it and wondered if anyone could help.

from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

X = df.drop(['label'], axis=1)
y = df['label']

rfe = RFECV(SVR(kernel='linear'), step=1, scoring='accuracy')
rfe.fit(X, y)
print(rfe.ranking_)

Firstly, if I run this as it is, it takes forever; I've left it for ages and haven't managed to get it to complete yet. However, if I remove kernel='linear' it runs reasonably quickly, but then produces an error message, which appears to come from rfe.fit(X, y):

RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes

I've cut my features down to about 10, just to see if I can speed things up, and have also played around with the step parameter, but with kernel='linear' in there nothing seems to help; it just runs for hours without finishing. All I want is a list of features to use, obtained via the RFECV method. Does anyone have any suggestions for what I'm doing wrong, or what I can do to speed things up?

Many thanks



Solution 1:[1]

  1. The reason it runs quickly when you remove kernel='linear' is that it fails quickly: the default kernel is 'rbf', which does not expose the attribute RFECV needs, hence the RuntimeError.
  2. Only SVR(kernel='linear') exposes a coef_ attribute that RFECV can use. With any other kernel no coef_ is available, so RFECV cannot rank the features.
  3. By setting step=1, you're forcing RFECV(SVR(kernel='linear'), step=1, scoring='accuracy') to fit on all n features, exclude the one with the lowest coefficient, fit again on n-1 features, exclude the lowest again, and so on. This is time-consuming.
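The coef_ point above can be checked directly. A minimal sketch with toy data (random arrays standing in for the real dataset):

```python
import numpy as np
from sklearn.svm import SVR

# Toy data: 50 rows, 5 features, standing in for the real dataset.
rng = np.random.RandomState(0)
X_toy, y_toy = rng.rand(50, 5), rng.rand(50)

linear_svr = SVR(kernel='linear').fit(X_toy, y_toy)
rbf_svr = SVR(kernel='rbf').fit(X_toy, y_toy)

print(hasattr(linear_svr, 'coef_'))  # True: a linear kernel exposes coefficients
print(hasattr(rbf_svr, 'coef_'))     # False: RFECV has nothing to rank
```

Accessing coef_ on a non-linear SVR raises AttributeError, which is what RFECV's error message is reporting.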

Try this to speed up the process:

RFECV(SVR(kernel='linear'), step=5, scoring='accuracy', min_features_to_select=10)

This should be significantly faster. Adjust step and min_features_to_select to your specific needs.
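As a hypothetical end-to-end sketch of this recipe, using synthetic classification data and SVC(kernel='linear') — since scoring='accuracy' is a classification metric, and a linear SVC exposes coef_ the same way SVR does:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

# Synthetic stand-in for the asker's data: 30 features, only 8 informative.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           random_state=0)

# step=5 removes 5 features per iteration, and min_features_to_select=10
# stops the elimination early, so far fewer model fits are required.
rfe = RFECV(SVC(kernel='linear'), step=5, scoring='accuracy',
            min_features_to_select=10)
rfe.fit(X, y)

print(rfe.n_features_)  # number of features RFECV decided to keep
print(rfe.support_)     # boolean mask of the selected columns
```

The list of selected feature names the asker wants is then X.columns[rfe.support_] when X is a DataFrame.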

Solution 2:[2]

SVMs are sensitive to feature scales, so SVR works better with scaled/normalized data. Try this:

from sklearn.feature_selection import RFECV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

X = df.drop(['label'], axis=1)
X_norm = MinMaxScaler().fit_transform(X)
y = df['label']

rfe = RFECV(SVR(kernel='linear'), step=1, scoring='accuracy')
rfe.fit(X_norm, y)
print(rfe.ranking_)

In my personal experience this speeds things up quite a bit.

Alternatively, you can use StandardScaler to scale your features: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
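The StandardScaler variant is a one-line swap. A minimal sketch on synthetic classification data (using SVC(kernel='linear') so that scoring='accuracy' applies; the data and sizes here are placeholders for the real df):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic classification data standing in for the real DataFrame.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Standardize to zero mean / unit variance, then run RFECV on the scaled data.
X_std = StandardScaler().fit_transform(X)
rfe = RFECV(SVC(kernel='linear'), step=2, scoring='accuracy')
rfe.fit(X_std, y)

print(rfe.ranking_)  # rank 1 marks the selected features
```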

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Elias Laura Wiegandt
Solution 2: Chris