RFECV machine learning feature selection taking far too long Python
I am relatively new to SKLearn and have a question about Feature Selection.
I am trying to build an SVM model. My data has around 30 features across about 10k data points, and I'm currently trying to eliminate as many useless features as I can. I have already dropped features that are highly correlated with other features, and now want to use RFECV to optimize the remaining ones.
To begin with, I found this code on the sklearn website. I have a couple of issues with it and wondered if anyone could help.
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

X = df.drop(['label'], axis=1)
y = df['label']
rfe = RFECV(SVR(kernel='linear'), step=1, scoring='accuracy')
rfe.fit(X, y)
print(rfe.ranking_)
Firstly, if I run this as it is, it takes forever to run; I've left it for ages and haven't actually managed to get it to complete yet. However, if I remove kernel='linear' it runs reasonably quickly, but then produces an error message, which appears to come from rfe.fit(X, y):
RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
I've cut my features down to about 10, just to see if I can speed things up as a test, and have also played around with the step parameter, but with kernel='linear' in there nothing seems to help: it just runs for hours without producing anything. All I want is a list of features to use, selected via the RFECV method. Does anyone have any suggestions for what I'm doing wrong or what I can do to speed things up?
Many thanks
Solution 1:[1]
- The reason it runs quickly when you remove kernel='linear' is that it fails quickly.
- Only SVR(kernel='linear') exposes a coef_ attribute that RFECV can use. With any other kernel, no coef_ is exposed, so RFECV cannot work with the estimator.
- By setting step=1, you're forcing RFECV(SVR(kernel='linear'), step=1, scoring='accuracy') to fit on all n features, exclude the one with the lowest coefficient, fit again on n-1 features, exclude the one with the lowest coefficient again, and so on. This is time-consuming.
Try this to speed up the process:
RFECV(SVR(kernel='linear'), step=5, scoring='accuracy', min_features_to_select=10)
This should be significantly faster. Adjust step and min_features_to_select to your specific needs.
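Putting the suggestion together, here is a minimal runnable sketch. It uses synthetic data from make_classification as a stand-in for the real DataFrame, and it assumes the label is categorical (scoring='accuracy' implies classification), so SVC(kernel='linear') is used in place of SVR; for a continuous target, keep SVR with a regression metric such as 'r2' instead.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

# Synthetic stand-in for the real data: 1000 rows, 30 features
X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=8, random_state=0)

# Drop 5 features per iteration and stop once 10 remain,
# instead of refitting after every single elimination
rfe = RFECV(SVC(kernel='linear'), step=5, scoring='accuracy',
            min_features_to_select=10)
rfe.fit(X, y)

print(rfe.n_features_)  # number of features kept
print(rfe.support_)     # boolean mask of the selected columns
```

With step=5, each round removes five features at once, so far fewer model fits are needed than with step=1.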
Solution 2:[2]
SVR models work better with scaled/normalized data. Try this:
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
X = df.drop(['label'], axis=1)
X_norm = MinMaxScaler().fit_transform(X)
y = df['label']
rfe = RFECV(SVR(kernel='linear'), step=1, scoring='accuracy')
rfe.fit(X_norm, y)
print(rfe.ranking_)
This should speed things up quite a bit from my personal experience.
Alternatively, I believe you can also use StandardScaler to scale your features: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
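For reference, a minimal sketch of the StandardScaler variant on synthetic data (an assumption here, standing in for the real DataFrame; SVC is also assumed because accuracy scoring implies a classification label):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

# Standardize each column to zero mean and unit variance before fitting
X_std = StandardScaler().fit_transform(X)

rfe = RFECV(SVC(kernel='linear'), step=1, scoring='accuracy')
rfe.fit(X_std, y)
print(rfe.ranking_)  # rank 1 marks the selected features
```

Note that the scaler is fit outside of RFECV here, as in the MinMaxScaler example above; RFECV needs its estimator to expose coef_ directly, so the scaling step cannot simply be wrapped around the SVM inside it.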
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Elias Laura Wiegandt |
| Solution 2 | Chris |
