Inspecting feature importance in scikit-learn pipelines
I have defined the following pipelines using scikit-learn:
```python
model_lg = Pipeline([("preprocessing", StandardScaler()), ("classifier", LogisticRegression())])
model_dt = Pipeline([("preprocessing", StandardScaler()), ("classifier", DecisionTreeClassifier())])
model_gb = Pipeline([("preprocessing", StandardScaler()), ("classifier", HistGradientBoostingClassifier())])
```
Then I used cross validation to evaluate the performance of each model:
```python
cv_results_lg = cross_validate(model_lg, data, target, cv=5, return_train_score=True, return_estimator=True)
cv_results_dt = cross_validate(model_dt, data, target, cv=5, return_train_score=True, return_estimator=True)
cv_results_gb = cross_validate(model_gb, data, target, cv=5, return_train_score=True, return_estimator=True)
```
When I try to inspect the feature importance for each model via the coef_ attribute, I get an AttributeError:
```python
model_lg.steps[1][1].coef_
# AttributeError: 'LogisticRegression' object has no attribute 'coef_'

model_dt.steps[1][1].coef_
# AttributeError: 'DecisionTreeClassifier' object has no attribute 'coef_'

model_gb.steps[1][1].coef_
# AttributeError: 'HistGradientBoostingClassifier' object has no attribute 'coef_'
```
How can I fix this error? Or is there another approach to inspecting the feature importance of each model?
Solution 1:[1]
The key point here is the following. On the one hand, the pipeline instances model_lg, model_dt, etc. are never fitted explicitly (you never call .fit() on them directly), so they carry no fitted attributes such as coef_, and accessing them on the instances themselves fails.

On the other hand, by calling cross_validate() with return_estimator=True (an option that, among the cross-validation helpers, only cross_validate() offers), you get back the estimator fitted on each CV split. These fitted estimators live in the returned dictionaries cv_results_lg, cv_results_dt, etc., under the 'estimator' key. Here's an example:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model_lg = Pipeline([("preprocessing", StandardScaler()), ("classifier", LogisticRegression())])
cv_results_lg = cross_validate(model_lg, X, y, cv=5, return_train_score=True, return_estimator=True)
```
For instance, these are the coefficients learned on the first fold:

```python
cv_results_lg['estimator'][0].named_steps['classifier'].coef_
```
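Note that the same pattern works for the other two pipelines, but those classifiers never expose coef_, not even after fitting: tree-based models report importance through the feature_importances_ attribute instead, and HistGradientBoostingClassifier exposes neither, so the model-agnostic permutation_importance from sklearn.inspection is the usual fallback. A minimal sketch for the decision-tree pipeline (the same iris data as above, used here purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

model_dt = Pipeline([("preprocessing", StandardScaler()),
                     ("classifier", DecisionTreeClassifier())])
cv_results_dt = cross_validate(model_dt, X, y, cv=5, return_estimator=True)

# Tree-based models expose feature_importances_ instead of coef_
first_fold_dt = cv_results_dt["estimator"][0]
print(first_fold_dt.named_steps["classifier"].feature_importances_)

# permutation_importance works for any fitted estimator, including a
# whole Pipeline, so it also covers HistGradientBoostingClassifier
r = permutation_importance(first_fold_dt, X, y, n_repeats=5, random_state=0)
print(r.importances_mean)
```

For an honest estimate, permutation importance should ideally be computed on data the estimator was not trained on, e.g. the test portion of each fold.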
Solution 2:[2]
Put the algorithms in a for loop, cross-validate each one, and print its accuracy.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Sohrab Mohammadi |
