How to perform cross-validation stratified by a feature while also performing feature selection?
Let's say I have the following data (it doesn't make sense in terms of classification, it's just to illustrate the case):
import pandas as pd
import numpy as np

size = 1000
data = pd.DataFrame()
data['group'] = np.random.choice(['a', 'b', 'c'], size=size)    # feature to stratify by
data['feature_a'] = np.random.binomial(n=1, p=0.05, size=size)  # rare binary feature
data['feature_b'] = np.random.randn(size)                       # continuous feature
data['label'] = np.random.binomial(n=1, p=0.3, size=size)       # binary target
Now, let's say I want to perform cross-validation on this dataset to build a LogisticRegression model, using sklearn.
I want the folds to be stratified by the group column/feature. So it would make sense to me to use the StratifiedKFold CV method, passing data['group'] as the y argument to .split().
That works if I write the loop myself and iterate over the generated splits. But let's say I also want to perform feature selection, using, for example, sklearn.feature_selection.RFECV. It has a cv parameter where I can specify the CV method, but internally it passes the label column as y to the splitter, so the folds end up stratified by the label.
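For reference, the plain-loop version I have in mind looks something like this (the names X, y, skf and the 5-fold/seed settings are just my choices, not anything prescribed):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

size = 1000
data = pd.DataFrame()
data['group'] = np.random.choice(['a', 'b', 'c'], size=size)
data['feature_a'] = np.random.binomial(n=1, p=0.05, size=size)
data['feature_b'] = np.random.randn(size)
data['label'] = np.random.binomial(n=1, p=0.3, size=size)

X = data[['feature_a', 'feature_b']]
y = data['label']

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
# Pass data['group'] (not the label) as the second argument of .split(),
# so each fold preserves the a/b/c proportions of 'group'.
for train_idx, test_idx in skf.split(X, data['group']):
    model = LogisticRegression().fit(X.iloc[train_idx], y.iloc[train_idx])
    scores.append(model.score(X.iloc[test_idx], y.iloc[test_idx]))
```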
So how can I achieve this "pipeline"? Feature selection using stratified cross validation, but stratifying by a feature, not the label/class column.
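The closest workaround I can think of: if I read the docs correctly, sklearn's cv parameters also accept an iterable of precomputed (train, test) index pairs, so I could materialize the group-stratified splits as a list and hand that to RFECV. A minimal sketch of that idea (again, the variable names and fold settings are mine):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

size = 1000
data = pd.DataFrame()
data['group'] = np.random.choice(['a', 'b', 'c'], size=size)
data['feature_a'] = np.random.binomial(n=1, p=0.05, size=size)
data['feature_b'] = np.random.randn(size)
data['label'] = np.random.binomial(n=1, p=0.3, size=size)

X = data[['feature_a', 'feature_b']]
y = data['label']

# Materialize the group-stratified folds up front...
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
splits = list(skf.split(X, data['group']))

# ...then pass the precomputed (train, test) index pairs to RFECV;
# its cv parameter accepts an iterable of splits, so RFECV never
# re-stratifies by the label.
rfecv = RFECV(estimator=LogisticRegression(), cv=splits)
rfecv.fit(X, y)
```

But I'm not sure whether this is the intended way to do it, or whether there is a cleaner pipeline-style solution.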
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
