StratifiedKFold vs KFold in scikit-learn
I use this code to test KFold and StratifiedKFold.
import numpy as np
from sklearn.model_selection import KFold,StratifiedKFold
X = np.array([
[1,2,3,4],
[11,12,13,14],
[21,22,23,24],
[31,32,33,34],
[41,42,43,44],
[51,52,53,54],
[61,62,63,64],
[71,72,73,74]
])
y = np.array([0,0,0,0,1,1,1,1])
# random_state only has an effect when shuffle=True; recent scikit-learn
# versions raise an error if it is set while shuffle=False, so it is omitted here
sfolder = StratifiedKFold(n_splits=4, shuffle=False)
kfolder = KFold(n_splits=4, shuffle=False)
for train, test in sfolder.split(X, y):
    print('Train: %s | test: %s' % (train, test))
print("StratifiedKFold done")
for train, test in kfolder.split(X, y):
    print('Train: %s | test: %s' % (train, test))
print("KFold done")
From the output below, I found that StratifiedKFold keeps the proportion of labels in every fold, but KFold does not:
Train: [1 2 3 5 6 7] | test: [0 4]
Train: [0 2 3 4 6 7] | test: [1 5]
Train: [0 1 3 4 5 7] | test: [2 6]
Train: [0 1 2 4 5 6] | test: [3 7]
StratifiedKFold done
Train: [2 3 4 5 6 7] | test: [0 1]
Train: [0 1 4 5 6 7] | test: [2 3]
Train: [0 1 2 3 6 7] | test: [4 5]
Train: [0 1 2 3 4 5] | test: [6 7]
KFold done
It seems that StratifiedKFold is better, so should KFold not be used?
When to use KFold instead of StratifiedKFold?
Solution 1:[1]
Assume a classification problem with 3 classes (A, B, C) to predict.

| Class | No. of instances |
|---|---|
| A | 50 |
| B | 50 |
| C | 50 |
**StratifiedKFold**
If the dataset is divided into 5 folds, each fold contains 10 instances from each class, i.e. the number of instances per class is equal in every fold and matches the uniform class distribution of the full dataset.
**KFold**
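KFold ignores the labels. Each fold simply takes 30 instances (consecutive ones by default, randomly chosen ones with shuffle=True), so the number of instances per class in a fold may or may not be equal.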
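The fold composition described above can be checked directly. The following is a minimal sketch, assuming a synthetic label array of 150 samples sorted by class; the arrays `X` and `y` and the helper `class_counts_per_fold` are made up for illustration and are not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical data: 50 samples each of classes 0 (A), 1 (B), 2 (C), sorted by class.
y = np.repeat([0, 1, 2], 50)
X = np.zeros((150, 1))  # feature values do not influence how the folds are built

def class_counts_per_fold(splitter, X, y):
    """Return [count_A, count_B, count_C] for the test part of each fold."""
    return [np.bincount(y[test], minlength=3).tolist()
            for _, test in splitter.split(X, y)]

stratified_counts = class_counts_per_fold(StratifiedKFold(n_splits=5), X, y)
plain_counts = class_counts_per_fold(KFold(n_splits=5), X, y)
shuffled_counts = class_counts_per_fold(
    KFold(n_splits=5, shuffle=True, random_state=0), X, y)

print("StratifiedKFold:", stratified_counts)
print("KFold (no shuffle):", plain_counts)
print("KFold (shuffled):", shuffled_counts)
```

With labels sorted like this, unshuffled KFold produces test folds that are dominated by a single class, while StratifiedKFold gives 10 instances per class in every fold. Shuffled KFold lands somewhere in between, depending on the random seed.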
**When to use**
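For classification tasks use StratifiedKFold; for regression tasks use KFold, since StratifiedKFold needs discrete class labels to stratify on.
However, if the dataset contains a large number of instances, both StratifiedKFold and KFold can be used for classification, because randomly shuffled folds (shuffle=True) will then usually have class proportions close to those of the full dataset.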
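To illustrate why KFold is the usual choice for regression, here is a small sketch assuming a made-up continuous target (`y_continuous` and the random features are hypothetical). KFold never inspects the target, while StratifiedKFold refuses a target that is not made of class labels.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y_continuous = rng.normal(size=20)  # hypothetical regression target

# KFold only needs the number of samples, so a continuous target is fine.
kfold_sizes = [len(test) for _, test in KFold(n_splits=5).split(X, y_continuous)]
print("KFold test fold sizes:", kfold_sizes)

# StratifiedKFold stratifies on class labels and rejects a continuous target.
try:
    list(StratifiedKFold(n_splits=5).split(X, y_continuous))
    stratified_error = None
except ValueError as exc:
    stratified_error = str(exc)
print("StratifiedKFold error:", stratified_error)
```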
Solution 2:[2]
StratifiedKFold: This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
KFold: Splits the dataset into k consecutive folds (without shuffling by default).
StratifiedKFold is used when the percentage of each class needs to be balanced across the train and test sets. If that is not required, KFold is used.
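The need for balanced class percentages is most visible on imbalanced data. The sketch below assumes a hypothetical label array with 90 negatives followed by 10 positives (the data and the helper `positives_per_test_fold` are invented for illustration) and counts how many positives end up in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives followed by 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # placeholder features

def positives_per_test_fold(splitter):
    """Count the positive samples in the test part of each fold."""
    return [int(y[test].sum()) for _, test in splitter.split(X, y)]

kfold_pos = positives_per_test_fold(KFold(n_splits=5))
strat_pos = positives_per_test_fold(StratifiedKFold(n_splits=5))

print("KFold positives per test fold:", kfold_pos)
print("StratifiedKFold positives per test fold:", strat_pos)
```

In this setup, unshuffled KFold leaves most test folds without any positive sample, which makes per-fold metrics such as recall meaningless there, whereas StratifiedKFold keeps the 10% positive rate in every fold.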
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ankit Singh |
| Solution 2 | ashraful16 |
