'How to split data on balanced training set and test set on sklearn
I am using sklearn for multi-classification task. I need to split alldata into train_set and test_set. I want to take randomly the same sample number from each class. Actually, I amusing this function
X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0)
but it gives unbalanced dataset! Any suggestion.
Solution 1:[1]
You can use StratifiedShuffleSplit to create datasets featuring the same percentage of classes as the original one:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
X = np.array([[1, 3], [3, 7], [2, 4], [4, 8]])
y = np.array([0, 1, 0, 1])
stratSplit = StratifiedShuffleSplit(y, n_iter=1, test_size=0.5, random_state=42)
for train_idx, test_idx in stratSplit:
X_train=X[train_idx]
y_train=y[train_idx]
print(X_train)
# [[3 7]
# [2 4]]
print(y_train)
# [1 0]
Solution 2:[2]
Although Christian's suggestion is correct, technically train_test_split should give you stratified results by using the stratify param.
So you could do:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target)
The trick here is that it starts from version 0.17 in sklearn.
From the documentation about the parameter stratify:
stratify : array-like or None (default is None) If not None, data is split in a stratified fashion, using this as the labels array. New in version 0.17: stratify splitting
Solution 3:[3]
If the classes are not balanced but you want the split to be balanced, then stratifying isn't going to help. There doesn't seem to be a method for doing balanced sampling in sklearn but it's kind of easy using basic numpy, for example a function like this might help you:
def split_balanced(data, target, test_size=0.2):
classes = np.unique(target)
# can give test_size as fraction of input data size of number of samples
if test_size<1:
n_test = np.round(len(target)*test_size)
else:
n_test = test_size
n_train = max(0,len(target)-n_test)
n_train_per_class = max(1,int(np.floor(n_train/len(classes))))
n_test_per_class = max(1,int(np.floor(n_test/len(classes))))
ixs = []
for cl in classes:
if (n_train_per_class+n_test_per_class) > np.sum(target==cl):
# if data has too few samples for this class, do upsampling
# split the data to training and testing before sampling so data points won't be
# shared among training and test data
splitix = int(np.ceil(n_train_per_class/(n_train_per_class+n_test_per_class)*np.sum(target==cl)))
ixs.append(np.r_[np.random.choice(np.nonzero(target==cl)[0][:splitix], n_train_per_class),
np.random.choice(np.nonzero(target==cl)[0][splitix:], n_test_per_class)])
else:
ixs.append(np.random.choice(np.nonzero(target==cl)[0], n_train_per_class+n_test_per_class,
replace=False))
# take same num of samples from all classes
ix_train = np.concatenate([x[:n_train_per_class] for x in ixs])
ix_test = np.concatenate([x[n_train_per_class:(n_train_per_class+n_test_per_class)] for x in ixs])
X_train = data[ix_train,:]
X_test = data[ix_test,:]
y_train = target[ix_train]
y_test = target[ix_test]
return X_train, X_test, y_train, y_test
Note that if you use this and sample more points per class than in the input data, then those will be upsampled (sample with replacement). As a result, some data points will appear multiple times and this may have an effect on the accuracy measures etc. And if some class has only one data point, there will be an error. You can easily check the numbers of points per class for example with np.unique(target, return_counts=True)
Solution 4:[4]
Another approach is to over- or under- sample from your stratified test/train split. The imbalanced-learn library is quite handy for this, specially useful if you are doing online learning & want to guarantee balanced train data within your pipelines.
from imblearn.pipeline import Pipeline as ImbalancePipeline
model = ImbalancePipeline(steps=[
('data_balancer', RandomOverSampler()),
('classifier', SVC()),
])
Solution 5:[5]
This is my implementation that I use to get train/test data indexes
def get_safe_balanced_split(target, trainSize=0.8, getTestIndexes=True, shuffle=False, seed=None):
classes, counts = np.unique(target, return_counts=True)
nPerClass = float(len(target))*float(trainSize)/float(len(classes))
if nPerClass > np.min(counts):
print("Insufficient data to produce a balanced training data split.")
print("Classes found %s"%classes)
print("Classes count %s"%counts)
ts = float(trainSize*np.min(counts)*len(classes)) / float(len(target))
print("trainSize is reset from %s to %s"%(trainSize, ts))
trainSize = ts
nPerClass = float(len(target))*float(trainSize)/float(len(classes))
# get number of classes
nPerClass = int(nPerClass)
print("Data splitting on %i classes and returning %i per class"%(len(classes),nPerClass ))
# get indexes
trainIndexes = []
for c in classes:
if seed is not None:
np.random.seed(seed)
cIdxs = np.where(target==c)[0]
cIdxs = np.random.choice(cIdxs, nPerClass, replace=False)
trainIndexes.extend(cIdxs)
# get test indexes
testIndexes = None
if getTestIndexes:
testIndexes = list(set(range(len(target))) - set(trainIndexes))
# shuffle
if shuffle:
trainIndexes = random.shuffle(trainIndexes)
if testIndexes is not None:
testIndexes = random.shuffle(testIndexes)
# return indexes
return trainIndexes, testIndexes
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ptyshevs |
| Solution 2 | |
| Solution 3 | |
| Solution 4 | eliangius |
| Solution 5 | Cobry |
