Difference between GroupShuffleSplit and GroupKFold

As the title says, I want to know the difference between sklearn's GroupKFold and GroupShuffleSplit.

Both produce train/test splits for data that has a group ID, so that no group is divided between train and test. I checked one train/test split from each class and both seem to stratify well, but it would be great if someone could confirm that every split does.

I made a test with both, for 10 splits:

from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=10, train_size=0.8, random_state=42)

for train_idx, test_idx in gss.split(X, y, groups):
    print("train:", train_idx, "test:", test_idx)

train: [ 1  2  3  4  5 11 12 13 14 15 16 17 19 20] test: [ 0  6  7  8  9 10 18]
train: [ 1  2  3  4  5  6  7  8  9 10 12 13 14 18 19 20] test: [ 0 11 15 16 17]
train: [ 0  1  3  4  5  6  7  8  9 10 12 13 14 18 19 20] test: [ 2 11 15 16 17]
train: [ 0  2  3  4 11 12 13 14 15 16 17 18 19 20] test: [ 1  5  6  7  8  9 10]
train: [ 0  1  3  4  5  6  7  8  9 10 11 15 16 17 19 20] test: [ 2 12 13 14 18]
train: [ 1  2  3  4  5  6  7  8  9 10 11 15 16 17 18] test: [ 0 12 13 14 19 20]
train: [ 0  1  2  3  4  6  7  8  9 10 11 12 13 14 15 16 17] test: [ 5 18 19 20]
train: [ 0  1  3  4  6  7  8  9 10 11 15 16 17 18 19 20] test: [ 2  5 12 13 14]
train: [ 0  1  3  4  5 12 13 14 15 16 17 18 19 20] test: [ 2  6  7  8  9 10 11]
train: [ 0  2  3  4  5 11 12 13 14 15 16 17 19 20] test: [ 1  6  7  8  9 10 18]

 

from sklearn.model_selection import GroupKFold

group_kfold = GroupKFold(n_splits=10)

for train_idx, test_idx in group_kfold.split(X, y, groups):
    print("train:", train_idx, "test:", test_idx)

train: [ 0  1  2  3  4  5 11 12 13 14 15 16 17 18 19 20] test: [ 6  7  8  9 10]
train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 18 19 20] test: [15 16 17]
train: [ 0  1  2  3  4  5  6  7  8  9 10 11 15 16 17 18 19 20] test: [12 13 14]
train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18] test: [19 20]
train: [ 0  1  2  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] test: [3 4]
train: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 19 20] test: [ 0 18]
train: [ 0  1  2  3  4  5  6  7  8  9 10 12 13 14 15 16 17 18 19 20] test: [11]
train: [ 0  1  2  3  4  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] test: [5]
train: [ 0  1  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] test: [2]
train: [ 0  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] test: [1]


Solution 1:[1]

The documentation for the non-group versions makes this clearer. KFold partitions the data into k folds and then combines those into different train/test splits, with each fold serving as the test set once, whereas ShuffleSplit repeatedly draws the train/test splits directly. In particular, each sample is tested exactly once in KFold, but can be tested zero or multiple times in ShuffleSplit.
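To see the same behaviour in the group-aware classes, here is a minimal sketch that counts how often each sample ends up in a test set. The toy data (12 samples in 4 equal groups) and the helper name `count_test_appearances` are made up for illustration and are not from the question.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Hypothetical toy data: 12 samples in 4 equal-sized groups.
X = np.arange(12).reshape(-1, 1)
y = np.zeros(12)
groups = np.repeat([0, 1, 2, 3], 3)


def count_test_appearances(splitter):
    """Count how many times each sample lands in a test set."""
    counts = np.zeros(len(X), dtype=int)
    for _, test_idx in splitter.split(X, y, groups):
        counts[test_idx] += 1
    return counts


gkf_counts = count_test_appearances(GroupKFold(n_splits=4))
gss_counts = count_test_appearances(
    GroupShuffleSplit(n_splits=4, test_size=0.25, random_state=0)
)

# GroupKFold: every sample is tested exactly once.
print("GroupKFold:       ", gkf_counts)
# GroupShuffleSplit: each split is drawn independently, so a sample
# may be tested zero, one, or several times.
print("GroupShuffleSplit:", gss_counts)
```

With GroupKFold every count is 1. With GroupShuffleSplit the counts depend on the random seed, and samples from the same group always share the same count because groups move as a unit.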

Solution 2:[2]

The scikit-learn documentation gives a good overview of how both methods work. Here is an illustration from the same documentation that makes the difference easier to understand.

[Figures: StratifiedShuffleSplit and GroupKFold]

Solution 3:[3]

I also ran this test with

    X = array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
    y = array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
    groups = array([1., 1., 1., 1., 1., 2., 2., 2., 2., 2., 0., 0., 0., 0., 0., 3., 3., 3., 3., 3.])

GroupShuffleSplit (GSS) gave me this result:

      train: [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] test: [0 1 2 3 4]
      train: [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] test: [0 1 2 3 4]
      train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14] test: [15 16 17 18 19]
      train: [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] test: [0 1 2 3 4]
      train: [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19] test: [5 6 7 8 9]
      train: [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19] test: [10 11 12 13 14]
      train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14] test: [15 16 17 18 19]
      train: [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19] test: [10 11 12 13 14]
      train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14] test: [15 16 17 18 19]
      train: [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19] test: [10 11 12 13 14]

It looks like groups are preserved. The test sets always have the same size (contrary to your example, because my groups are all the same size), and a group can appear in several different test sets.

GroupKFold (GKF) result:

train: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14] test: [15 16 17 18 19]
train: [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19] test: [5 6 7 8 9]
train: [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] test: [0 1 2 3 4]
train: [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19] test: [10 11 12 13 14]

Each group can appear in a test set only once, so n_splits can be at most 4 here, because I only have 4 groups.
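A short sketch of that constraint, assuming the same 20-sample, 4-group layout as above (rebuilt here with `np.repeat`): asking GroupKFold for more folds than there are groups raises a `ValueError`, while GroupShuffleSplit accepts any number of splits because each one is drawn independently. The check at the end also confirms that no group ends up on both sides of any split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Same layout as the data above: 20 samples, 4 groups of 5.
X = np.arange(20)
y = np.ones(20)
groups = np.repeat([1, 2, 0, 3], 5)

# GroupKFold cannot produce more folds than there are distinct groups.
try:
    list(GroupKFold(n_splits=10).split(X, y, groups))
    gkf_error = None
except ValueError as exc:
    gkf_error = exc
print("GroupKFold(n_splits=10):", gkf_error)

# GroupShuffleSplit draws each split independently, so 10 splits are fine.
gss = GroupShuffleSplit(n_splits=10, train_size=0.8, random_state=42)
gss_splits = list(gss.split(X, y, groups))

# In every split, no group appears in both train and test.
no_overlap = all(
    set(groups[train_idx]).isdisjoint(groups[test_idx])
    for train_idx, test_idx in gss_splits
)
print("splits:", len(gss_splits), "groups kept intact:", no_overlap)
```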

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ben Reiniger
Solution 2 CVname
Solution 3