pandas: create cross-validation folds based on a specific column

I have a dataframe of a few hundred rows that can be grouped by Id as follows:

df = Val1 Val2 Val3 Id
      2     2   8    b
      1     2   3    a
      5     7   8    z
      5     1   4    a
      0     9   0    c
      3     1   3    b
      2     7   5    z
      7     2   8    c
      6     5   5    d
...
      5     1   8    a
      4     9   0    z
      1     8   2    z

I want to use GridSearchCV, but with a custom CV that ensures all rows with the same Id always end up in the same set. So either all the rows of a are in the test set, or all of them are in the train set, and likewise for every other Id.

I want to have 5 folds, so 80% of the Ids go to the train set and 20% to the test set. I understand that this can't guarantee that all folds have exactly the same number of rows, since one Id might have more rows than another.

What is the best way to do so?



Solution 1:[1]

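As stated, the cv parameter accepts an iterator of splits (or a splitter object). You can use GroupShuffleSplit() from sklearn.model_selection, which keeps all rows sharing a group label on the same side of each split. Either generate the splits with it and pass them to GridSearchCV() as the cv parameter, or pass the splitter itself as cv and supply the group labels through the groups argument of fit().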
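A minimal sketch of this approach, assuming a toy dataframe shaped like the one in the question. The data values, the Ridge estimator, the choice of Val3 as the target, and the alpha grid are all placeholders picked for illustration, not something the question specifies.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, GroupShuffleSplit

# Toy data shaped like the question's dataframe (values are made up):
# 10 ids with 6 rows each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Val1": rng.integers(0, 10, size=60),
    "Val2": rng.integers(0, 10, size=60),
    "Val3": rng.integers(0, 10, size=60),
    "Id": np.repeat(list("abcdefghij"), 6),
})

X = df[["Val1", "Val2"]]   # assumed feature columns
y = df["Val3"]             # assumed target column
groups = df["Id"]          # rows with the same Id stay together

# 5 random splits, each holding out about 20% of the ids.
gss = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Pass the splitter as cv and the group labels to fit().
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=gss)
search.fit(X, y, groups=groups)
print(search.best_params_)
```

Equivalently, cv=list(gss.split(X, y, groups)) can be passed instead, in which case fit() does not need the groups argument.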

Solution 2:[2]

As mentioned in the sklearn documentation, GridSearchCV has a parameter called "cv" that accepts "An iterable yielding (train, test) splits as arrays of indices."
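To make the "iterable of (train, test) index arrays" idea concrete, here is a hand-rolled sketch that builds such splits from an Id column. The helper name group_folds and its parameters are hypothetical, and the ids are the visible rows of the question's example. In practice the built-in group splitters in sklearn.model_selection are probably the better choice.

```python
import numpy as np

def group_folds(ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs so that each id lands in exactly one test fold."""
    ids = np.asarray(ids)
    unique_ids = np.unique(ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_ids)  # randomize which ids share a fold
    for fold_ids in np.array_split(unique_ids, n_folds):
        test_mask = np.isin(ids, fold_ids)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Ids taken from the visible rows of the question's example.
ids = np.array(list("bazacbzcdazz"))
splits = list(group_folds(ids, n_folds=5))

for train_idx, test_idx in splits:
    print(sorted(set(ids[test_idx])), len(train_idx), len(test_idx))
```

The resulting list can then be handed over as GridSearchCV(estimator, param_grid, cv=splits). The indices are positional, so they must refer to the same row order as the X and y passed to fit().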

Do check out the documentation in future first.

Solution 3:[3]

As mentioned previously, GroupShuffleSplit() splits data based on group labels. However, its test sets aren't necessarily disjoint (i.e. across multiple splits, an ID may appear in more than one test set). If you want each ID to appear in exactly one test fold, you could use GroupKFold(). This is also available in sklearn.model_selection, and is the group-aware counterpart of KFold: the same group never appears in both the train and test set of a split.
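A short sketch of GroupKFold on the ids visible in the question, checking that every id is held out exactly once. The feature matrix here is a placeholder, since only the group labels matter for how the folds are built.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Ids taken from the visible rows of the question's example (5 distinct ids).
ids = np.array(list("bazacbzcdazz"))
X = np.arange(len(ids) * 2).reshape(len(ids), 2)  # placeholder features

gkf = GroupKFold(n_splits=5)

# Record, for each id, the folds in which it is part of the test set.
test_folds_of = {}
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, groups=ids)):
    for group_id in set(ids[test_idx]):
        test_folds_of.setdefault(str(group_id), []).append(fold)

print(test_folds_of)
```

With GridSearchCV this would look like GridSearchCV(estimator, param_grid, cv=GroupKFold(n_splits=5)).fit(X, y, groups=df["Id"]), where estimator and param_grid stand in for whatever model is being tuned. Note that n_splits cannot exceed the number of distinct ids.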

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 thomaskolasa
Solution 2 Ni Yi Puay
Solution 3