How to split folders into 3 datasets with ImageDataGenerator?

The validation_split parameter allows ImageDataGenerator to split the data read from a folder into 2 disjoint sets. Is there any way to create 3 sets - training, validation, and evaluation datasets - using it?

I am thinking about splitting the dataset into 2 datasets, then splitting the 2nd dataset into another 2 datasets:

datagen = ImageDataGenerator(validation_split=0.5, rescale=1./255)

train_generator = datagen.flow_from_directory(
    TRAIN_DIR, 
    subset='training'
)

val_generator = datagen.flow_from_directory(
    TRAIN_DIR,
    subset='validation'
)

Here I am thinking about splitting the validation dataset into 2 sets using val_generator: one for validation and the other for evaluation. How should I do it?



Solution 1:[1]

I like working with the flow_from_dataframe() method of ImageDataGenerator, where I interact with a simple Pandas DataFrame (perhaps containing other features), not with the directory. But you can easily change my code if you insist on flow_from_directory().

So this is my go-to function, e.g. for a regression task, where we try to predict a continuous y:

def get_generators(train_samp, test_samp, validation_split = 0.1):
    train_datagen = ImageDataGenerator(validation_split=validation_split, rescale = 1. / 255)
    test_datagen = ImageDataGenerator(rescale = 1. / 255)
    
    train_generator = train_datagen.flow_from_dataframe(
        dataframe = images_df[images_df.index.isin(train_samp)],
        directory = images_dir,
        x_col = 'img_file',
        y_col = 'y',
        target_size = (IMG_HEIGHT, IMG_WIDTH),
        class_mode = 'raw',
        batch_size = batch_size,
        shuffle = True,
        subset = 'training',
        validate_filenames = False
    )
    valid_generator = train_datagen.flow_from_dataframe(
        dataframe = images_df[images_df.index.isin(train_samp)],
        directory = images_dir,
        x_col = 'img_file',
        y_col = 'y',
        target_size = (IMG_HEIGHT, IMG_WIDTH),
        class_mode = 'raw',
        batch_size = batch_size,
        shuffle = False,
        subset = 'validation',
        validate_filenames = False
    )

    test_generator = test_datagen.flow_from_dataframe(
        dataframe = images_df[images_df.index.isin(test_samp)],
        directory = images_dir,
        x_col = 'img_file',
        y_col = 'y',
        target_size = (IMG_HEIGHT, IMG_WIDTH),
        class_mode = 'raw',
        batch_size = batch_size,
        shuffle = False,
        validate_filenames = False
    )
    return train_generator, valid_generator, test_generator

Things to notice:

  • I use two ImageDataGenerator objects: train_datagen (with a validation_split) for the training and validation sets, and test_datagen for the test set. A usage sketch follows below.
  • The inputs to the function are the train/test indices (such as those returned by Sklearn's train_test_split), which are used to filter the DataFrame index.
  • The function also takes a validation_split parameter for the training generator.
  • images_df is a DataFrame somewhere in global memory with the proper columns, such as img_file and y.
  • There is no need to shuffle the validation and test generators.

This can be further generalized for multiple outputs, classification, what have you.
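For illustration, here is a minimal usage sketch of the function above. It assumes the same globals as the answer (images_df, images_dir, IMG_HEIGHT, IMG_WIDTH, batch_size) and takes the train/test indices from Sklearn's train_test_split; the 80/20 split and the 10% validation fraction are example values, not part of the original answer.

from sklearn.model_selection import train_test_split

# Hypothetical split of the DataFrame index into train/test indices (80/20 here)
train_samp, test_samp = train_test_split(images_df.index, test_size=0.2, random_state=42)

# The training generator itself holds out 10% of the training rows as the validation subset
train_generator, valid_generator, test_generator = get_generators(
    train_samp, test_samp, validation_split=0.1
)

# model.fit(train_generator, validation_data=valid_generator, epochs=10)
# model.evaluate(test_generator)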

Solution 2:[2]

I have mostly been splitting data 80/10/10 into training, validation and test sets respectively.

When working with Keras, I favor the tf.data API, as it provides a good abstraction for complex input pipelines.

It does not provide a simple tf.data.Dataset.split functionality, though.

I have this function (which I found in someone else's code, though I have lost the source) that I use consistently:

import tensorflow as tf

def get_dataset_partitions_tf(ds: tf.data.Dataset, ds_size, train_split=0.8, val_split=0.1, test_split=0.1, shuffle=True, shuffle_size=10000):
    assert (train_split + test_split + val_split) == 1

    if shuffle:
        # Specify seed to always have the same split distribution between runs
        ds = ds.shuffle(shuffle_size, seed=12)

    train_size = int(train_split * ds_size)
    val_size = int(val_split * ds_size)

    train_ds = ds.take(train_size)
    val_ds = ds.skip(train_size).take(val_size)
    test_ds = ds.skip(train_size).skip(val_size)

    return train_ds, val_ds, test_ds

First, read your dataset and get its size (with the cardinality method), then pass it into the function and you're good to go!
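As a rough sketch of that (assuming a recent TensorFlow where tf.keras.utils.image_dataset_from_directory is available; the directory path and image size below are placeholders, not from the original answer), the usage might look like this:

import tensorflow as tf

# Read the full labelled dataset from disk; batched with the default batch size
ds = tf.keras.utils.image_dataset_from_directory(
    'path/to/images',        # placeholder directory
    image_size=(256, 256),   # placeholder image size
    batch_size=32
)

# cardinality() counts dataset elements; with batching enabled this is the number
# of batches, so the 80/10/10 split is applied batch-wise
ds_size = int(ds.cardinality().numpy())

train_ds, val_ds, test_ds = get_dataset_partitions_tf(ds, ds_size)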

The function takes a shuffle flag to shuffle the original dataset before creating the splits; this is useful for more realistic validation and test metrics.

The shuffle seed is fixed so that repeated runs produce the same splits, which we want for consistent results.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Giora Simchoni
Solution 2: bjornaer