Creating appropriate dataset for tensorflow multilabel classification model: "ValueError: setting an array element with a sequence"

I'm trying to train a multilabel image classification model, based on this model:

https://tfhub.dev/google/imagenet/inception_v3/feature_vector/5

And taking some code from here:

https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/tf2_image_retraining.ipynb#scrollTo=umB5tswsfTEQ

My model is created like this:

model_url = "https://tfhub.dev/google/imagenet/mobilenet_v3_large_075_224/feature_vector/5"  

model = tf.keras.Sequential([
    # Explicitly define the input shape so the model can be properly
    # loaded by the TFLiteConverter
    tf.keras.layers.InputLayer(input_shape=IMAGE_SIZE + (3,)),
    hub.KerasLayer(model_url, trainable=do_fine_tuning),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(len(class_names),
                          kernel_regularizer=tf.keras.regularizers.l2(0.0001))
])

model.build((None,) + IMAGE_SIZE + (3,))
model.summary()

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.005, momentum=0.9),
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True, label_smoothing=0.1),
    metrics=['accuracy'])

My data comes from a data generator that pulls from a MongoDB database. The generator's __data_generation method looks like this:

def __data_generation(self, list_paths, list_paths_wo_ext):
    'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
    # Initialization
    num_items = min(self.batch_size, len(list_paths))
    x = np.empty((num_items, self.target_size[1], self.target_size[0], 3))

    y = np.empty((num_items,))

    # Generate data
    for i, ID in enumerate(list_paths):
        # Store sample
        img = Image.open(ID)  # type: Image
        img.load()  # required for png.split()

        if os.path.isfile(resize_cache_path):
            resized = Image.open(resize_cache_path)
            resized.load()
        else:
            resized = img.resize(self.target_size)
            resized.save(resize_cache_path)
        x[i, ] = resized
        y[i, ] = self.targets.loc[ID].values

    return x, y

On y[i, ] = self.targets.loc[ID].values I'm getting a ValueError however.

The plot thickened when I downloaded a sample dataset as used by the colab:

data_dir = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)

IMAGE_SIZE = (224, 224)


def build_dataset(subset):
    return tf.keras.preprocessing.image_dataset_from_directory(
        data_dir,
        validation_split=.20,
        subset=subset,
        label_mode="categorical",
        # Seed needs to be provided when using validation_split and shuffle=True.
        # A fixed seed is used so that the validation set is stable across runs.
        seed=123,
        image_size=IMAGE_SIZE,
        batch_size=1)
    
    
train_ds = build_dataset("training")

for x, y in train_ds.as_numpy_iterator():
    print(y)

This prints a result like:

[[0. 0. 1. 0. 0.]]

I didn't know how this applied to the multilabel scenario so I ran the following code:

for x, y in train_ds.as_numpy_iterator():
    total = sum(y[0])
    if total > 1:
        print(total)

There was no output, so it appears either I've misunderstood or this dataset simply isn't multilabel.

My question is: how can I create a custom multilabel dataset from my own database for consumption by the model listed above?

Solution 1:^[1]

"On y[i, ] = self.targets.loc[ID].values I'm getting a ValueError however."

What does the error say? If it is "setting an array element with a sequence", the likely cause is the shape of y: np.empty((num_items,)) allocates a 1-D array with one scalar slot per sample, while self.targets.loc[ID].values is a whole vector. In that case you might want to initialize y explicitly as a matrix via y = np.empty((num_items, len(class_names))). self.targets.loc[ID].values should then be a vector of length len(class_names) that is 0 everywhere except at the indices of the classes to which the input sample with that ID belongs.
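As a rough illustration of that shape mismatch, here is a minimal NumPy-only sketch. The values of num_items and num_classes and the contents of label_row are made up for the example; label_row merely stands in for whatever self.targets.loc[ID].values returns in the question's generator. It uses np.zeros rather than np.empty so that unfilled rows are not left with arbitrary values.

```python
import numpy as np

num_items = 2      # hypothetical batch size
num_classes = 3    # stands in for len(class_names)

# Hypothetical multi-hot label row, standing in for self.targets.loc[ID].values
label_row = np.array([1.0, 0.0, 1.0])

# 1-D target array: each slot holds a single scalar, so a vector does not fit
y_flat = np.empty((num_items,))
error_message = None
try:
    y_flat[0, ] = label_row
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)

# 2-D target array: one row of num_classes entries per sample
y = np.zeros((num_items, num_classes), dtype=np.float32)
y[0, ] = label_row
print(y[0])
```

The same assignment that fails on the 1-D array succeeds once y has a second axis of length num_classes.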

"There was no output, so it appears either I've misunderstood or this dataset simply isn't multilabel."

The dataset is not multilabel since each input image is only assigned to one of the 5 flower classes (https://www.tensorflow.org/tutorials/load_data/images). Thus, there is no output in your loop since the sum of y[0] is always exactly 1.
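To make the contrast concrete, the following sketch builds multi-hot targets from per-image label lists and repeats the row-sum check from the question. The annotations dictionary, the file names, and the multi_hot helper are hypothetical; in the question's setup the label lists would come from the MongoDB records instead. The class names are the five flower folders, used here only for illustration.

```python
import numpy as np

class_names = ["daisy", "dandelion", "roses", "sunflowers", "tulips"]
class_index = {name: i for i, name in enumerate(class_names)}


def multi_hot(labels):
    """Return a 0/1 vector with a 1 at every class listed in labels."""
    vec = np.zeros(len(class_names), dtype=np.float32)
    for name in labels:
        vec[class_index[name]] = 1.0
    return vec


# Hypothetical annotations: image path -> list of classes present in the image
annotations = {
    "img_001.jpg": ["roses"],            # single label, looks one-hot
    "img_002.jpg": ["daisy", "tulips"],  # two labels, row sums to 2
}

y = np.stack([multi_hot(labels) for labels in annotations.values()])
row_sums = y.sum(axis=1)
print(y)
print(row_sums)
```

A row sum above 1 is what the loop in the question was looking for, and it can only appear when a sample carries more than one label. As a general observation rather than something stated in this answer, targets of this kind are usually paired with a per-class sigmoid output and binary cross-entropy instead of the categorical cross-entropy used in the question's compile call.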

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	WGierke
