Object Detection without annotations and labels

Problem Statement:

I am given two sets of images. None of the images in either set has annotations or labels.

First set: images of grocery store shelves (captured in the grocery stores).

Second set: close-up images of the products kept on those store shelves.

What I am trying to achieve:

Given a product image (from the second set), I want to first locate that product in the shelf images (first set) and then predict a bounding box around it.

Visually:

Input: a product image
Output: the corresponding shelf image with a bounding box around that product

My approach:

  1. For each product image, first find all the shelf image(s) which contain that product.
  2. Then predict a bounding box by finding the location of the product in the shelf image.
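The two steps above can be attacked without any labels using classical template matching: slide the product image over the shelf image and keep the window with the highest normalized cross-correlation. Below is a minimal pure-NumPy sketch of step 2 (localizing one product in one shelf image); `match_template` is a hypothetical helper, not part of any library, and it assumes the product appears at roughly the same scale as the crop. In practice you would use something like `cv2.matchTemplate`, or feature matching (e.g. SIFT keypoints plus a homography) for robustness to scale and viewpoint changes.

```python
import numpy as np

def match_template(shelf, product):
    """Slide `product` over `shelf` (both 2-D grayscale float arrays) and
    return the bounding box (x, y, w, h) of the window with the highest
    normalized cross-correlation, along with that correlation score.
    Brute-force O(H*W*h*w) sketch -- illustrative, not production code."""
    ph, pw = product.shape
    sh, sw = shelf.shape
    p = product - product.mean()          # zero-mean template
    p_norm = np.sqrt((p ** 2).sum())
    best_score, best_xy = -np.inf, (0, 0)
    for y in range(sh - ph + 1):
        for x in range(sw - pw + 1):
            win = shelf[y:y + ph, x:x + pw]
            w = win - win.mean()          # zero-mean window
            denom = np.sqrt((w ** 2).sum()) * p_norm
            score = (w * p).sum() / denom if denom > 0 else 0.0
            if score > best_score:
                best_score, best_xy = score, (x, y)
    x, y = best_xy
    return (x, y, pw, ph), best_score

# Toy usage: plant the "product" inside a synthetic "shelf" and recover it.
shelf = np.random.rand(40, 60)
product = shelf[10:20, 25:40].copy()
box, score = match_template(shelf, product)
print(box, round(score, 3))  # box should be (25, 10, 15, 10)
```

The same score, thresholded over all shelf images, also gives a crude version of step 1 (retrieving which shelf images contain the product), and the resulting boxes could then serve as pseudo-labels for training a detector such as YOLOv5.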

I am using YOLOv5 for this task, but I am not sure how I should start, given that I have to work without annotations or labels.

I have come across terms like Zero-Shot Learning, Self-Supervised Object Detection, etc., but I haven't been able to figure out how to use them as a starting point.

A similar question has been asked, but I am not sure its answer solves this problem.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
