Object detection from synthetic to real-life data with YOLOv5

Currently trying YOLOv5 with custom synthetic data. The dataset we've created consists of 8 different objects. Each object has a minimum of 1500 pictures/labels, where the pictures are split 500/500/500 across normal/fog/distractors-around-object variants. Sample images from the dataset are in the first imgur link. The model is not trained from scratch, but starts from the standard pretrained YOLOv5 .pt weights; a minimal sketch of the dataset config we feed it is below.
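For reference, this is roughly what the dataset config YOLOv5 consumes looks like, written out from Python. The root path and class names here are placeholders, not our actual dataset:

```python
# Hypothetical dataset config for the 8-class synthetic set.
# Paths and class names are placeholders.
import yaml

dataset = {
    "path": "../datasets/synthetic",             # dataset root (assumption)
    "train": "images/train",
    "val": "images/val",
    "nc": 8,                                     # the 8 object classes
    "names": [f"object_{i}" for i in range(8)],  # replace with the real class names
}

with open("synthetic.yaml", "w") as f:
    yaml.safe_dump(dataset, f, sort_keys=False)
```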

So far we've tried:

  • Adding more data (from 300 images per object to 4500)
  • Creating more complex data (distractors on/around objects)
  • Running multiple training runs
  • Training with network sizes small, medium, large, and xlarge
  • Trying different batch sizes from 4 to 32, depending on model size (see the training sketch after this list)
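For context, the size/batch sweep looks roughly like this, assuming the ultralytics/yolov5 repo's train.py is importable from the repo root; the dataset YAML name and epoch count are placeholders:

```python
# Sweep over YOLOv5 model sizes, with batch size scaled down as models grow.
# Assumes this runs from the yolov5 repo root; synthetic.yaml is a placeholder.
import train  # yolov5/train.py exposes run(**kwargs)

for weights, batch in [("yolov5s.pt", 32), ("yolov5m.pt", 16),
                       ("yolov5l.pt", 8), ("yolov5x.pt", 4)]:
    train.run(
        data="synthetic.yaml",   # 8-class synthetic dataset config
        weights=weights,         # start from the pretrained checkpoint, not from scratch
        imgsz=640,
        batch_size=batch,
        epochs=100,
        name=f"syn_{weights.removesuffix('.pt')}_b{batch}",
    )
```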

Everything so far has resulted in good/great detection on synthetic data, but the model is completely off on real-life data. Examples: it thinks a whole picture of unrelated objects is a paper box, that walls are pallets, etc. Quick sample images are in the last imgur link.

Anyone got clues for how to improve the training or the data to be better suited for real-life detection? Or how to better interpret the results? I don't understand how the model concludes that a whole picture, full of unrelated objects, is a box/pallet. (A small inspection sketch follows the links below.)

Results from training uploaded to imgur: https://imgur.com/a/P0TQeBl

Example on real life data: https://imgur.com/a/SGY7w8w
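For interpreting results like the ones linked above, it can help to dump per-detection confidences on a handful of real photos. A minimal sketch, assuming a custom checkpoint loaded through torch.hub (paths are placeholders):

```python
# Inspect what the synthetic-trained model predicts on a real photo.
import torch

model = torch.hub.load("ultralytics/yolov5", "custom",
                       path="runs/train/exp/weights/best.pt")  # placeholder path
model.conf = 0.10                    # low threshold to surface weak detections too

results = model("real_photo.jpg")    # placeholder image; lists of paths also work
results.print()                      # summary: counts per class and confidence
print(results.pandas().xyxy[0])      # per-box table: xmin..ymax, confidence, class
```

If the whole-image boxes show up here with moderate confidence, that is consistent with the texture/background domain gap the answers below describe.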



Solution 1:[1]

There are a couple of things you can do to improve results.

  1. After training your model with synthetic data, fine-tune it on real training data with a smaller learning rate (1/10th, maybe); see the sketch after this list. This reduces the gap between synthetic and real-life images. In some cases, rather than fine-tuning, training the model on mixed (synthetic + real) data produces better results.
  2. Generate images structurally similar to real-life examples. For example, put humans inside forklifts, or pallets or barrels on forks, etc. Models learn from such context.
  3. Randomize the textures on the items you want to detect. Models tend to focus on textures for detection; by randomizing textures with lots of variability, including non-natural occurrences, you force the model to learn to identify objects by something other than texture. Although an object's texture is sometimes a good identifier, synthetic data suffers from not replicating that feature well enough (hence the domain gap), so you reduce its impact on the model's decision.
  4. I am not sure whether the screenshot accurately represents your data-generation distribution, but if it does, you should randomize object angles, sizes, and occlusion amounts more.
  5. As distractors, use objects that you don't want to detect but that will appear in the images you run inference on, rather than simple shapes like spheres.
  6. Randomize lighting more: intensity, color, angles, etc.
  7. Increase background and ground randomization. Use HDRIs; there are lots of free HDRIs available.
  8. Balance your dataset (a quick class-count check is sketched below the image link).
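A minimal sketch of point 1, assuming the ultralytics/yolov5 repo: copy the default hyperparameter file with lr0 scaled to 1/10th, then fine-tune from the synthetic-trained checkpoint. All paths and names here are placeholders:

```python
# Fine-tune the synthetic-trained checkpoint on real images at 1/10th the LR.
import yaml
import train  # yolov5/train.py

# Default hyperparameter file; the exact path can differ between yolov5 versions.
with open("data/hyps/hyp.scratch-low.yaml") as f:
    hyp = yaml.safe_load(f)
hyp["lr0"] *= 0.1  # initial learning rate -> 1/10th for gentle fine-tuning

with open("hyp.finetune_real.yaml", "w") as f:
    yaml.safe_dump(hyp, f)

train.run(
    data="real.yaml",                          # small real-image dataset (placeholder)
    weights="runs/train/syn/weights/best.pt",  # checkpoint from the synthetic run
    hyp="hyp.finetune_real.yaml",
    epochs=30,
    batch_size=16,
    imgsz=640,
    name="finetune_real",
)
```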

https://imgur.com/a/LdCa8aO
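On point 8, a quick way to check for imbalance is to count instances per class across the YOLO-format label files; the labels directory and class names below are placeholders:

```python
# Count labeled instances per class to spot dataset imbalance.
from collections import Counter
from pathlib import Path

LABEL_DIR = Path("datasets/synthetic/labels/train")  # placeholder labels directory
NAMES = [f"object_{i}" for i in range(8)]            # replace with real class names

counts = Counter()
for txt in LABEL_DIR.glob("*.txt"):
    for line in txt.read_text().splitlines():
        if line.strip():
            counts[int(line.split()[0])] += 1        # first field is the class id

for cls_id, n in sorted(counts.items()):
    print(f"{NAMES[cls_id]:>12}: {n}")
```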

Solution 2:[2]

Checking your results, the answer is that your synthetic data is way too dissimilar to the real-life data you want it to work on. Generating synthetic scenes that are closer to your real-life counterparts and training again would clearly improve your results. That includes more realistic backgrounds and scene compositions. I don't know if your training set resembles the validation images you shared here, but in case it does, try to have more objects per image, closer to the camera, and add variation to their relative positions; a compositing sketch follows below. Having just one random 3D object in the middle of an image is not going to produce good results. By the way, you are already overfitting your models, so more training images wouldn't help at this point.
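To illustrate the kind of denser, more varied composition suggested here, a toy cut-and-paste sketch with PIL: several RGBA object cutouts pasted onto a realistic background at random scales and positions. File names and the cutout format are assumptions, not the asker's pipeline, and you would also have to emit the matching YOLO label line for every pasted box:

```python
# Toy cut-and-paste compositing: several objects per image, varied scale/position.
import random
from PIL import Image

background = Image.open("backgrounds/warehouse.jpg").convert("RGB")  # placeholder
cutouts = ["cutouts/pallet.png", "cutouts/box.png", "cutouts/barrel.png"]  # RGBA crops

for _ in range(random.randint(3, 6)):              # more than one object per scene
    obj = Image.open(random.choice(cutouts)).convert("RGBA")
    scale = random.uniform(0.3, 1.2)               # vary apparent camera distance
    obj = obj.resize((max(1, int(obj.width * scale)),
                      max(1, int(obj.height * scale))))
    x = random.randint(0, max(0, background.width - obj.width))
    y = random.randint(0, max(0, background.height - obj.height))
    background.paste(obj, (x, y), obj)             # use the alpha channel as mask

background.save("composited.jpg")
```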

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: (author not listed)
Solution 2: Mr K.