I want to feed videos as well as their annotations as training data to a TensorFlow model to hopefully get better results

I am training a model to detect drones in videos obtained from a security feed. The dataset consists of videos of drones flying in front of a camera, plus a file with their annotations in the following format: (index, frame_number, no_of_objects, X_co-ordinate, Y_co-ordinate, width, height, class).
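For reference, this is roughly how I read the annotation file into per-frame bounding boxes (a minimal sketch; it assumes a comma-separated layout and that X/Y are pixel coordinates, so adjust the delimiter and parsing if your file differs):

```python
import csv
from collections import defaultdict

def load_annotations(path):
    """Map frame_number -> list of (x, y, width, height, class) boxes."""
    boxes_by_frame = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.reader(f):
            # Column order as described above; index and no_of_objects
            # are read but not needed per box.
            index, frame_number, no_of_objects, x, y, w, h, cls = row
            boxes_by_frame[int(frame_number)].append(
                (float(x), float(y), float(w), float(h), cls)
            )
    return boxes_by_frame
```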

I am aware I can train the model using only the frames as data and the no_of_objects column as the label, but I want to use the coordinate data provided to tell the model where exactly in each frame the drone currently is.
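From what I understand, detection pipelines such as the TensorFlow Object Detection API expect boxes in normalized [ymin, xmin, ymax, xmax] form rather than pixel (X, Y, width, height), so I was planning to convert my annotations along these lines (a sketch, assuming X/Y mark the top-left corner of the box; the math would flip if they mark the center):

```python
def normalize_box(x, y, w, h, frame_w, frame_h):
    """Convert a pixel-space (x, y, width, height) box to
    normalized [ymin, xmin, ymax, xmax] in the range [0, 1]."""
    return [y / frame_h,        # ymin
            x / frame_w,        # xmin
            (y + h) / frame_h,  # ymax
            (x + w) / frame_w]  # xmax
```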

Do I need to design a custom model, or is there an existing library that accepts coordinates as arguments? If the approach I am currently looking at is not optimal, please let me know.


