Deep learning network to detect activity in videos

I'm currently working on a project to create a model that detects the type of activity two people in a video are doing. For example, consider a dataset of videos in which the two people are boxing, wrestling, jumping, or just talking with each other. How can I create a model to distinguish these activities from each other?

In particular, what kind of features should be extracted from the videos? Should I use a pre-trained model with ImageNet weights, etc.? I am not sure where to begin. Thank you for your help!



Solution 1:[1]

I suggest starting by reading academic papers to get a sense of the field. This might be a good start, though it might be a little overwhelming if you are a beginner.

If you want to explore a collection of related papers and repositories, this is suggested: link.

If you want to start hands-on, this repository might be a good choice; it is an advanced collection of related implementations for such tasks: link

Here is also a benchmark that might be useful to follow: link
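
On the question of which features to extract: one simple baseline, before moving to the video-specific models covered in the papers, is to sample a few frames per clip, run each through an image feature extractor (for example an ImageNet-pretrained CNN with its classification layer removed), and average the per-frame vectors into one clip-level descriptor that a small classifier can consume. Below is a minimal sketch of that pipeline in plain Python. The function names are hypothetical, and `extract_features` is a stand-in for whatever pretrained network you choose; it is not a real library call.

```python
from typing import Callable, List, Sequence


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples frame indices spread evenly over the clip."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Split the clip into equal segments and take the centre of each one.
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
        for i in range(num_samples)
    ]


def average_pool(features: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame feature vectors into a single clip-level vector."""
    if not features:
        raise ValueError("need at least one feature vector")
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def clip_descriptor(
    frames: Sequence,
    extract_features: Callable[[object], Sequence[float]],
    num_samples: int = 8,
) -> List[float]:
    """Sample frames, extract features per frame, then pool over time.

    extract_features is a placeholder (assumption): in practice it would
    wrap a pretrained image model and return its penultimate-layer output.
    """
    indices = sample_frame_indices(len(frames), num_samples)
    return average_pool([extract_features(frames[i]) for i in indices])
```

Averaging over frames discards motion and temporal order, so this is only a starting point. Activities that look similar in single frames (for instance boxing versus wrestling) usually benefit from models that look at several frames jointly, which is what most of the video-specific architectures in the literature do.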

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Sadra