What's the utility of the audio embeddings from Google AudioSet for audio classification?
I have extracted the audio embeddings from the Google AudioSet corpus (https://research.google.com/audioset/dataset/index.html). The embeddings are stored as "bytes_list" features, similar to the following:
```
feature {
  bytes_list {
    value: "#\226]\006(N\223K\377\207\r\363\333\377\000Y\322v9\351\303\000\377\311\375\215E\342\377J\000\000_\000\370\222:\270\377\357\000\245\000\377\213jd\267\353\377J\033$\273\267\307\035\377\000\207\244Q\000\000\206\000\000\312\356<R\325g\303\356\016N\224\377\270\377\237\240\377\377\321\252j\357O\217\377\377,\330\000\377|\246\000\013\034\000\377\357\212\267\300b\000\000\000\251\236\000\233\035\000\326\377\327\327\377\377\223\0009{"
  }
}
```
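For concreteness, this is roughly how I am decoding the records at the moment (a sketch assuming the released TFRecord layout, where each tf.train.SequenceExample has a "labels" context feature and an "audio_embedding" feature list of 128-byte frames; the file path and the [-2, 2] dequantization range are assumptions taken from the VGGish post-processing notes):

```python
import tensorflow as tf

# Sketch of decoding an AudioSet feature file. Key names follow the released
# TFRecord layout; "bal_train/xy.tfrecord" is a placeholder path.
def parse_sequence_example(serialized):
    context, sequence = tf.io.parse_single_sequence_example(
        serialized,
        context_features={
            "video_id": tf.io.FixedLenFeature([], tf.string),
            "labels": tf.io.VarLenFeature(tf.int64),
        },
        sequence_features={
            "audio_embedding": tf.io.FixedLenSequenceFeature([], tf.string),
        },
    )
    # Each frame is 128 bytes: one 8-bit quantized value per embedding dim.
    embeddings = tf.io.decode_raw(sequence["audio_embedding"], tf.uint8)
    embeddings = tf.reshape(embeddings, [-1, 128])  # (num_frames, 128)
    # Undo the 8-bit quantization; the [-2, 2] range is an assumption
    # based on the VGGish post-processing description.
    embeddings = tf.cast(embeddings, tf.float32) / 255.0 * 4.0 - 2.0
    labels = tf.sparse.to_dense(context["labels"])
    return embeddings, labels

dataset = tf.data.TFRecordDataset(["bal_train/xy.tfrecord"])  # placeholder
dataset = dataset.map(parse_sequence_example)
```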
From the documentation and forum discussions, I learnt that these embeddings are the output of a pretrained model (the VGGish CNN, which operates on log-mel spectrogram features) applied to 10-second chunks of the respective YouTube videos. I have also learnt that these embeddings make it easier to build deep learning models. How exactly do they help ML engineers?
My confusion is this: if these audio embeddings are already the output of a pre-trained model, what are they useful for? That is, how can I use these embeddings to train more advanced models for Sound Event Detection?
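For example, is the intended workflow something like the sketch below: keep the released embeddings as fixed input features and train only a lightweight classifier head on top of them? (The 527-class output matches the released AudioSet label set; the pooling and layer sizes are purely illustrative.)

```python
import tensorflow as tf

NUM_CLASSES = 527  # size of the released AudioSet label set

# Illustrative classifier head over the fixed embeddings: average the
# 10 frame-level vectors of a clip, then predict multi-label scores.
inputs = tf.keras.Input(shape=(10, 128))             # 10 frames x 128 dims
x = tf.keras.layers.GlobalAveragePooling1D()(inputs)
x = tf.keras.layers.Dense(512, activation="relu")(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(curve="PR")])
```

I chose sigmoid outputs with a binary cross-entropy loss because each 10-second clip can carry several labels at once, which seems to match the multi-label nature of the corpus.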
An explanation of how the AudioSet corpus was built and of its utility would be much appreciated.