RNN training based on images

I am new to training neural networks. My task is to recognize emotions from voice (audio) data. I'm trying to build the simplest possible recurrent network in MATLAB, and I feed it spectrogram images as input. However, the network's accuracy is only about 40%, which I understand is not an acceptable result. I would appreciate advice and recommendations on which direction to look in to get a better result.
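
For concreteness, here is a minimal sketch of the kind of setup described above. It is not the code actually used, and it rests on several assumptions that the post does not state: each spectrogram is treated as a sequence of time frames with one feature vector of frequency bins per frame, the number of frequency bins (64), the number of emotion classes (4), and the random stand-in data are all hypothetical, and the Deep Learning Toolbox is assumed to be available.

```matlab
% Hypothetical sizes; the post does not give the real ones.
numFreqBins = 64;   % assumed frequency bins per spectrogram frame
numClasses  = 4;    % assumed number of emotion classes
numClips    = 40;   % assumed number of training clips

% Synthetic stand-in data: each clip is a numFreqBins-by-numFrames matrix,
% i.e. a spectrogram whose columns are the time steps of the sequence.
rng(0);
XTrain = cell(numClips, 1);
labels = zeros(numClips, 1);
for i = 1:numClips
    numFrames = randi([50 100]);              % clips may differ in length
    XTrain{i} = randn(numFreqBins, numFrames);
    labels(i) = mod(i - 1, numClasses) + 1;   % every class is represented
end
YTrain = categorical(labels, 1:numClasses);

% A very small recurrent classifier: one LSTM layer that outputs its
% last hidden state, followed by a fully connected classification head.
layers = [
    sequenceInputLayer(numFreqBins)
    lstmLayer(100, 'OutputMode', 'last')
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];

options = trainingOptions('adam', ...
    'MaxEpochs', 5, ...
    'MiniBatchSize', 8, ...
    'Shuffle', 'every-epoch', ...
    'Verbose', false);

net = trainNetwork(XTrain, YTrain, layers, options);

% Accuracy on the training clips only; with random stand-in data this
% number says nothing about real performance.
YPred = classify(net, XTrain, 'MiniBatchSize', 8);
fprintf('Training accuracy: %.2f\n', mean(YPred == YTrain));
```

With real data, `XTrain` would hold the actual spectrograms and `YTrain` the emotion labels, and accuracy would be measured on held-out clips rather than on the training set.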



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
