How do I compare two voice samples on iOS?

First of all, I'd like to state that my question is not, per se, about the "classic" definition of voice recognition.

What we are trying to do is somewhat different:

  1. The user records a command.
  2. Later, when the user speaks the pre-recorded command, a certain action occurs.

For example, I record a voice command for calling my mom, so I click on her and say "Mom". Then when I use the program and say "Mom", it will automatically call her.

How would I perform the comparison of a spoken command to a saved voice sample?

EDIT: We have no need for any "text-to-speech" abilities, solely a comparison of sound signals. Obviously we're looking for some sort of off-the-shelf product or framework.



Solution 1:[1]

One way this is done for music recognition is to take a time sequence of frequency spectra (time-windowed STFT FFTs) of the two sounds in question, map the locations of the frequency peaks over the time axis, and cross-correlate the two 2D time-frequency peak maps to look for a match. This is far more robust than cross-correlating the two raw sound samples, because the peaks change far less than the spectral "cruft" between them. The method works better if the rate and pitch of the two utterances haven't changed too much.

On iOS 4.x, you can use the Accelerate framework for the FFTs, and maybe for the 2D cross-correlations as well.
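
To make the peak-matching idea concrete, here is a rough sketch in plain Python rather than iOS code. It is a simplified stand-in, not the full method: it keeps only the single strongest frequency bin per frame and slides the two peak tracks against each other along the time axis only. The names `peak_track`, `peak_match_score` and `tone`, the frame and hop sizes, and the synthetic test tones are all assumptions made for illustration. On a device you would presumably replace the naive DFT with Accelerate's FFT routines.

```python
import cmath
import math

def peak_track(samples, frame=64, hop=32):
    """Return the strongest positive-frequency bin of each Hann-windowed frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame - 1)) for n in range(frame)]
    peaks = []
    for start in range(0, len(samples) - frame + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(frame)]
        mags = []
        for k in range(1, frame // 2):  # naive DFT, skipping the DC bin
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame)
                      for n in range(frame))
            mags.append(abs(acc))
        peaks.append(1 + mags.index(max(mags)))
    return peaks

def peak_match_score(peaks_a, peaks_b, max_lag=4, tol=1):
    """Best fraction of frames whose peak bins agree (within tol) over small time lags."""
    longest = max(len(peaks_a), len(peaks_b))
    if longest == 0:
        return 0.0
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        hits = 0
        for i, bin_a in enumerate(peaks_a):
            j = i + lag
            if 0 <= j < len(peaks_b) and abs(bin_a - peaks_b[j]) <= tol:
                hits += 1
        best = max(best, hits / longest)
    return best

def tone(bin_index, length, frame=64):
    """Synthetic test tone centred on one DFT bin (illustration only)."""
    return [math.sin(2 * math.pi * bin_index * n / frame) for n in range(length)]

# "saved" and "spoken" share the same two-tone pattern; "spoken" is quieter and delayed.
saved = tone(5, 256) + tone(12, 256)
spoken = [0.5 * s for s in tone(5, 288) + tone(12, 256)]
other = tone(20, 256) + tone(9, 256)

score_same = peak_match_score(peak_track(saved), peak_track(spoken))
score_other = peak_match_score(peak_track(saved), peak_track(other))
```

Because only peak positions are compared, the quieter, delayed copy still scores high while the unrelated pattern scores low. Real speech would need several peaks per frame, silence handling, and a threshold tuned on actual recordings.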

Solution 2:[2]

Try using a third-party library, like OpenEars for iOS applications. You could have users record a voice sample and save it as translated text, or just let them enter text for recognition.

Solution 3:[3]

I think you'd have to perform some sort of cross-correlation to determine how similar the two signals are (assuming it'll be the same user speaking, of course). I'm just typing this answer out to see if it helps, but I'd wait for a better answer from someone else; my signal processing skills are close to zero.
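
For reference, a minimal sketch of what such a cross-correlation could look like in plain Python, assuming both recordings are already available as lists of float samples at the same sample rate. The function name `normalized_xcorr_peak` and the `max_lag` parameter are made up for this example. As Solution 1 points out, correlating raw samples like this is fragile for real speech, so treat it as a baseline only.

```python
import math

def normalized_xcorr_peak(a, b, max_lag):
    """Largest normalized cross-correlation of a and b over lags in [-max_lag, max_lag]."""
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if norm == 0:
        return 0.0  # at least one signal is silent
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                acc += x * b[j]
        best = max(best, acc / norm)
    return best

# Synthetic check: the same sine delayed by 3 samples vs. an unrelated sine.
reference = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
delayed = [0.0, 0.0, 0.0] + reference
unrelated = [math.sin(2 * math.pi * 13 * n / 100) for n in range(100)]

match = normalized_xcorr_peak(reference, delayed, max_lag=5)
mismatch = normalized_xcorr_peak(reference, unrelated, max_lag=5)
```

A value near 1 means the two signals line up almost exactly at some lag; values near 0 mean little similarity. Where to put the accept/reject threshold would have to be found experimentally.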

Solution 4:[4]

I'm not sure if your question is about the DSP or about how to do it on the iPhone. If it is the latter, I would start with the SpeakHere sample project that Apple provides. That way the interface for recording the voice to a file is already done, which will save you a lot of trouble.

Solution 5:[5]

I'm using ViSQOL for this purpose. The docs say it works best with short samples, ideally 5-10 seconds. You also need to prepare the files: they must be .wav files at the sample rate ViSQOL expects. You can easily convert your files to the desired format with ffmpeg. https://github.com/google/visqol
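
As a sketch of the conversion step, the helper below builds an ffmpeg command line that resamples a recording to a mono .wav file. The helper name `ffmpeg_to_wav_command` is hypothetical, and the 16000 Hz default and the mono downmix are assumptions on my part; check the ViSQOL README for the rate the mode you use actually requires.

```python
import subprocess

def ffmpeg_to_wav_command(src, dst, sample_rate=16000):
    """Build an ffmpeg command converting src to a mono WAV at sample_rate Hz.

    The default rate is an assumption; use whatever your ViSQOL mode expects.
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input file
        "-ac", "1",               # downmix to one channel
        "-ar", str(sample_rate),  # resample to the target rate
        dst,
    ]

command = ffmpeg_to_wav_command("command.m4a", "command.wav")
# To actually run the conversion (requires ffmpeg on the PATH):
# subprocess.run(command, check=True)
```

Both the reference and the test recording should go through the same conversion so they end up in an identical format before being compared.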

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Dominic
Solution 3 Tejaswi Yerukalapudi
Solution 4 Eric Brotto
Solution 5 eva