iOS (Swift) Speech Transcription - capturing last word/number during continuous transcription
I am struggling a bit with a speech app I am working on. I followed the example for creating a speech recognizer on Apple's developer website, and my code is below. It is working well, and I am getting continuous recognition as expected.
However, my app idea requires me to capture each number in a series of numbers as they are spoken. With the code below, I can speak a long series of numbers (e.g. "2, 5, 3, 7, 10, 6...") and, once I stop, it will eventually return an SFTranscription array with transcriptions holding segments for each number I spoke. I say "eventually" because the speech recognizer is constantly trying to assemble an intelligible response in human-language formats (in this case, phone numbers, larger multi-digit numbers, etc.), which is what it should do for dictation. But I would like to get each word (number) as it is spoken, before the recognizer tries to make sense of it. Is there a way to grab the last word before the recognizer attempts to relate it to all the words prior?
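One direction that seems workable (a sketch of my own, not from Apple's sample; the type and method names are made up): remember the segment substrings from the previous partial result and emit only the suffix that is new. The catch is that the recognizer rewrites earlier segments as context grows (e.g. "For", "seven" later collapses into "47"), so a rewritten span gets re-emitted and would need de-duplication downstream:

```swift
// Sketch: emit each word the first time it appears in a partial result.
// When the recognizer rewrites earlier segments, the shared prefix with
// the previous partial shrinks, and the rewritten span is emitted again.
struct SegmentDiffer {
    private var previous: [String] = []

    // Returns the segments that were not present, at the same position,
    // in the last partial result we saw.
    mutating func newWords(in segments: [String]) -> [String] {
        // Length of the common prefix between the old and new segment lists.
        var common = 0
        while common < previous.count,
              common < segments.count,
              previous[common] == segments[common] {
            common += 1
        }
        previous = segments
        return Array(segments[common...])
    }
}
```

In the recognition callback this would be fed with `result.bestTranscription.segments.map { $0.substring }` on every partial result.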
import UIKit
import Speech

public class ViewController: UIViewController, SFSpeechRecognizerDelegate {

    private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
    private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
    private var recognitionTask: SFSpeechRecognitionTask?
    private let audioEngine = AVAudioEngine()

    @IBOutlet var textView: UITextView!
    @IBOutlet var recordButton: UIButton!

    private var isListening = false

    public override func viewDidLoad() {
        super.viewDidLoad()
        recordButton.isEnabled = false
        textView.isEditable = false
    }

    override public func viewDidAppear(_ animated: Bool) {
        super.viewDidAppear(animated)
        speechRecognizer.delegate = self

        SFSpeechRecognizer.requestAuthorization { authStatus in
            OperationQueue.main.addOperation {
                switch authStatus {
                case .authorized:
                    self.recordButton.isEnabled = true
                case .denied:
                    self.recordButton.isEnabled = false
                    self.recordButton.setTitle("User denied access to speech recognition", for: .disabled)
                case .restricted:
                    self.recordButton.isEnabled = false
                    self.recordButton.setTitle("Speech recognition restricted on this device", for: .disabled)
                case .notDetermined:
                    self.recordButton.isEnabled = false
                    self.recordButton.setTitle("Speech recognition not yet authorized", for: .disabled)
                default:
                    self.recordButton.isEnabled = false
                }
            }
        }
    }

    private func startRecording() throws {
        recognitionTask?.cancel()
        self.recognitionTask = nil

        let audioSession = AVAudioSession.sharedInstance()
        try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
        try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
        let inputNode = audioEngine.inputNode

        recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
        guard let recognitionRequest = recognitionRequest else {
            fatalError("Unable to create a SFSpeechAudioBufferRecognitionRequest object")
        }
        recognitionRequest.shouldReportPartialResults = true
        if #available(iOS 13, *) {
            recognitionRequest.requiresOnDeviceRecognition = false
        }

        recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
            var isFinal = false
            if let result = result {
                if self.isListening {
                    result.transcriptions.forEach { transcription in // Grab SFTranscription from result
                        transcription.segments.forEach { segment in
                            print(segment.substring)
                        }
                    }
                    print("---")
                }
                isFinal = result.isFinal
            }
            if error != nil || isFinal {
                self.audioEngine.stop()
                inputNode.removeTap(onBus: 0)
                self.recognitionRequest = nil
                self.recognitionTask = nil
                self.recordButton.isEnabled = true
                self.recordButton.setTitle("Start Recording", for: [])
            }
        }

        let recordingFormat = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
            self.recognitionRequest?.append(buffer)
        }

        audioEngine.prepare()
        try audioEngine.start()
        textView.text = "(Go ahead, I'm listening)"
    }

    public func speechRecognizer(_ speechRecognizer: SFSpeechRecognizer, availabilityDidChange available: Bool) {
        if available {
            recordButton.isEnabled = true
            recordButton.setTitle("Start Recording", for: [])
        } else {
            recordButton.isEnabled = false
            recordButton.setTitle("Recognition Not Available", for: .disabled)
        }
    }

    @IBAction func recordButtonTapped() {
        if audioEngine.isRunning {
            audioEngine.stop()
            recognitionRequest?.endAudio()
            recordButton.isEnabled = false
            recordButton.setTitle("Stopping", for: .disabled)
            self.isListening = false
        } else {
            do {
                try startRecording()
                recordButton.setTitle("Stop Recording", for: [])
                self.isListening = true
            } catch {
                recordButton.setTitle("Recording Not Available", for: [])
            }
        }
    }
}
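As an aside (my own addition, not part of Apple's sample): SFSpeechRecognitionRequest also exposes taskHint and contextualStrings, which can bias the recognizer toward short, discrete vocabulary. In my understanding this may reduce, but not eliminate, how aggressively partial results get merged into multi-digit numbers, so treat it as something to experiment with:

```swift
import Speech

// Sketch: extra request configuration that could go in startRecording(),
// right after the SFSpeechAudioBufferRecognitionRequest is created.
func configureForDigits(_ request: SFSpeechAudioBufferRecognitionRequest) {
    request.shouldReportPartialResults = true
    // .search hints at short, query-like utterances rather than dictation.
    request.taskHint = .search
    // Bias recognition toward the words we actually expect to hear.
    request.contextualStrings = ["zero", "one", "two", "three", "four",
                                 "five", "six", "seven", "eight", "nine", "ten"]
}
```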
Example output for saying "4, 7, 5, 5, 4, 3" - each "---"-delimited block represents all segments in one returned transcription.
For
---
For
seven
---
47
---
475
---
4
7
5
---
4755
---
47554
---
475543
---
475543
---
4
7
5
5
4
3
I can handle the spelled-out responses (e.g. "For" for "4") pretty easily with a function, but the long concatenated number strings are what's fouling me up. I want to grab them before they get concatenated, and not have to wait until the very end when the recognizer eventually separates them into individual segments.
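For what it's worth, the spelled-out-word step I mentioned can be sketched with Foundation's NumberFormatter in its .spellOut style, which parses "four" back into 4; the homophone map ("For", "to", "ate") is my own guess at what the recognizer tends to emit:

```swift
import Foundation

// Sketch of the "spelled-out word -> digit" step: normalize common
// homophones, then let NumberFormatter's .spellOut style parse the word.
func digitValue(of word: String) -> Int? {
    // Hypothetical homophone fixes for what the recognizer emits.
    let homophones = ["for": "four", "to": "two", "too": "two", "ate": "eight"]
    let normalized = word.lowercased()
    let candidate = homophones[normalized] ?? normalized

    let formatter = NumberFormatter()
    formatter.numberStyle = .spellOut
    formatter.locale = Locale(identifier: "en_US") // match the recognizer locale
    if let n = formatter.number(from: candidate) {
        return n.intValue
    }
    // Fall back for segments that are already digit strings, e.g. "7".
    return Int(candidate)
}
```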
Thanks for any help!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow