iOS (Swift) Speech Transcription - capturing last word/number during continuous transcription

I am struggling a bit with a Speech app I am working on. I followed the developer example for creating a speech recognizer on Apple's developer website, and my code is below. It is working well, and I am getting continuous recognition as expected.

However, my app idea requires me to capture each number in a series of numbers as it is spoken. With the code below, I can successfully speak a long series of numbers (e.g. "2, 5, 3, 7, 10, 6...") and, once I stop, it will eventually return an SFTranscription array whose transcriptions hold a segment for each number I spoke. I say "eventually" because the speech recognizer is constantly trying to turn what it hears into an intelligible human-language result or format (in this case phone numbers, larger multi-digit numbers, etc.), which is what it should do for dictation. But I would like to get each word (number) as it is said, before the recognizer tries to make sense of it. Is there a way to grab the last word before the recognizer attempts to relate it to all the words that came before?

import UIKit
import Speech

public class ViewController: UIViewController, SFSpeechRecognizerDelegate {
      
    private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
    private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
    private var recognitionTask: SFSpeechRecognitionTask?
    private let audioEngine = AVAudioEngine()
    @IBOutlet var textView: UITextView!
    @IBOutlet var recordButton: UIButton!
    
    private var isListening = false
    
    public override func viewDidLoad() {
        super.viewDidLoad()
        recordButton.isEnabled = false
        textView.isEditable = false
    }
    
    override public func viewDidAppear(_ animated: Bool) {
        super.viewDidAppear(animated)
        speechRecognizer.delegate = self
        SFSpeechRecognizer.requestAuthorization { authStatus in
            OperationQueue.main.addOperation {
                switch authStatus {
                case .authorized:
                    self.recordButton.isEnabled = true
                case .denied:
                    self.recordButton.isEnabled = false
                    self.recordButton.setTitle("User denied access to speech recognition", for: .disabled)
                case .restricted:
                    self.recordButton.isEnabled = false
                    self.recordButton.setTitle("Speech recognition restricted on this device", for: .disabled)
                case .notDetermined:
                    self.recordButton.isEnabled = false
                    self.recordButton.setTitle("Speech recognition not yet authorized", for: .disabled)
                default:
                    self.recordButton.isEnabled = false
                }
            }
        }
    }
    
    private func startRecording() throws {
        recognitionTask?.cancel()
        self.recognitionTask = nil
        let audioSession = AVAudioSession.sharedInstance()
        try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
        try audioSession.setActive(true, options: .notifyOthersOnDeactivation)
        let inputNode = audioEngine.inputNode
        recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
        guard let recognitionRequest = recognitionRequest else { fatalError("Unable to create a SFSpeechAudioBufferRecognitionRequest object") }
        recognitionRequest.shouldReportPartialResults = true
        if #available(iOS 13, *) {
            recognitionRequest.requiresOnDeviceRecognition = false
        }
        recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
            var isFinal = false
            if let result = result {
                
                if self.isListening {
                    result.transcriptions.forEach { transcription in  // Grab SFTranscription from result
                        transcription.segments.forEach { segment in
                            print( segment.substring )
                        }
                    }
                    print("---")
                }

                isFinal = result.isFinal
            }
            if error != nil || isFinal {
                self.audioEngine.stop()
                inputNode.removeTap(onBus: 0)
                self.recognitionRequest = nil
                self.recognitionTask = nil
                self.recordButton.isEnabled = true
                self.recordButton.setTitle("Start Recording", for: [])
            }
        }
        let recordingFormat = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
            self.recognitionRequest?.append(buffer)
        }
        audioEngine.prepare()
        try audioEngine.start()
        textView.text = "(Go ahead, I'm listening)"
    }
    public func speechRecognizer(_ speechRecognizer: SFSpeechRecognizer, availabilityDidChange available: Bool) {
        if available {
            recordButton.isEnabled = true
            recordButton.setTitle("Start Recording", for: [])
        } else {
            recordButton.isEnabled = false
            recordButton.setTitle("Recognition Not Available", for: .disabled)
        }
    }
    @IBAction func recordButtonTapped() {
        if audioEngine.isRunning {
            audioEngine.stop()
            recognitionRequest?.endAudio()
            recordButton.isEnabled = false
            recordButton.setTitle("Stopping", for: .disabled)
            self.isListening = false
        } else {
            do {
                try startRecording()
                recordButton.setTitle("Stop Recording", for: [])
                self.isListening = true
            } catch {
                recordButton.setTitle("Recording Not Available", for: [])
            }
        }
    }
}

Example output for saying "4, 7, 5, 5, 4, 3". Each block delimited by "---" holds all segments from one returned result; the last block is the final result.

For
---
For
seven
---
47
---
475
---
4
7
5
---
4755
---
47554
---
475543
---
475543
---
4
7
5
5
4
3
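
One possible way to read the blocks above, offered only as a minimal sketch and not as anything from Apple's sample: if the segments of each partial result are joined into one digit string, every new string in this output extends the previous one, so the newly spoken number is the added suffix. The `PartialResultTracker` type and its `newDigits(in:)` method are hypothetical names introduced here, and the sketch assumes spelled-out words such as "For" have already been mapped to digits.

```swift
import Foundation

/// Hypothetical helper: remembers the digits already seen and returns
/// only what a new partial result adds.
struct PartialResultTracker {
    private(set) var seen = ""

    /// `flattened` is all segments of one partial result joined into one digit string.
    /// Returns the newly appended digits, or nil when nothing new arrived or the
    /// recognizer revised earlier text so the old string is no longer a prefix.
    mutating func newDigits(in flattened: String) -> String? {
        guard flattened.starts(with: seen) else {
            // Earlier output was rewritten; resynchronise without emitting anything.
            seen = flattened
            return nil
        }
        let suffix = String(flattened.dropFirst(seen.count))
        seen = flattened
        return suffix.isEmpty ? nil : suffix
    }
}

// The blocks from the example output, with "For" / "For seven" assumed
// to be already normalised to "4" / "47".
let blocks = ["4", "47", "47", "475", "475", "4755", "47554", "475543", "475543", "475543"]
var tracker = PartialResultTracker()
var emitted: [String] = []
for block in blocks {
    if let digits = tracker.newDigits(in: block) {
        emitted.append(digits)
    }
}
print(emitted) // ["4", "7", "5", "5", "4", "3"]
```

In the recognition handler, the input string could come from joining the `substring` of each segment in `result.bestTranscription.segments`. A multi-digit number such as "10" would only come out whole if the recognizer adds it in a single partial result, and a revision of earlier text makes the tracker skip that update, so this is a heuristic rather than a guarantee.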

I can handle the spelled-out responses (e.g. "For" for "4") pretty easily with a function, but the long concatenated number strings are what's fouling me up. I want to grab the numbers before they get concatenated, and not have to wait until the very end, when the recognizer eventually separates them into individual segments.
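
For illustration, the kind of function mentioned above might look like the following minimal sketch. The `spokenDigitWords` table and the `normalizedNumber(from:)` name are hypothetical, and the word list is a guess at what the recognizer could return, not a verified or exhaustive set.

```swift
import Foundation

// Hypothetical lookup table: words the recognizer might return for a spoken number.
// The entries are illustrative assumptions only.
let spokenDigitWords: [String: String] = [
    "zero": "0", "oh": "0",
    "one": "1", "won": "1",
    "two": "2", "to": "2", "too": "2",
    "three": "3",
    "four": "4", "for": "4",
    "five": "5",
    "six": "6",
    "seven": "7",
    "eight": "8", "ate": "8",
    "nine": "9",
    "ten": "10"
]

/// Returns a digit string for one segment substring, or nil if the text is
/// neither purely numeric nor a word found in the table.
func normalizedNumber(from substring: String) -> String? {
    let cleaned = substring
        .trimmingCharacters(in: .whitespacesAndNewlines)
        .lowercased()
    if cleaned.isEmpty { return nil }
    // Already numeric (e.g. "475"): pass it through unchanged.
    if cleaned.allSatisfy({ $0.isASCII && $0.isNumber }) { return cleaned }
    return spokenDigitWords[cleaned]
}
```

Calling `normalizedNumber(from: segment.substring)` inside the segment loop would turn "For" into "4" and leave "475" untouched; it does nothing about the concatenation itself.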

Thanks for any help!



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
