Speech to text Java web app for live caption possible? [closed]

This is regarding the Google Speech-to-Text API.

I want to develop a Spring Boot Java web app that works as follows:

  1. The app is launched on localhost.
  2. I open a browser at http://localhost:8080.
  3. The app displays a simple UI: a main window that shows live, real-time captions for any English audio coming from the laptop speaker. For example, on a Zoom video call I hear the participants speaking and also see the live captions in my local web app.
  4. The live captions remain on screen in a window with a scrollbar.
  5. The live captions are saved to a text file, with new captions appended as they arrive.

It is critical that the captions be as accurate as possible and appear quickly while the person is speaking.

Can this be achieved? If it is not possible with the Google API, what alternative API is there?



Solution 1:[1]

If I understand you correctly, I would split this into two parts:

  1. Transcribe the speech to text, as in the Google API sample below.

  2. Then display the captions as a stream overlay.

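    // Imports needed by the method below (place the method inside a class).
    import com.google.api.gax.rpc.ApiStreamObserver;
    import com.google.api.gax.rpc.BidiStreamingCallable;
    import com.google.cloud.speech.v1.RecognitionConfig;
    import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;
    import com.google.cloud.speech.v1.SpeechClient;
    import com.google.cloud.speech.v1.SpeechRecognitionAlternative;
    import com.google.cloud.speech.v1.StreamingRecognitionConfig;
    import com.google.cloud.speech.v1.StreamingRecognitionResult;
    import com.google.cloud.speech.v1.StreamingRecognizeRequest;
    import com.google.cloud.speech.v1.StreamingRecognizeResponse;
    import com.google.common.util.concurrent.SettableFuture;
    import com.google.protobuf.ByteString;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    
    /**
     * Performs streaming speech recognition on raw PCM audio data.
     *
     * @param fileName the path to a PCM audio file to transcribe.
     */
    public static void streamingRecognizeFile(String fileName) throws Exception {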
    Path path = Paths.get(fileName);
    byte[] data = Files.readAllBytes(path);
    
    // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
    try (SpeechClient speech = SpeechClient.create()) {
    
    // Configure request with local raw PCM audio
    RecognitionConfig recConfig =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            .setSampleRateHertz(16000)
            .setModel("default")
            .build();
    StreamingRecognitionConfig config =
        StreamingRecognitionConfig.newBuilder().setConfig(recConfig).build();
    
    class ResponseApiStreamingObserver<T> implements ApiStreamObserver<T> {
      private final SettableFuture<List<T>> future = SettableFuture.create();
      private final List<T> messages = new java.util.ArrayList<T>();
    
      @Override
      public void onNext(T message) {
        messages.add(message);
      }
    
      @Override
      public void onError(Throwable t) {
        future.setException(t);
      }
    
      @Override
      public void onCompleted() {
        future.set(messages);
      }
    
      // Returns the SettableFuture object to get received messages / exceptions.
      public SettableFuture<List<T>> future() {
        return future;
      }
    }
    
    ResponseApiStreamingObserver<StreamingRecognizeResponse> responseObserver =
        new ResponseApiStreamingObserver<>();
    
    BidiStreamingCallable<StreamingRecognizeRequest, StreamingRecognizeResponse> callable =
        speech.streamingRecognizeCallable();
    
    ApiStreamObserver<StreamingRecognizeRequest> requestObserver =
        callable.bidiStreamingCall(responseObserver);
    
    // The first request must **only** contain the audio configuration:
    requestObserver.onNext(
        StreamingRecognizeRequest.newBuilder().setStreamingConfig(config).build());
    
    // Subsequent requests must **only** contain the audio data.
    requestObserver.onNext(
        StreamingRecognizeRequest.newBuilder()
            .setAudioContent(ByteString.copyFrom(data))
            .build());
    
    // Mark transmission as completed after sending the data.
    requestObserver.onCompleted();
    
    List<StreamingRecognizeResponse> responses = responseObserver.future().get();
    
    for (StreamingRecognizeResponse response : responses) {
      // For streaming recognize, the results list has one is_final result (if available) followed
      // by a number of in-progress results (if interim_results is true) for subsequent utterances.
      // Skip responses that carry no results, then just print the first result.
      if (response.getResultsCount() == 0) {
        continue;
      }
      StreamingRecognitionResult result = response.getResultsList().get(0);
      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here, if there is one.
      if (result.getAlternativesCount() == 0) {
        continue;
      }
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
      System.out.printf("Transcript : %s%n", alternative.getTranscript());
    }
    } // end of try-with-resources
    } // end of streamingRecognizeFile
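
The sample above sends the whole file in a single request. For live captions the audio would instead be sent as a series of small chunks, each in its own `StreamingRecognizeRequest`. Here is a minimal sketch of the chunking step only, assuming 16-bit mono PCM at 16 kHz as in the config above. The class name `PcmChunker` and the 100 ms chunk duration are illustrative choices, not part of the Google sample, and how the audio is captured is outside this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: splits raw PCM audio into small chunks for streaming. */
final class PcmChunker {

    /** Bytes per chunk for 16-bit (LINEAR16) mono audio at the given rate and duration. */
    static int chunkSizeBytes(int sampleRateHertz, int chunkMillis) {
        int bytesPerSample = 2; // assumption: 16-bit samples, one channel
        return sampleRateHertz * bytesPerSample * chunkMillis / 1000;
    }

    /** Splits the audio into consecutive chunks; the last chunk may be shorter. */
    static List<byte[]> split(byte[] audio, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < audio.length; offset += chunkSize) {
            int end = Math.min(audio.length, offset + chunkSize);
            chunks.add(Arrays.copyOfRange(audio, offset, end));
        }
        return chunks;
    }
}
```

Each chunk would then be passed to `requestObserver.onNext(...)` with `setAudioContent(ByteString.copyFrom(chunk))`, in place of the single call that sends `data`.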
    

For a voice overlay on mobile (Android):

https://github.com/algolia/voice-overlay-android

For an HTML5 overlay on the web:


    <video id="video" controls preload="metadata">
       <source src="video/sintel-short.mp4" type="video/mp4">
       <source src="video/sintel-short.webm" type="video/webm">
       <track label="English" kind="subtitles" srclang="en" src="captions/vtt/sintel-en.vtt" default>
       <track label="Deutsch" kind="subtitles" srclang="de" src="captions/vtt/sintel-de.vtt">
       <track label="Español" kind="subtitles" srclang="es" src="captions/vtt/sintel-es.vtt">
    </video>

    // Per the sample linked above, you can feed / append the captions.
    // `video`, `videoContainer` and `createMenuItem` are not defined in this excerpt;
    // they are assumed to come from the rest of that sample.
    var subtitlesMenu;
    if (video.textTracks) {
       var df = document.createDocumentFragment();
       subtitlesMenu = df.appendChild(document.createElement('ul'));
       subtitlesMenu.className = 'subtitles-menu';
       subtitlesMenu.appendChild(createMenuItem('subtitles-off', '', 'Off'));
       for (var i = 0; i < video.textTracks.length; i++) {
          subtitlesMenu.appendChild(createMenuItem('subtitles-' + video.textTracks[i].language, video.textTracks[i].language, video.textTracks[i].label));
       }
       videoContainer.appendChild(subtitlesMenu);
    }
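
The `<track>` elements above point at `.vtt` (WebVTT) files. If the server side wanted to write recognized captions in that format, a cue could be built roughly as follows. This is only a sketch: the class `WebVttCues`, its method names, and the idea of appending cues to a file are assumptions made for illustration and are not part of the linked sample. It also assumes the caption text contains no blank lines and no `-->` sequence.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Locale;

/** Hypothetical helper: formats captions as WebVTT cues and appends them to a file. */
final class WebVttCues {

    /** Formats milliseconds as a WebVTT timestamp, hh:mm:ss.ttt. */
    static String timestamp(long millis) {
        long hours = millis / 3_600_000;
        long minutes = (millis / 60_000) % 60;
        long seconds = (millis / 1_000) % 60;
        long ms = millis % 1_000;
        return String.format(Locale.ROOT, "%02d:%02d:%02d.%03d", hours, minutes, seconds, ms);
    }

    /** Builds one cue: a timing line, the caption text, and a blank separator line. */
    static String formatCue(long startMillis, long endMillis, String text) {
        return timestamp(startMillis) + " --> " + timestamp(endMillis) + "\n" + text + "\n\n";
    }

    /** Appends a cue to the file, writing the WEBVTT header first if the file is new or empty. */
    static void appendCue(Path file, long startMillis, long endMillis, String text)
            throws IOException {
        if (Files.notExists(file) || Files.size(file) == 0) {
            Files.write(file, "WEBVTT\n\n".getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
        Files.write(file, formatCue(startMillis, endMillis, text).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

The same append-only pattern would also cover the plain text file from the question: write each final transcript with `StandardOpenOption.APPEND` as it arrives.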

Solution 2:[2]

One of the fastest and most efficient ways to convert speech to text is the Java Speech API (documentation: https://www.oracle.com/java/technologies/speech-api-frequently-asked-questions.html).

During conversion you will need to break the text into pieces. This can shift the meaning slightly, because some expressions mean something different as a whole than their individual words do, but it helps reduce the time needed for the final translation. Then send the segments you have already received (words, phrases) to a translation API.
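
As a rough illustration of breaking text into pieces, the JDK's `java.text.BreakIterator` can split a transcript into sentences before each one is sent on. The class name `SentenceSplitter` is made up for this sketch, and splitting on sentence boundaries is only one possible granularity (words or phrases are others).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: splits English text into sentences. */
final class SentenceSplitter {

    static List<String> split(String text) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.US);
        boundaries.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        // Walk consecutive boundary pairs; each pair delimits one sentence.
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```

Smaller segments reach the next API sooner but carry less context, which is the trade-off described above.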

You can pick several options (for example, see https://rapidapi.com/blog/best-translation-api/) and check which one works fastest. In my experience, "Microsoft Translator Text" and "Google Translate" are among the fastest. I do not think you will get instant translation, but if you test several APIs and experiment with sending whole sentences, phrases, or individual words at a time, you can reduce the translation time to a minimum.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 qqNade