
How to sync Amazon Polly speech and time stamps?


Hey guys, I am creating a reader for Dutch eBooks. I generate the speech and time stamps in two separate requests.

My time stamps are always out of sync, and the drift gets progressively worse toward the final sentences. I'm guessing the cause is variability in the neural voice engine's output, but unfortunately there is no way to get both in one request.

Any advice on how I can sync time stamps with audio?

Here are the two requests.

import boto3

polly = boto3.client("polly")
# text holds the SSML document to read (defined elsewhere)

# Generate speech
audio_response = polly.synthesize_speech(
    Text=text,
    TextType="ssml",
    OutputFormat="mp3",
    VoiceId="Laura",
    LanguageCode="nl-NL",
    # LanguageCode="en-GB",
    Engine="neural",
)

# Generate time stamps by sentence
marks_response = polly.synthesize_speech(
    Text=text,
    TextType="ssml",
    OutputFormat="json",
    SpeechMarkTypes=["sentence"],
    VoiceId="Laura",
    LanguageCode="nl-NL",
    Engine="neural",
)

  • Can you provide input text that reproduces this? Speech marks should align with the audio when all parameters are the same, regardless of the engine.

asked a year ago · 200 views
1 Answer

Unfortunately, it's not possible to generate both speech audio and speech marks (timestamps) in a single API call with Amazon Polly. This limitation can indeed lead to synchronization issues, especially when using neural voices which may have more variability in their output.

To improve synchronization between the audio and timestamps, you can try the following approaches:

  1. Use the same voice and engine settings for both requests. Ensure that you're using identical parameters (VoiceId, LanguageCode, Engine, etc.) for both the audio generation and speech marks generation calls.

  2. Consider using the standard voice engine instead of the neural engine. While the audio quality might be slightly lower, the standard voices tend to have more consistent timing, which could lead to better synchronization.

  3. If possible, break your text into smaller chunks and process them separately. This can help minimize the cumulative timing differences that occur over longer passages.

  4. Implement a post-processing step to adjust the timestamps based on the actual audio duration. You could use audio analysis tools to detect pauses or specific phonemes in the audio and align the timestamps accordingly.

  5. If your use case allows, you might consider using Amazon Polly's asynchronous synthesis functionality for longer texts. This method processes the entire text as a single unit, which might lead to more consistent timing between the audio and speech marks.
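
As a rough illustration of point 1, the sketch below builds both requests from one shared parameter set so they can only differ in their output settings, and parses the speech marks, which Polly returns as one JSON object per line. The helper names (`SHARED_PARAMS`, `build_requests`, `parse_speech_marks`, `synthesize`) are made up for this example, and `polly` is assumed to be a boto3 Polly client like the one in the question.

```python
import json

# Parameters shared by both requests; values taken from the question.
SHARED_PARAMS = {
    "TextType": "ssml",
    "VoiceId": "Laura",
    "LanguageCode": "nl-NL",
    "Engine": "neural",
}


def build_requests(ssml):
    """Return (audio_kwargs, marks_kwargs) that differ only in output settings."""
    audio = dict(SHARED_PARAMS, Text=ssml, OutputFormat="mp3")
    marks = dict(
        SHARED_PARAMS,
        Text=ssml,
        OutputFormat="json",
        SpeechMarkTypes=["sentence"],
    )
    return audio, marks


def parse_speech_marks(raw):
    """Parse speech mark output: one JSON object per non-empty line."""
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def synthesize(polly, ssml):
    """Run both requests; polly is assumed to be a boto3 Polly client."""
    audio_kwargs, marks_kwargs = build_requests(ssml)
    audio = polly.synthesize_speech(**audio_kwargs)["AudioStream"].read()
    raw_marks = polly.synthesize_speech(**marks_kwargs)["AudioStream"].read()
    return audio, parse_speech_marks(raw_marks)
```

Each parsed mark is a dict whose `time` field is the offset in milliseconds from the start of the audio, so `synthesize(polly, text)` would give the MP3 bytes plus a list that can be compared directly against playback position.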

Remember that perfect synchronization may be challenging to achieve, especially with neural voices. You may need to implement some level of tolerance or adjustment in your application to account for slight timing discrepancies.
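
The chunking and tolerance ideas above could be combined roughly as follows. This is a minimal sketch, not a tested recipe: `merge_chunk_marks` and `active_mark` are hypothetical helpers, each mark is assumed to be a dict with a `time` field in milliseconds, and each chunk's `duration_ms` is assumed to be measured by the caller from the audio that was actually produced for that chunk.

```python
from bisect import bisect_right


def merge_chunk_marks(chunks):
    """Shift each chunk's marks by the total duration of the chunks before it.

    chunks: list of (marks, duration_ms) tuples in playback order.
    """
    merged, offset = [], 0
    for marks, duration_ms in chunks:
        merged.extend({**m, "time": m["time"] + offset} for m in marks)
        offset += duration_ms
    return merged


def active_mark(marks, position_ms, tolerance_ms=0):
    """Return the last mark starting at or before position_ms + tolerance_ms.

    Returns None when playback has not reached the first mark yet.
    """
    times = [m["time"] for m in marks]
    index = bisect_right(times, position_ms + tolerance_ms)
    return marks[index - 1] if index else None
```

Because every chunk restarts its own clock at zero, any drift inside one chunk cannot carry over into the next. The `tolerance_ms` argument lets the reader highlight a sentence slightly early instead of lagging behind the audio.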
Sources
AWS Polly Speech Generation: Is it possible to produce speech plus speech marks in one call? | AWS re:Post
Amazon Polly Features
Long audio files - Amazon Polly

answered a year ago
EXPERT
reviewed a year ago
