AWS Polly ignoring SSML parameter <prosody duration> causing audio synchronization issues.

I have been working on a process to generate translated audio for a training video. The steps are:

  1. Create a subtitle file using speech to text.
  2. Create a second SRT with continuous cues (zero break time between lines), adjusting the durations to maintain the total length.
  3. Confirm that both the original and the adjusted SRT are in sync with the original audio
  4. Translate subtitles
  5. Confirm sync with audio
  6. Convert to SSML
  7. Convert to audio using text to speech
  8. Pull hair out because audio is not synced

My issue is that the final audio is not synchronized with the original. If I use the original SRT, the audio comes out 5 minutes short with both Neural and Standard processing. The SSML looks like this:

    <prosody duration="3171ms">Our agenda. We will go over the application and then.</prosody><break time="9ms"/>
    <prosody duration="2230ms">I will show you how to navigate through it.</prosody><break time="300ms"/>
    <prosody duration="2150ms">I will show you how to report an incident</prosody><break time="280ms"/>

This is not too bad, and a workaround may be to stretch the SRT by 5 minutes and hope that this minimizes the sync issues throughout the audio. But this will likely make the translations worse, since the timing is tied to English sentence lengths.

If I use the preferred method, which relies on duration only in order to compensate for variable sentence lengths after translation, the audio is 26 minutes short, which is unusable. The SSML looks like this:

    <prosody duration="3180ms">Our agenda. We will go over the application and then</prosody>
    <prosody duration="2530ms">I will show you how to navigate through it.</prosody>
    <prosody duration="2430ms">I will show you how to report an incident,</prosody>

For both SSML files the total time, duration + break, is correct. This means that the text to speech is ignoring the duration parameter.
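
The "duration + break" check described above can be reproduced with a short script. This is only a sketch: it assumes every value is written in whole milliseconds exactly as in the samples in this question, and the function name `total_ms` is made up for illustration.

```python
import re

# Patterns match the exact attribute style used in the question's SSML
# (integer milliseconds only); other unit formats are not handled.
DURATION_RE = re.compile(r'<prosody\s+duration="(\d+)ms">')
BREAK_RE = re.compile(r'<break\s+time="(\d+)ms"\s*/>')


def total_ms(ssml: str) -> dict:
    """Sum the requested speech durations and break times in an SSML string."""
    speech = sum(int(ms) for ms in DURATION_RE.findall(ssml))
    breaks = sum(int(ms) for ms in BREAK_RE.findall(ssml))
    return {"speech_ms": speech, "break_ms": breaks, "total_ms": speech + breaks}


sample = (
    '<prosody duration="3171ms">Our agenda. We will go over the application and then.</prosody>'
    '<break time="9ms"/> '
    '<prosody duration="2230ms">I will show you how to navigate through it.</prosody>'
    '<break time="300ms"/> '
    '<prosody duration="2150ms">I will show you how to report an incident</prosody>'
    '<break time="280ms"/>'
)

print(total_ms(sample))
# {'speech_ms': 7551, 'break_ms': 589, 'total_ms': 8140}
```

Comparing `total_ms` against the length of the original audio confirms whether the SSML itself asks for the right amount of time, independent of what the synthesizer does with it.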

Does Polly completely ignore the duration parameter? The total break time adds up to the 26 mins, which means that the break time is likely being correctly read but not the duration.

Is there any way to get a better result?

Thanks

asked 6 months ago, 191 views
1 Answer
Hi. Polly does not support the duration attribute of the <prosody> tag; please consult the documentation: https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html#prosody-tag

It does, however, support rate, which you can use to slow down or speed up the synthesis. The percentage format should hopefully give you enough precision to achieve what you need.
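
As a rough illustration of the rate approach, the sketch below turns a target cue length into a rate percentage. It assumes you have already measured how long the sentence takes at the default rate (for example by synthesizing it once and reading the audio length; that step is not shown). The helper names are hypothetical, and the 20% to 200% clamp is my reading of the allowed range, so please check it against the prosody documentation linked above. Speech length does not scale perfectly with rate, so treat the result as a first approximation.

```python
from xml.sax.saxutils import escape


def rate_percent(natural_ms: float, target_ms: float, lo: int = 20, hi: int = 200) -> int:
    """Rate (in percent) that should bring natural_ms close to target_ms.

    100 means unchanged speed; values above 100 speak faster.
    lo/hi are assumed limits for the rate attribute, not verified here.
    """
    if natural_ms <= 0 or target_ms <= 0:
        raise ValueError("durations must be positive")
    raw = round(100 * natural_ms / target_ms)
    return max(lo, min(hi, raw))


def wrap_with_rate(text: str, natural_ms: float, target_ms: float) -> str:
    """Wrap one cue's text in a <prosody rate="N%"> element."""
    pct = rate_percent(natural_ms, target_ms)
    return f'<prosody rate="{pct}%">{escape(text)}</prosody>'


# A sentence that naturally takes 2000 ms but must fill a 2500 ms cue
# needs to be slowed to 80% of the default rate.
print(wrap_with_rate("I will show you how to navigate through it.", 2000, 2500))
```

If the computed value hits the clamp, the cue cannot be matched by rate alone, and the remainder would have to be absorbed by a break or by rewording the translation.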

You can also consider using <prosody amazon:max-duration>, but it's only available for Standard voices: https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html#maxduration-tag
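
A minimal sketch of what generating that SSML from subtitle cues might look like is below. The cue tuples and function names are made up for illustration, and the millisecond unit on amazon:max-duration is an assumption to verify against the linked documentation. Note that this attribute only caps the length: a sentence that is naturally shorter than its cue is not stretched, so any slack still has to come from a break.

```python
from xml.sax.saxutils import escape


def cue_to_ssml(text: str, max_ms: int, break_ms: int = 0) -> str:
    """Render one cue as a max-duration prosody element plus an optional break."""
    part = f'<prosody amazon:max-duration="{max_ms}ms">{escape(text)}</prosody>'
    if break_ms > 0:
        part += f'<break time="{break_ms}ms"/>'
    return part


def build_ssml(cues) -> str:
    """Join (text, max_ms, break_ms) cues into a single <speak> document."""
    return "<speak>" + "".join(cue_to_ssml(*cue) for cue in cues) + "</speak>"


# Hypothetical cues taken from the timings in the question.
cues = [
    ("Our agenda. We will go over the application and then.", 3171, 9),
    ("I will show you how to navigate through it.", 2230, 300),
    ("I will show you how to report an incident", 2150, 280),
]
print(build_ssml(cues))
```

The resulting string would be sent as SSML text with a Standard voice, since the answer above notes that the attribute is not available for Neural voices.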

AWS
TB
answered 6 months ago