AWS Polly ignoring SSML parameter <prosody duration> causing audio synchronization issues.

I have been working on a process to generate translated audio for a training video. The steps are:

  1. Create subtitle file using speech to text.
  2. Create a second SRT with continuous timing (zero break time between lines), adjusting each duration to maintain the total length.
  3. Confirm both the original and the adjusted SRT are in sync with the original audio
  4. Translate subtitles
  5. Confirm sync with audio
  6. Convert to SSML
  7. Convert to audio using text to speech
  8. Pull hair out because audio is not synced
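
For readers following along, step 2 can be sketched roughly as below. This is only an illustration under my own assumptions: cues are already parsed into (start_ms, end_ms, text) tuples, and the name make_continuous is hypothetical, not from any library. The timings mirror the SSML samples further down, and the fourth cue is invented to close the third gap.

```python
def make_continuous(cues):
    """Stretch each cue so it ends exactly where the next one starts.

    cues: list of (start_ms, end_ms, text) tuples sorted by start time.
    The last cue keeps its original end, so the total length is unchanged.
    """
    out = []
    for i, (start, end, text) in enumerate(cues):
        if i + 1 < len(cues):
            end = cues[i + 1][0]  # absorb the gap into this cue's duration
        out.append((start, end, text))
    return out


cues = [
    (0, 3171, "Our agenda. We will go over the application and then."),
    (3180, 5410, "I will show you how to navigate through it."),
    (5710, 7860, "I will show you how to report an incident"),
    (8140, 10000, "(hypothetical next line)"),
]
continuous = make_continuous(cues)
print([end - start for start, end, _ in continuous])  # [3180, 2530, 2430, 1860]
```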

My issue is that the final audio is not synchronized with the original. If I use the original SRT, the audio is short by 5 minutes with both Neural and Standard voices. The SSML looks like this:

    <prosody duration="3171ms">Our agenda. We will go over the application and then.</prosody><break time="9ms"/>
    <prosody duration="2230ms">I will show you how to navigate through it.</prosody><break time="300ms"/>
    <prosody duration="2150ms">I will show you how to report an incident</prosody><break time="280ms"/>

This is not too bad, and a workaround may be to stretch the SRT by 5 minutes and hope that minimizes the sync issues throughout the audio. But this will likely make the translations worse, since the timing is tied to the English sentence lengths.

If I use my preferred method, which relies on duration alone to compensate for the variable sentence lengths after translation, the audio is short by 26 minutes, which is unusable. The SSML looks like this:

    <prosody duration="3180ms">Our agenda. We will go over the application and then</prosody>
    <prosody duration="2530ms">I will show you how to navigate through it.</prosody>
    <prosody duration="2430ms">I will show you how to report an incident,</prosody>

For both SSML files the total time (duration + break) is correct, which suggests that the text to speech is ignoring the duration parameter.
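
A quick way to reproduce that total is to sum the millisecond values straight out of the SSML. The sketch below assumes every value is written with an "ms" suffix, as in the samples above; the name ssml_totals_ms is my own.

```python
import re

# Only millisecond values ("3171ms") are handled, matching the samples in this post.
PROSODY_MS = re.compile(r'<prosody\b[^>]*\bduration="(\d+)ms"')
BREAK_MS = re.compile(r'<break\b[^>]*\btime="(\d+)ms"')


def ssml_totals_ms(ssml: str) -> dict:
    """Return the summed prosody durations, break times, and their total in ms."""
    prosody = sum(int(v) for v in PROSODY_MS.findall(ssml))
    breaks = sum(int(v) for v in BREAK_MS.findall(ssml))
    return {"prosody": prosody, "break": breaks, "total": prosody + breaks}


sample = (
    '<prosody duration="3171ms">Our agenda. We will go over the application and then.</prosody>'
    '<break time="9ms"/> '
    '<prosody duration="2230ms">I will show you how to navigate through it.</prosody>'
    '<break time="300ms"/> '
    '<prosody duration="2150ms">I will show you how to report an incident</prosody>'
    '<break time="280ms"/>'
)
print(ssml_totals_ms(sample))  # {'prosody': 7551, 'break': 589, 'total': 8140}
```

Comparing "total" with the length of the source audio, and "break" with the length of the synthesized audio, shows which part the synthesizer actually honored.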

Does Polly completely ignore the duration parameter? The total break time adds up to the missing 26 minutes, which suggests the break times are being read correctly but the durations are not.

Is there any way to get a better result?

Thanks

Asked 7 months ago · 207 views
1 answer
Hi. Polly does not support the duration attribute of the <prosody> tag. Please consult the documentation: https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html#prosody-tag

It does however support rate, which you can use to slow down or speed up the synthesis - the percentage format should hopefully give you enough precision to achieve what you need.
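
One possible way to pick the percentage, as a rough sketch: synthesize each line once at the default rate, measure how long it came out, and scale by the ratio of measured to target length. This assumes duration falls roughly in inverse proportion to rate, which is only approximate in practice. The helper names are hypothetical, and the 20-200% clamp is the range I understand the documentation to give for rate.

```python
from xml.sax.saxutils import escape

# Percentage range for rate as I read the Polly docs (an assumption; check the page above).
MIN_RATE, MAX_RATE = 20, 200


def rate_percent(natural_ms: float, target_ms: float) -> int:
    """Rate (100 = unchanged) that should bring natural_ms close to target_ms."""
    if natural_ms <= 0 or target_ms <= 0:
        raise ValueError("durations must be positive")
    raw = 100.0 * natural_ms / target_ms  # speaking faster shortens the audio
    return max(MIN_RATE, min(MAX_RATE, round(raw)))


def prosody_rate(text: str, natural_ms: float, target_ms: float) -> str:
    """Wrap text in a <prosody rate="N%"> tag sized for the target duration."""
    pct = rate_percent(natural_ms, target_ms)
    return f'<prosody rate="{pct}%">{escape(text)}</prosody>'


# A line that took 2800 ms at the default rate but has to fit a 2230 ms subtitle slot.
print(prosody_rate("I will show you how to navigate through it.", 2800, 2230))
```

When the clamp kicks in, the line cannot be made to fit by rate alone, so the translated text probably needs shortening or the slot needs widening.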

You can also consider using <prosody amazon:max-duration>, but it's only available for Standard voices: https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html#maxduration-tag
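
As an illustration of the second option, the sketch below rebuilds the SSML from subtitle cues, using amazon:max-duration for each line and a break for each gap. It assumes cues are (start_ms, end_ms, text) tuples and that millisecond values are accepted for this attribute; cues_to_ssml is a hypothetical helper name. Note that a maximum duration can only speed speech up, so lines that come out shorter than their slot will still end early.

```python
from xml.sax.saxutils import escape


def cues_to_ssml(cues):
    """Build SSML from (start_ms, end_ms, text) cues sorted by start time."""
    parts = ["<speak>"]
    prev_end = None
    for start_ms, end_ms, text in cues:
        # Keep the silence between cues as an explicit break.
        if prev_end is not None and start_ms > prev_end:
            parts.append(f'<break time="{start_ms - prev_end}ms"/>')
        # Cap the spoken line at the length of its subtitle slot (Standard voices only).
        parts.append(
            f'<prosody amazon:max-duration="{end_ms - start_ms}ms">'
            f"{escape(text)}</prosody>"
        )
        prev_end = end_ms
    parts.append("</speak>")
    return "".join(parts)


cues = [
    (0, 3171, "Our agenda. We will go over the application and then."),
    (3180, 5410, "I will show you how to navigate through it."),
]
print(cues_to_ssml(cues))
```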

AWS
TB
Answered 6 months ago
