Unfortunately, it's not possible to generate both speech audio and speech marks (timestamps) in a single API call with Amazon Polly: each requires its own request. This limitation can indeed lead to synchronization issues, especially when using neural voices, which may have more variability in their output.
To improve synchronization between the audio and timestamps, you can try the following approaches:
- Use the same voice and engine settings for both requests. Ensure that you're using identical parameters (VoiceId, LanguageCode, Engine, etc.) for both the audio generation and speech marks generation calls.
- Consider using the standard voice engine instead of the neural engine. While the audio quality might be slightly lower, the standard voices tend to have more consistent timing, which could lead to better synchronization.
- If possible, break your text into smaller chunks and process them separately. This can help minimize the cumulative timing differences that occur over longer passages.
- Implement a post-processing step to adjust the timestamps based on the actual audio duration. You could use audio analysis tools to detect pauses or specific phonemes in the audio and align the timestamps accordingly.
- If your use case allows, you might consider using Amazon Polly's asynchronous synthesis functionality for longer texts. This method processes the entire text as a single unit, which might lead to more consistent timing between the audio and speech marks.
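As a rough illustration of the first suggestion, here is a minimal sketch that builds the shared parameters once and reuses them for both requests, so the two calls cannot drift apart. It assumes a client exposing Polly's `synthesize_speech` (for example `boto3.client("polly")`); the helper name `synthesize_pair` and the default voice and engine are illustrative assumptions, not part of the original answer.

```python
import json


def synthesize_pair(client, text, voice_id="Joanna", engine="neural"):
    """Request audio and speech marks with identical shared parameters.

    `client` is anything exposing Polly's synthesize_speech, for example
    boto3.client("polly"). The helper name and defaults are hypothetical.
    """
    # Parameters shared by both calls, defined once so they stay identical.
    shared = {
        "Text": text,
        "TextType": "text",
        "VoiceId": voice_id,
        "Engine": engine,
    }

    # Call 1: the audio itself.
    audio_resp = client.synthesize_speech(OutputFormat="mp3", **shared)
    audio = audio_resp["AudioStream"].read()

    # Call 2: speech marks, returned as one JSON object per line.
    marks_resp = client.synthesize_speech(
        OutputFormat="json",
        SpeechMarkTypes=["word", "sentence"],
        **shared,
    )
    raw = marks_resp["AudioStream"].read().decode("utf-8")
    marks = [json.loads(line) for line in raw.splitlines() if line.strip()]
    return audio, marks
```

With boto3 installed and credentials configured, `synthesize_pair(boto3.client("polly"), "Hello world.")` should return the MP3 bytes and a list of mark dictionaries, each carrying a `time` offset in milliseconds.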
Remember that perfect synchronization may be challenging to achieve, especially with neural voices. You may need to implement some level of tolerance or adjustment in your application to account for slight timing discrepancies.
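One way to build in that tolerance is to look up the active word with a small lead window, so a mark that is a few milliseconds ahead of the playback position still counts as started. The sketch below assumes the marks are already parsed into dictionaries with `time` (milliseconds) and `type` fields and are sorted by time; the function name `active_word` and the 50 ms default are hypothetical choices for illustration.

```python
from bisect import bisect_right


def active_word(marks, playback_ms, tolerance_ms=50):
    """Return the word mark active at playback_ms, or None before the first word.

    A mark counts as started if its time is at most tolerance_ms ahead of
    the playback position, which absorbs small timing discrepancies.
    Assumes `marks` is sorted by time.
    """
    words = [m for m in marks if m["type"] == "word"]
    times = [m["time"] for m in words]
    # Index of the last word whose start time is <= playback_ms + tolerance_ms.
    idx = bisect_right(times, playback_ms + tolerance_ms) - 1
    return words[idx] if idx >= 0 else None
```

Calling this from the player's time-update handler and tuning `tolerance_ms` against real audio is probably the simplest way to find a value that feels right for a given voice.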
Sources
AWS Polly Speech Generation: Is it possible to produce speech plus speech marks in one call? | AWS re:Post
Amazon Polly Features
Long audio files - Amazon Polly

Can you provide input text where this reproduces? Speechmarks should align with the audio when all parameters are the same, regardless of the engine.