
Amazon Polly Multi-Voice Synthesis in Single SSML Document - Complex Example Request


I'm exploring Amazon Polly's capabilities with SSML for creating dialogues with multiple voices, emotions, pitch adjustments, DRC, pauses, and more. Here are my specific inquiries:

Specific Questions:

Multi-Voice Support Confirmation: Can anyone confirm whether Amazon Polly supports synthesis with multiple voices in one SSML document, including various audio effects?

Sample SSML Code Request: I need a detailed SSML example showcasing:

  • Multiple Voices: At least two characters (e.g., Amy and Brian).
  • Emotions: Different emotions with amazon:emotion (e.g., happy, sad, angry, excited, surprised, fearful, disgusted, contemptuous, neutral).
  • Pitch Adjustments: Using <prosody> for emotional expression.
  • Speed Modifications: Rate changes for dynamic pacing.
  • Dynamic Range Compression (DRC): Using amazon:effect for DRC.
  • Pauses: With <break> for conversation flow.
  • Whispering: Using amazon:effect for whisper effects.
  • Emphasis: Highlighting with <emphasis>.
  • Date, Time, and Number Interpretation: Correct usage of <say-as>.

Here's a comprehensive structure:

<speak xmlns:amazon="http://aws.amazon.com/2008/08/speech/ssml">
  <amazon:effect name="drc">
    <voice name="Amy">
      <amazon:emotion name="happy" intensity="medium">
        <prosody pitch="+2%" rate="1.1">Good morning, Brian! It's a beautiful <say-as interpret-as="date">2025-02-05</say-as> today!</prosody>
      </amazon:emotion>
      <break time="500ms"/>
      <amazon:emotion name="excited" intensity="high">
        <prosody pitch="+15%" rate="1.2">Oh, I just got an amazing deal on a <emphasis level="strong">NEW CAR</emphasis>!</prosody>
      </amazon:emotion>
      <break time="1s"/>
      <amazon:emotion name="surprised" intensity="medium">
        <prosody pitch="+10%" rate="1.1">Wait, <emphasis level="moderate">you</emphasis> got one too?</prosody>
      </amazon:emotion>
      <amazon:effect name="whispered">
        <prosody pitch="-5%" rate="0.9">But let's keep it a secret, okay?</prosody>
      </amazon:effect>
    </voice>
    <voice name="Brian">
      <amazon:emotion name="sad" intensity="low">
        <prosody pitch="-5%" rate="0.9">Hey Amy, I'm not feeling great. It's been a tough week.</prosody>
      </amazon:emotion>
      <break time="800ms"/>
      <amazon:emotion name="angry" intensity="high">
        <prosody pitch="+10%" rate="1.2">And <emphasis level="strong">DON'T</emphasis> even get me started on the <say-as interpret-as="number">12</say-as> meetings I had!</prosody>
      </amazon:emotion>
      <break time="1.5s"/>
      <amazon:emotion name="fearful" intensity="medium">
        <prosody pitch="+5%" rate="1.0">I'm <emphasis level="reduced">scared</emphasis> of what next week might bring.</prosody>
      </amazon:emotion>
      <amazon:effect name="whispered">
        <prosody pitch="-5%" rate="0.9">Yeah, let's keep our good news quiet for now.</prosody>
      </amazon:effect>
    </voice>
    <voice name="Amy">
      <amazon:emotion name="concerned" intensity="medium">
        <prosody pitch="0%" rate="1.0">Oh no, Brian, I'm sorry to hear that. <break time="300ms"/> What can I do to help?</prosody>
      </amazon:emotion>
      <break time="1s"/>
      <amazon:emotion name="disgusted" intensity="medium">
        <prosody pitch="-2%" rate="0.9">Those meetings sound <emphasis level="strong">awful</emphasis>.</prosody>
      </amazon:emotion>
      <amazon:emotion name="contemptuous" intensity="low">
        <prosody pitch="-3%" rate="1.0">Some people just love to <emphasis level="moderate">waste time</emphasis>.</prosody>
      </amazon:emotion>
    </voice>
    <voice name="Brian">
      <amazon:emotion name="neutral" intensity="medium">
        <prosody pitch="0%" rate="1.0">Thanks, Amy. Maybe we can catch up later? <break time="500ms"/> It's <say-as interpret-as="time">19:04</say-as> now, so let's say around <say-as interpret-as="time">20:00</say-as>?</prosody>
      </amazon:emotion>
    </voice>
  </amazon:effect>
</speak>

Sample Conversational Audio File:

Does anyone have or know of an audio file where this SSML structure is used to simulate a conversation between Amy and Brian? This would be helpful for understanding the audio output.

Expected Outcome:

This example should demonstrate Amazon Polly's ability to handle complex SSML documents with multiple voices, diverse emotional expressions, pitch and speed adjustments, DRC, pauses, whispers, emphasis, and proper interpretation of dates, times, and numbers. This would provide insight into structuring dialogues for various scenarios.

Prior Research:

I've reviewed Amazon Polly's SSML documentation and tested basic multi-voice synthesis, but I'm seeking confirmation and examples for more complex use cases to ensure I'm utilizing all available features correctly.

Tags:

  • Amazon Polly
  • SSML
  • Text-to-Speech
  • Voice Synthesis

Thank you for your help. Your insights and examples would be crucial for understanding the full potential of Amazon Polly's SSML features.

asked a year ago · 306 views
1 Answer

Unfortunately, Amazon Polly does not currently support generating audio with multiple voices in the same request. Rather than using a <voice> tag, you pass the voice to be used as a request parameter, which is why each request can use only one voice. Here is what that looks like with the AWS CLI:

aws polly synthesize-speech \
--voice-id Joanna \
--text-type ssml \
--text '<speak>Hello world</speak>' \
--output-format mp3 \
speech.mp3

You can find an overview here: https://docs.aws.amazon.com/polly/latest/dg/example-ssml-synthesize-speech-cli.html, and additional code examples for various languages are linked here: https://docs.aws.amazon.com/polly/latest/dg/samples-and-examples.html. If you want to use different voices, you have to generate the audio for each voice independently and then combine the clips afterwards.
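To make the per-voice workflow concrete, here is a minimal sketch of splitting a dialogue into one Polly request per turn. It uses Python with boto3 (the actual API calls are commented out since they need AWS credentials); the voice names and SSML bodies are illustrative, not taken from the question above.

```python
def build_requests(turns):
    """Turn (voice, ssml_body) pairs into one synthesize-speech input each,
    since Polly accepts only a single voice per request."""
    requests = []
    for voice, body in turns:
        requests.append({
            "VoiceId": voice,
            "TextType": "ssml",
            "Text": f"<speak>{body}</speak>",
            "OutputFormat": "mp3",
        })
    return requests

dialogue = [
    ("Amy", "Good morning, Brian!"),
    ("Brian", "Hey Amy, it has been a tough week."),
]

for i, req in enumerate(build_requests(dialogue)):
    print(i, req["VoiceId"])
    # import boto3
    # polly = boto3.client("polly")
    # audio = polly.synthesize_speech(**req)["AudioStream"].read()
    # with open(f"part_{i}.mp3", "wb") as f:
    #     f.write(audio)
```

The resulting part_*.mp3 clips can then be concatenated with an audio tool such as ffmpeg to produce the final dialogue.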

The format and tags you are showing above appear to be for Alexa rather than Polly. Polly supports a smaller subset of tags (see https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html), and not all of the tags are supported across all of the voices. For instance, <amazon:effect name="drc"> is available for Standard, Neural, and Long-Form voices but not Generative voices, whereas <amazon:effect name="whispered"> is only available on Standard voices. Unfortunately, amazon:emotion is not supported at all. If a tag is listed on the supported tags page, it is available for the Standard voices; the table on that page then indicates whether it is also available for the other voice types.
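As an illustration of those restrictions, here is a hedged sketch of how one turn of the dialogue might look as Polly-compatible SSML: no <voice> tag and no amazon:emotion, with drc and whispered retained (so, per the support table, this particular combination would need a single Standard voice passed as the VoiceId parameter):

<speak>
  <amazon:effect name="drc">
    <prosody pitch="+2%" rate="110%">
      Good morning, Brian! It's a beautiful
      <say-as interpret-as="date" format="ymd">2025-02-05</say-as> today!
    </prosody>
    <break time="500ms"/>
    <amazon:effect name="whispered">
      But let's keep it a secret, okay?
    </amazon:effect>
  </amazon:effect>
</speak>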

AWS
answered a year ago
