I'm exploring Amazon Polly's SSML capabilities for creating dialogues with multiple voices, emotions, pitch adjustments, DRC, pauses, and more.
Specific Questions:
Multi-Voice Support Confirmation: Can anyone confirm whether Amazon Polly supports synthesis with multiple voices in a single SSML document, combined with the audio effects listed in the request that follows?
Sample SSML Code Request:
I need a detailed SSML example showcasing:
Multiple Voices: At least two characters (e.g., Amy and Brian).
Emotions: Different emotions with amazon:emotion (e.g., happy, sad, angry, excited, surprised, fearful, disgusted, contemptuous, neutral).
Pitch Adjustments: Using <prosody> for emotional expression.
Speed Modifications: Rate changes for dynamic pacing.
Dynamic Range Compression (DRC): Using amazon:effect for DRC.
Pauses: With <break> for conversation flow.
Whispering: Using amazon:effect for whisper effects.
Emphasis: Highlighting with <emphasis>.
Date, Time, and Number Interpretation: Correct usage of <say-as>.
Here's a comprehensive structure:
```xml
<speak xmlns:amazon="http://aws.amazon.com/2008/08/speech/ssml">
  <amazon:effect name="drc">
    <voice name="Amy">
      <amazon:emotion name="happy" intensity="medium">
        <prosody pitch="+2%" rate="1.1">Good morning, Brian! It's a beautiful <say-as interpret-as="date">2025-02-05</say-as> today!</prosody>
      </amazon:emotion>
      <break time="500ms"/>
      <amazon:emotion name="excited" intensity="high">
        <prosody pitch="+15%" rate="1.2">Oh, I just got an amazing deal on a <emphasis level="strong">NEW CAR</emphasis>!</prosody>
      </amazon:emotion>
      <break time="1s"/>
      <amazon:emotion name="surprised" intensity="medium">
        <prosody pitch="+10%" rate="1.1">Wait, <emphasis level="moderate">you</emphasis> got one too?</prosody>
      </amazon:emotion>
      <amazon:effect name="whispered">
        <prosody pitch="-5%" rate="0.9">But let's keep it a secret, okay?</prosody>
      </amazon:effect>
    </voice>
    <voice name="Brian">
      <amazon:emotion name="sad" intensity="low">
        <prosody pitch="-5%" rate="0.9">Hey Amy, I'm not feeling great. It's been a tough week.</prosody>
      </amazon:emotion>
      <break time="800ms"/>
      <amazon:emotion name="angry" intensity="high">
        <prosody pitch="+10%" rate="1.2">And <emphasis level="strong">DON'T</emphasis> even get me started on the <say-as interpret-as="number">12</say-as> meetings I had!</prosody>
      </amazon:emotion>
      <break time="1.5s"/>
      <amazon:emotion name="fearful" intensity="medium">
        <prosody pitch="+5%" rate="1.0">I'm <emphasis level="reduced">scared</emphasis> of what next week might bring.</prosody>
      </amazon:emotion>
      <amazon:effect name="whispered">
        <prosody pitch="-5%" rate="0.9">Yeah, let's keep our good news quiet for now.</prosody>
      </amazon:effect>
    </voice>
    <voice name="Amy">
      <amazon:emotion name="concerned" intensity="medium">
        <prosody pitch="0%" rate="1.0">Oh no, Brian, I'm sorry to hear that. <break time="300ms"/> What can I do to help?</prosody>
      </amazon:emotion>
      <break time="1s"/>
      <amazon:emotion name="disgusted" intensity="medium">
        <prosody pitch="-2%" rate="0.9">Those meetings sound <emphasis level="strong">awful</emphasis>.</prosody>
      </amazon:emotion>
      <amazon:emotion name="contemptuous" intensity="low">
        <prosody pitch="-3%" rate="1.0">Some people just love to <emphasis level="moderate">waste time</emphasis>.</prosody>
      </amazon:emotion>
    </voice>
    <voice name="Brian">
      <amazon:emotion name="neutral" intensity="medium">
        <prosody pitch="0%" rate="1.0">Thanks, Amy. Maybe we can catch up later? <break time="500ms"/> It's <say-as interpret-as="time">19:04</say-as> now, so let's say around <say-as interpret-as="time">20:00</say-as>?</prosody>
      </amazon:emotion>
    </voice>
  </amazon:effect>
</speak>
```
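To narrow the question down, here is a rough way to list which element names in a draft like the one above fall outside an allowlist. This is only a sketch: `ASSUMED_POLLY_TAGS` and `unlisted_tags` are names I made up, and the allowlist is hand-copied from my reading of Polly's supported-tags page, so it may be incomplete or out of date and should be checked against the current documentation.

```python
import re

# Assumption: hand-copied from my reading of Polly's "Supported SSML tags"
# page; it may be incomplete or out of date.
ASSUMED_POLLY_TAGS = {
    "speak", "break", "emphasis", "lang", "mark", "p", "phoneme",
    "prosody", "s", "say-as", "sub", "w",
    "amazon:auto-breaths", "amazon:breath", "amazon:domain", "amazon:effect",
}

# Captures the element name of every opening or self-closing tag.
# Closing tags ("</...") are skipped because "/" cannot start a name.
_OPEN_TAG = re.compile(r"<([A-Za-z][\w:.-]*)")


def unlisted_tags(ssml: str) -> list[str]:
    """Return element names used in `ssml` that are not on the assumed allowlist."""
    found = set(_OPEN_TAG.findall(ssml))
    return sorted(found - ASSUMED_POLLY_TAGS)


# A shortened version of the draft, using the same element names.
draft = (
    '<speak><voice name="Amy"><amazon:emotion name="happy" intensity="medium">'
    '<prosody pitch="+2%">Good morning!</prosody>'
    '</amazon:emotion><break time="500ms"/></voice></speak>'
)
print(unlisted_tags(draft))  # ['amazon:emotion', 'voice']
```

Under that assumed allowlist, `<voice>` and `amazon:emotion` are the two elements I would most like confirmation on. The check looks only at element names, not at attribute values such as `rate` or `interpret-as`.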
Sample Conversational Audio File:
Does anyone have, or know of, an audio file generated from an SSML structure like this one, simulating a conversation between Amy and Brian? Hearing the result would help me understand how these tags affect the audio output.
Expected Outcome:
This example should demonstrate Amazon Polly's ability to handle a complex SSML document that combines multiple voices, varied emotional expression, pitch and speed adjustments, DRC, pauses, whispers, emphasis, and correct interpretation of dates, times, and numbers. A working version would show me how to structure dialogues for other scenarios.
Prior Research:
I've reviewed Amazon Polly's SSML documentation and tested basic multi-voice synthesis, but I'm seeking confirmation and examples for more complex use cases to ensure I'm utilizing all available features correctly.
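For context, here is a minimal sketch of what basic multi-voice synthesis can look like. It assumes the voice is chosen per request through the `VoiceId` parameter of `SynthesizeSpeech` rather than in the markup, and that joining the returned MP3 bytes back to back is good enough for a quick listen. `turn_ssml` and `synthesize_dialogue` are hypothetical helper names, and `client` is expected to be a `boto3.client("polly")` object. It is not necessarily the exact test mentioned above.

```python
from xml.sax.saxutils import escape


def turn_ssml(text: str, rate: str = "medium", pause_ms: int = 0) -> str:
    """Wrap one speaker turn in <speak>/<prosody>, with an optional trailing pause."""
    pause = f'<break time="{pause_ms}ms"/>' if pause_ms else ""
    return f'<speak><prosody rate="{rate}">{escape(text)}</prosody>{pause}</speak>'


def synthesize_dialogue(client, turns) -> bytes:
    """Synthesize each (voice_id, ssml) turn separately and join the audio bytes.

    Assumption: naive byte concatenation of MP3 responses plays back acceptably;
    a real pipeline might use an audio library to join the clips instead.
    """
    audio = b""
    for voice_id, ssml in turns:
        response = client.synthesize_speech(
            Text=ssml, TextType="ssml", VoiceId=voice_id, OutputFormat="mp3"
        )
        audio += response["AudioStream"].read()
    return audio


turns = [
    ("Amy", turn_ssml("Good morning, Brian!", rate="110%", pause_ms=500)),
    ("Brian", turn_ssml("Hey Amy, it's been a tough week.", rate="90%")),
]

# With AWS credentials configured, this would be run roughly as:
#   import boto3
#   audio = synthesize_dialogue(boto3.client("polly"), turns)
#   with open("dialogue.mp3", "wb") as f:
#       f.write(audio)
```

Each turn is its own SSML document here, so any effect that should cover the whole conversation, such as DRC, would have to be repeated inside every turn.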
Tags:
- Amazon Polly
- SSML
- Text-to-Speech
- Voice Synthesis
Thank you for your help. Your insights and examples would help me understand the full potential of Amazon Polly's SSML features.