How to use Amazon Polly to resolve common implementation challenges

6 minute read
Advanced

Resolving common implementation challenges in the development of text-to-speech applications using the rich features of Amazon Polly

AWS offers a rich stack of AI/ML services that help automate several components of the customer service industry. Amazon Polly, an AI-powered text-to-speech service, enables customers to automate and scale their interactive voice solutions, helping to improve productivity and reduce costs.

In the blog post Integrating Amazon Polly with legacy IVR systems by converting output to WAV format, we discussed the challenge of legacy systems that do not support the MP3 and PCM file formats, and looked at a solution for handling that situation when working with Amazon Polly.

In this post, we continue to explore some of the practical challenges you may run into when working with text-to-speech applications, and look at the rich set of features offered by Amazon Polly, such as SSML tags and lexicons, that can help address them. The challenges covered include handling both SSML and plain text requests, dealing with common abbreviations such as "no." that should always be spoken as "number", and adjusting the speech rate for the best user experience.

Let us look at some of the common challenges and how to resolve them using features like SSML tags and lexicons.

Tackling both Plain Text and SSML requests at the same time

There is no single text type in Amazon Polly that supports plain text and SSML requests at the same time. A request is one or the other, for example: a) "Text without tag" or b) "The word <say-as interpret-as="characters">981122334455</say-as>". You can use .withTextType(TextType.Ssml) to generate speech from input that contains tags, but that setting does not accept simple text without tags.

The caller has to specify which type of request is being sent to Amazon Polly (either text or ssml).

There are two ways to overcome this challenge:

Method 1: Wrap all plain text requests in <speak></speak> tags and send them as SSML.
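As a rough illustration of Method 1, the sketch below wraps a plain text request in <speak> tags before it is sent as SSML. The class and method names (SsmlWrapper, wrapPlainText) are hypothetical and not part of the AWS SDK, and the sketch assumes the plain text contains no XML reserved characters (see "Dealing with Special Characters" for how to escape them).

```java
// Minimal sketch of Method 1: send every request as SSML.
// SsmlWrapper and wrapPlainText are hypothetical names, not AWS SDK classes.
final class SsmlWrapper {

    /**
     * Wraps plain text in <speak> tags. Input that already starts with a
     * <speak> element is returned unchanged so it is not wrapped twice.
     * Assumption: the plain text contains no XML reserved characters.
     */
    static String wrapPlainText(String input) {
        String trimmed = input.trim();
        if (trimmed.startsWith("<speak>") || trimmed.startsWith("<speak ")) {
            return trimmed;
        }
        return "<speak>" + trimmed + "</speak>";
    }

    public static void main(String[] args) {
        System.out.println(wrapPlainText("Your order has shipped."));
        System.out.println(wrapPlainText("<speak>Already SSML</speak>"));
    }
}
```

With this approach the application can always set the SSML text type, at the cost of having to escape reserved characters in what used to be plain text.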

Method 2: Implement simple logic in the application code, before making the API call, that sets the text type based on the content.

For example, the application can check the input text for the presence of <speak> tags and choose the text type accordingly.
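One possible shape for this check is sketched below. TextTypeDetector and detectTextType are hypothetical names; the returned strings "ssml" and "text" are the two text type values mentioned above. The check is deliberately simple and only looks at whether the input is enclosed in a <speak> element.

```java
// Minimal sketch of Method 2: pick the text type from the content.
// TextTypeDetector and detectTextType are hypothetical names.
final class TextTypeDetector {

    /** Returns "ssml" when the input is enclosed in <speak> tags, else "text". */
    static String detectTextType(String input) {
        String trimmed = input.trim();
        boolean opensWithSpeak =
                trimmed.startsWith("<speak>") || trimmed.startsWith("<speak ");
        boolean closesWithSpeak = trimmed.endsWith("</speak>");
        return (opensWithSpeak && closesWithSpeak) ? "ssml" : "text";
    }

    public static void main(String[] args) {
        System.out.println(detectTextType("Text without tag"));
        System.out.println(detectTextType(
                "<speak>The word <say-as interpret-as=\"characters\">981122334455</say-as></speak>"));
    }
}
```

The result can then drive the value passed to .withTextType(...) when the request is built, so that callers do not have to declare the type themselves.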

Customized pronunciation – Lexicons to the rescue

Handling acronyms

Lexicons are one of the most powerful features in Amazon Polly. A lexicon lets you customize the pronunciation of words by mapping a specific input to a specific spoken output. When the input provided to Amazon Polly contains acronyms or abbreviations that need to be expanded in the output audio (for example, “no.” needs to be interpreted as “number”), you can use lexicons to achieve the desired result. To do so, create a lexicon file in .pls or .xml format and upload it under the “Lexicon” tab in the Amazon Polly console.

The lexicon file should contain the source word (as it appears in the user input) and the target word (how Amazon Polly should pronounce it). You can test the lexicon through the Amazon Polly console, or through the SDK/CLI using SynthesizeSpeech or StartSpeechSynthesisTask, by specifying the lexicon to be used. When testing in the Amazon Polly console, specify the lexicon in the “Customize Pronunciation” settings. For more information about lexicons and how to use them, see Managing Lexicons.

Some common words are typically written in an abbreviated style, for example “no.” for “number”, or as acronyms, for example "UNO" for "United Nations Organization". A text-to-speech (TTS) engine reads the text literally, so it reads “no.” rather than “number”. This is where you can leverage lexicons to customize the speech synthesized by Amazon Polly.

With a lexicon file in place, you do not need to modify the input with SSML tags for every occurrence of a word. For more information, see https://docs.amazonaws.cn/en_us/polly/latest/dg/gs-put-lexicon.html
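To make the file format concrete, here is a hedged sketch that generates a PLS lexicon mapping each source word (grapheme) to its spoken replacement (alias). LexiconBuilder and buildPls are hypothetical helper names, the en-US language code is an assumption, and the entries are the examples used in this post. How Amazon Polly matches entries that contain punctuation, such as "no.", has not been verified here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: build a PLS lexicon document from grapheme -> alias pairs.
// LexiconBuilder and buildPls are hypothetical names, not AWS SDK classes.
final class LexiconBuilder {

    /** Escapes characters that are reserved in XML text content. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Returns the lexicon XML for the given entries and language code. */
    static String buildPls(Map<String, String> aliases, String lang) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
          .append("<lexicon version=\"1.0\"\n")
          .append("      xmlns=\"http://www.w3.org/2005/01/pronunciation-lexicon\"\n")
          .append("      xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n")
          .append("      xsi:schemaLocation=\"http://www.w3.org/2005/01/pronunciation-lexicon ")
          .append("http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd\"\n")
          .append("      alphabet=\"ipa\" xml:lang=\"").append(lang).append("\">\n");
        for (Map.Entry<String, String> e : aliases.entrySet()) {
            sb.append("  <lexeme>\n")
              .append("    <grapheme>").append(escapeXml(e.getKey())).append("</grapheme>\n")
              .append("    <alias>").append(escapeXml(e.getValue())).append("</alias>\n")
              .append("  </lexeme>\n");
        }
        return sb.append("</lexicon>\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> aliases = new LinkedHashMap<>();
        aliases.put("UNO", "United Nations Organization");
        aliases.put("no.", "number");
        System.out.print(buildPls(aliases, "en-US")); // assumed language code
    }
}
```

The printed XML can be saved with a .pls extension and uploaded as described above.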

The above example demonstrates how to do it from the console. However, the same can be achieved using the AWS CLI or an AWS SDK. For example, when you are developing in Java and want to leverage lexicons, remember to call .withLexiconNames() on the request and pass it the name of your lexicon (<LEXICON_NAME>).

As shown in the example screenshot above, you may supply multiple lexicons, and they are evaluated in the order of preference depicted. Please refer to Applying Multiple Lexicons for more details.

Dealing with Special Characters

SSML is a W3C standard based on XML, so special characters must be escaped as the XML standard requires. Reserved characters such as “<”, “>”, and “ ’ ” cannot appear unescaped in SSML input, and sending them as-is may result in an “Invalid SSML request” error. To avoid the error, replace these characters with their escape codes.

For syntax and examples, see Reserved Characters in SSML.
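The escaping step could look like the sketch below, which replaces the five XML reserved characters with their standard escape codes. SsmlEscaper and escape are hypothetical names. The method is meant for the text content only; applying it to a complete SSML document would also escape the tags themselves.

```java
// Minimal sketch: escape XML reserved characters in text that goes inside SSML.
// SsmlEscaper and escape are hypothetical names, not AWS SDK classes.
final class SsmlEscaper {

    /** Replaces ", &, ', < and > with their XML escape codes. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"':  sb.append("&quot;"); break;
                case '&':  sb.append("&amp;");  break;
                case '\'': sb.append("&apos;"); break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Escape only the text, then add the SSML tags around it.
        String ssml = "<speak>" + escape("Terms & conditions apply if x < 5") + "</speak>";
        System.out.println(ssml);
    }
}
```

Scanning character by character avoids the double-escaping that chained replacements can cause when "&" is not handled first.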

Controlling the Speech Rate after conversion between formats

Sometimes, when you convert audio files between formats, for example from WAV to PCM, the result does not have the desired speech rate and plays too slow or too fast. You can compensate with the <prosody> tag: adjust the speech rate of the input, for example to a slower rate, so that the converted audio matches the required rate. For more information, refer to Controlling Volume, Speaking Rate, and Pitch.

Now, let’s slow it down to see how it works. Setting the rate attribute of the prosody tag to 50% slows the speech to half its normal speed, which makes the effect of the tag easy to hear.
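A small helper for building such a request might look like the following sketch. ProsodyRate and withRate are hypothetical names. The sketch assumes the text has already been escaped for SSML and only rejects non-positive percentages; the Amazon Polly documentation lists the supported range, which is not enforced here.

```java
// Minimal sketch: wrap text in a <prosody> tag with a percentage speech rate.
// ProsodyRate and withRate are hypothetical names, not AWS SDK classes.
final class ProsodyRate {

    /**
     * Returns an SSML document that speaks the text at the given rate,
     * where 100 means the normal rate and 50 means half the normal rate.
     * Assumption: the text is already escaped for use inside SSML.
     */
    static String withRate(String text, int ratePercent) {
        if (ratePercent <= 0) {
            throw new IllegalArgumentException("ratePercent must be positive");
        }
        return "<speak><prosody rate=\"" + ratePercent + "%\">"
                + text + "</prosody></speak>";
    }

    public static void main(String[] args) {
        System.out.println(withRate("Thank you for calling.", 50));
    }
}
```

Keeping the rate as a parameter makes it easy to tune the value until the converted audio plays at the intended speed.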

The challenges above, and their resolution using the rich feature set of Amazon Polly, should help you overcome most of the implementation hurdles when developing a typical automated response solution.

For further reading, refer to the following Amazon Polly documentation for some great examples of tag usage: Supported SSML Tags, Managing Lexicons, and Using the Amazon Polly Console.

published 4 months ago, 651 views