Custom SSML tag stays inside the speech mark sentence Polly generates

The documentation for custom tags (https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html#custom-tag) says:

Placing a Custom Tag in Your Text

<mark>

This tag is supported by both neural and standard TTS formats.

To put a custom tag within the text, use the <mark> tag. Amazon Polly takes no action on the tag, but returns the location of the tag in the SSML metadata. This tag can be anything you want to call out, as long as it maintains the following format:

<mark name="tag_name"/>

For example, suppose that the tag name is "animal" and the input text is:

<speak> Mary had a little <mark name="animal"/>lamb. </speak>

Amazon Polly might return the following SSML metadata:

{"time":767,"type":"ssml","start":25,"end":46,"value":"animal"}

The passage above is from the official AWS documentation.

Yet, when I do this:

  public PollyResult synthesizeLongSpeechmarks(
      String fullText, String outbucket, String customerId, String documentId,
      PollyParams params, LambdaLogger logger) {
    // Replace image tags with SSML format
    // String processedText = replaceImageTagsWithSSML(fullText);
    String text = ssmlTextService.getSsmlText(params.getDomain(), fullText, params.getSpeekingRate(), logger);

    // Use a region-suffixed bucket when the Polly region differs from the Lambda region
    String destinationBucket = outbucket;
    String pollyRegion = System.getenv("-----");
    if (!System.getenv("AWS_REGION").equals(pollyRegion)) {
      destinationBucket = destinationBucket + "." + pollyRegion;
    }

    // Hard-coded test input taken from the documentation example (text above is not used here)
    String ssmlString = "<speak> Mary had a little <mark name=\"animal\"/>lamb. </speak>";

    // Request both SSML and sentence speech marks as JSON
    StartSpeechSynthesisTaskRequest request = new StartSpeechSynthesisTaskRequest()
        .withOutputS3BucketName(destinationBucket)
        .withOutputS3KeyPrefix("polly_speechmarks/" + customerId + "." + documentId)
        .withOutputFormat(OutputFormat.Json)
        .withSpeechMarkTypes(SpeechMarkType.Ssml, SpeechMarkType.Sentence)
        .withVoiceId(params.getVoiceId())
        .withTextType(TextType.Ssml)
        .withSampleRate(params.getSampleRate())
        .withSnsTopicArn(System.getenv("---  --"))
        .withEngine(params.getEngine())
        .withLanguageCode(params.getLanguageCode())
        .withText(ssmlString);
    try {
      StartSpeechSynthesisTaskResult result = pollyClient.startSpeechSynthesisTask(request);
      SynthesisTask task = result.getSynthesisTask();
      return new PollyResult(true, task.getTaskId(), task.getRequestCharacters());
    } catch (TextLengthExceededException e) {
      return new PollyResult(false, null, null);
    }
  }

The speech mark file I get is this:

{"time":0,"type":"sentence","start":8,"end":52,"value":"Mary had a little <mark name="animal"/>lamb."}

{"time":937,"type":"ssml","start":26,"end":47,"value":"animal"}

What I was going for is this:

{"time":0,"type":"sentence","start":8,"end":52,"value":"Mary had a little lamb."}

{"time":937,"type":"ssml","start":26,"end":47,"value":"animal"}

asked 4 months ago, 126 views
1 Answer
Speech marks primarily give you offsets into the input text. The value element is simply the substring of the input text between start and end, and is provided just for convenience. Hence it contains the raw input, with all the SSML (and even XML comments) that were there. Relevant documentation: https://docs.aws.amazon.com/polly/latest/dg/using-speechmarks.html#output
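To make the offsets concrete, here is a minimal sketch that reproduces the sentence value from the question's input using start=8 and end=52. It assumes, as I read the speech marks documentation, that start and end are byte offsets into the UTF-8 encoded input rather than character offsets. The class and method names (SpeechMarkSlice, sliceByBytes) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SpeechMarkSlice {
    // Returns the part of the input between two byte offsets [start, end).
    // Assumption: offsets refer to the UTF-8 encoding of the input text.
    static String sliceByBytes(String input, int start, int end) {
        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        return new String(Arrays.copyOfRange(bytes, start, end), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ssml = "<speak> Mary had a little <mark name=\"animal\"/>lamb. </speak>";
        // Offsets taken from the sentence speech mark in the question
        System.out.println(sliceByBytes(ssml, 8, 52));
        // Offsets taken from the ssml speech mark in the question
        System.out.println(sliceByBytes(ssml, 26, 47));
    }
}
```

The first slice is the sentence with the mark tag still inside it, and the second is the mark tag itself, which matches the speech mark file in the question.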

So if you want just the plain text, you'll have to process the value yourself to remove the tags. Depending on how precise you need it to be, a regex such as value.replaceAll("<.*?>", "") could be sufficient, but keep in mind it will break on corner cases like <sub>.
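A rough sketch of that regex approach, applied to the sentence value from the question. The class and method names (SpeechMarkCleaner, stripTags) are hypothetical. The second call shows one way I would expect <sub> to go wrong: the regex keeps the written text inside the tag, not the alias that is actually spoken, so treat that reading of the corner case as my assumption.

```java
class SpeechMarkCleaner {
    // Removes anything that looks like an XML tag from a speech mark value.
    // This is a plain regex, not an SSML parser: it does not handle tags that
    // span several lines, and it knows nothing about what a tag means.
    static String stripTags(String value) {
        return value.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        // Sentence value as returned in the question
        String value = "Mary had a little <mark name=\"animal\"/>lamb.";
        System.out.println(stripTags(value)); // Mary had a little lamb.

        // Corner case: the element text is kept, the spoken alias is lost
        String sub = "<sub alias=\"World Wide Web Consortium\">W3C</sub>";
        System.out.println(stripTags(sub)); // W3C
    }
}
```

If the cleaned text has to match what is spoken, a real XML parser with per-tag handling is probably the safer route.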

AWS
TB
answered 4 months ago
