Custom SSML tag stays inside the speech mark sentence Polly generates


The documentation for custom tags (https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html#custom-tag) says:

Placing a Custom Tag in Your Text

<mark>

This tag is supported by both neural and standard TTS formats.

To put a custom tag within the text, use the <mark> tag. Amazon Polly takes no action on the tag, but returns the location of the tag in the SSML metadata. This tag can be anything you want to call out, as long as it maintains the following format:

<mark name="tag_name"/>

For example, suppose that the tag name is "animal" and the input text is:

<speak> Mary had a little <mark name="animal"/>lamb. </speak>

Amazon Polly might return the following SSML metadata:

{"time":767,"type":"ssml","start":25,"end":46,"value":"animal"}

The passage above is from the official AWS documentation.

Yet when I run this code:

  public PollyResult synthesizeLongSpeechmarks(
          String fullText, String outbucket, String customerId, String documentId, PollyParams params, LambdaLogger logger) {
    // Replace image tags with SSML format
    // String processedText = replaceImageTagsWithSSML(fullText);
    String text = ssmlTextService.getSsmlText(params.getDomain(), fullText, params.getSpeekingRate(), logger);
    String destinationBucket = outbucket;
    String pollyRegion = System.getenv("-----");
    if (!System.getenv("AWS_REGION").equals(pollyRegion)) {
      destinationBucket = destinationBucket + "." + pollyRegion;
    }
    String ssmlString = "<speak> Mary had a little <mark name=\"animal\"/>lamb. </speak>";

    StartSpeechSynthesisTaskRequest request = new StartSpeechSynthesisTaskRequest()
        .withOutputS3BucketName(destinationBucket)
        .withOutputS3KeyPrefix("polly_speechmarks/" + customerId + "." + documentId)
        .withOutputFormat(OutputFormat.Json)
        .withSpeechMarkTypes(SpeechMarkType.Ssml, SpeechMarkType.Sentence)
        .withVoiceId(params.getVoiceId())
        .withTextType(TextType.Ssml)
        .withSampleRate(params.getSampleRate())
        .withSnsTopicArn(System.getenv("---  --"))
        .withEngine(params.getEngine())
        .withLanguageCode(params.getLanguageCode())
        .withText(ssmlString);
    try {
      StartSpeechSynthesisTaskResult result = pollyClient.startSpeechSynthesisTask(request);
      SynthesisTask task = result.getSynthesisTask();
      return new PollyResult(true, task.getTaskId(), task.getRequestCharacters());
    } catch (TextLengthExceededException e) {
      return new PollyResult(false, null, null);
    }
  }

The speech mark file I get is this one:

{"time":0,"type":"sentence","start":8,"end":52,"value":"Mary had a little <mark name="animal"/>lamb."}

{"time":937,"type":"ssml","start":26,"end":47,"value":"animal"}

I was going for this instead:

{"time":0,"type":"sentence","start":8,"end":52,"value":"Mary had a little lamb."}

{"time":937,"type":"ssml","start":26,"end":47,"value":"animal"}

Asked 5 months ago, 136 views
1 Answer

Speech marks primarily give you offsets into the input text. The value element is simply the substring of the input between start and end, and is provided just for convenience. Hence it contains the raw input, including any SSML tags (and even XML comments) that were there. Relevant documentation: https://docs.aws.amazon.com/polly/latest/dg/using-speechmarks.html#output
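To make the offset behaviour concrete, here is a minimal sketch that slices the question's SSML input using the start and end values from the returned speech marks. It assumes the offsets are byte offsets into the UTF-8 encoded input (check the speech mark documentation linked above); the class name SpeechMarkOffsets and the helper slice are made up for illustration and are not part of the AWS SDK.

```java
import java.nio.charset.StandardCharsets;

public class SpeechMarkOffsets {

    // Hypothetical helper: returns the part of the input between two offsets.
    // Assumption: offsets count bytes of the UTF-8 encoded input, not chars.
    static String slice(String input, int start, int end) {
        byte[] bytes = input.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ssml = "<speak> Mary had a little <mark name=\"animal\"/>lamb. </speak>";

        // Offsets taken from the "sentence" speech mark in the question.
        System.out.println(slice(ssml, 8, 52));

        // Offsets taken from the "ssml" speech mark in the question.
        System.out.println(slice(ssml, 26, 47));
    }
}
```

With the question's input, the first slice is the sentence with the <mark> tag still embedded, which matches the value Polly returned, and the second slice is the <mark> tag itself.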

So if you want plain text, you'll have to strip the tags yourself. Depending on how precise you need it to be, a regex such as value.replaceAll("<.*?>", "") could be sufficient, but keep in mind that it will break on corner cases like <sub>.
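A rough sketch of that post-processing step, using only java.util.regex. The class name SpeechMarkText and the method stripSsml are hypothetical, and this is a regex approximation rather than a real XML parser: it removes XML comments first, then any remaining tags, and it does not unescape entities such as &amp;.

```java
import java.util.regex.Pattern;

public class SpeechMarkText {

    // Matches XML comments, including ones that span several lines.
    private static final Pattern XML_COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Matches any single tag such as <mark name="animal"/> or </sub>.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Hypothetical helper: strips comments and tags from a speech mark value.
    static String stripSsml(String value) {
        String withoutComments = XML_COMMENT.matcher(value).replaceAll("");
        return TAG.matcher(withoutComments).replaceAll("");
    }

    public static void main(String[] args) {
        // The sentence value from the question.
        System.out.println(stripSsml("Mary had a little <mark name=\"animal\"/>lamb."));

        // Corner case: <sub> keeps the written text, not the spoken alias.
        System.out.println(stripSsml("<sub alias=\"mercury\">Hg</sub> is toxic."));
    }
}
```

The first call yields "Mary had a little lamb.", which is the value the question was hoping for. The second call yields "Hg is toxic." rather than the spoken alias, which illustrates the <sub> corner case: if you need the spoken form, you have to handle that tag separately.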

AWS
TB
Answered 5 months ago
