Creating new custom vocabularies from a table is broken

0

When I try to use the AWS Transcribe console to create a new custom vocabulary from a file stored on S3, it always returns this result: "Internal Failure. Please try your request again."

This happens even when using the example provided in the documentation, attached here.

It seems like this feature is broken. Can anybody validate whether you've ever gotten a custom vocabulary table to work? If so, could you post your text file so that I could try it out?

asked 4 years ago, 589 views
7 Answers
0

I found that changing the line endings from CR (ASCII 13) to LF (ASCII 10) fixed the issue. As far as I can see, this requirement is not documented at https://docs.aws.amazon.com/transcribe/latest/dg/how-vocabulary.html#char-english.

answered 4 years ago
0

Changing line endings to ASCII 10 fixed it.
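
For anyone who wants to try the same fix without a text editor, here is a minimal sketch that rewrites line endings to ASCII 10 (LF). It assumes the file is UTF-8 encoded (a byte-level replace is not safe for UTF-16), and the function name `normalize_to_lf` is made up for illustration.

```python
def normalize_to_lf(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF (ASCII 10).

    Assumption: `data` is UTF-8 (or ASCII) text, where bytes 13 and 10
    can only ever mean CR and LF.
    """
    # Replace CRLF first so it does not turn into two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Example: a two-line vocabulary table saved with Windows (CRLF) endings.
sample = b"Phrase\tIPA\tSoundsLike\tDisplayAs\r\nLos-Angeles\t\t\tLos Angeles\r\n"
fixed = normalize_to_lf(sample)
print(fixed)
```

To apply it to a real file, read the file in binary mode, pass the bytes through the function, and write the result back in binary mode so Python does not translate the line endings again.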

answered 4 years ago
0

This does not work for me. In fact, anything I put in the IPA column fails with "Validation error: File contains invalid characters or format in the IPA column."

The content is:
PhraseTABIPATABSoundsLikeTABDisplayAsCR
HeyTABhe:ɪTABTABCR

I have no issues with the list format, but I can't create a vocabulary in table format that includes the IPA field. I have another vocabulary that contains only Phrase and DisplayAs, and it works fine.

Does this feature work at all?

Edited by: jzgoda on Sep 3, 2020 8:48 AM

WTF formatting

jzgoda
answered 4 years ago
0
Earl L.
answered 3 years ago
0

We have reviewed the documentation in detail and keep getting the same error. We use LF line endings, as indicated in the docs, and UTF-8 encoding.

When using UTF-8, we get the "IPA invalid characters" error, but if we change to UTF-16 big endian so that each character fits in 2 bytes, we get an "invalid header names" error.

We are still investigating. Any clues?
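
When comparing encodings like this, it can help to look at the first bytes of the file. The sketch below is only a diagnostic aid: it reports a byte order mark if one is present and prints how the header row is laid out in each encoding. Whether a BOM or UTF-16 is what triggers the "invalid header names" error is an assumption, not something confirmed in this thread; `detect_bom` and `header_bytes` are hypothetical helper names.

```python
import codecs

# Byte order marks for the encodings mentioned in this thread.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def header_bytes(encoding: str) -> bytes:
    """Encode the table header row (no BOM is added by these codecs)."""
    return "Phrase\tIPA\tSoundsLike\tDisplayAs\n".encode(encoding)


# Show the first bytes of the header in each encoding, as hex.
for enc in ("utf-8", "utf-16-be", "utf-16-le"):
    print(enc, header_bytes(enc)[:12].hex(" "))
```

In UTF-16 every ASCII letter of the header is padded with a zero byte, so a parser that expects UTF-8 would not see the literal bytes of "Phrase". That would be consistent with the header error, but it is only a guess.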

Fer
answered 2 years ago
0

I wasn't able to fix this issue after trying various line endings. I started with their example download file which works.

I think this is related to what they consider to be valid IPA characters. I can't find any docs from AWS on valid IPA other than industry info.

Example file:

Phrase	IPA	SoundsLike	DisplayAs
Los-Angeles			Los Angeles
F.B.I.	ɛ f b i aɪ		FBI
Etienne		eh-tee-en	
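
Since tabs and empty columns are easy to lose when copying from a web page, here is a minimal sketch that rebuilds the example above programmatically, with tab separators, LF line endings, and UTF-8 encoding. The helper name `build_table` is made up for illustration, and nothing here guarantees that Transcribe will accept the result.

```python
HEADER = ("Phrase", "IPA", "SoundsLike", "DisplayAs")


def build_table(rows) -> bytes:
    """Serialize rows as a tab-separated table with LF endings, UTF-8 encoded."""
    lines = ["\t".join(HEADER)]
    for row in rows:
        if len(row) != len(HEADER):
            raise ValueError(f"expected {len(HEADER)} columns, got {len(row)}: {row!r}")
        if any(ch in field for field in row for ch in "\t\r\n"):
            raise ValueError(f"field contains a tab or line break: {row!r}")
        lines.append("\t".join(row))
    # Terminate every line, including the last, with a single LF.
    return ("\n".join(lines) + "\n").encode("utf-8")


# The rows from the example file; empty strings are empty columns.
rows = [
    ("Los-Angeles", "", "", "Los Angeles"),
    ("F.B.I.", "ɛ f b i aɪ", "", "FBI"),
    ("Etienne", "", "eh-tee-en", ""),
]
data = build_table(rows)
print(data.decode("utf-8"))
```

Writing `data` to disk with `open(path, "wb")` keeps the bytes exactly as built, so the operating system does not swap the line endings.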
answered 2 years ago
0

I used their own example from the guide at https://docs.aws.amazon.com/transcribe/latest/dg/charsets.html#char-french, which says:

Many Unicode characters can appear identical in popular fonts, even if they use different code points. Only the code points listed in this guide are supported. For example, the French word déjà can be rendered using precomposed characters (where one Unicode value represents an accented character) or decomposed characters (where two Unicode values represent an accented character, one value for the base character and another for the accent).

Precomposed version: 0064 00E9 006A 00E0 (renders as déjà)

Decomposed version: 0064 0065 0301 006A 0061 0300 (renders as déjà)
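
The difference between the two forms quoted above can be checked with Python's standard `unicodedata` module. This is a small sketch for inspecting and normalizing text before it goes into a vocabulary file; the helper name `code_points` is made up for illustration.

```python
import unicodedata

# The two spellings of "déjà" from the guide, written as explicit code points.
precomposed = "\u0064\u00e9\u006a\u00e0"
decomposed = "\u0064\u0065\u0301\u006a\u0061\u0300"


def code_points(text: str) -> str:
    """Render a string as space-separated, four-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


print(code_points(precomposed))
print(code_points(decomposed))

# The strings look the same on screen but compare as different.
print(precomposed == decomposed)

# NFC normalization composes base letter + combining accent into one code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)
```

Running text through `unicodedata.normalize("NFC", ...)` before saving is one way to make sure only precomposed characters end up in the file.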

I tried to create a very simple French vocabulary for testing:

Phrase[TAB]SoundsLike[TAB]IPA[TAB]DisplayAs[EOL]

Test[TAB][TAB]déjà[TAB]Test[EOL]

I tried [LF], [CR], and [CRLF] as EOL, with UTF-8, UTF-16 BE, and UTF-16 LE encodings. I always get one of the errors already described in this thread, for example:

"Validation error: Your custom vocabulary file contains one or more unsupported characters ("é", "à") in the IPA column on line 2."

I made sure to use exactly the characters from the character set guide (copying them from the guide and checking my vocabulary file in a hex editor), and it still does not work.

If I create a vocabulary where only one of those characters is used (and no other character, not even single-byte ones), then it works:

Phrase[TAB]SoundsLike[TAB]IPA[TAB]DisplayAs[EOL]

Test[TAB][TAB]é[TAB]Test[EOL]

It looks like an internal conversion breaks because the second byte of the multi-byte character gets mixed up.

puchmu
answered a year ago
