I found that changing the line endings from LF (ASCII 10) to CR (ASCII 13) fixed the issue. As far as I can see, this requirement is not documented at https://docs.aws.amazon.com/transcribe/latest/dg/how-vocabulary.html#char-english.
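For anyone who wants to confirm which line endings a vocabulary file really contains before uploading it, here is a minimal Python sketch. The function name `count_line_endings` is hypothetical, and the byte counting assumes an ASCII-compatible encoding such as UTF-8 (it would miscount in UTF-16).

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone LF (0x0A) and lone CR (0x0D) in raw file bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF bytes that are not part of a CRLF pair
    cr = data.count(b"\r") - crlf   # CR bytes that are not part of a CRLF pair
    return {"CRLF": crlf, "LF": lf, "CR": cr}


# Sample bytes only; in practice read the file with open(path, "rb").read()
sample = b"Phrase\tDisplayAs\nLos-Angeles\tLos Angeles\n"
print(count_line_endings(sample))  # {'CRLF': 0, 'LF': 2, 'CR': 0}
```

Reading the file in binary mode matters here, because text mode may translate line endings before you get to see them.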
This does not work for me. In fact, anything I put in the IPA column fails with "Validation error: File contains invalid characters or format in the IPA column."
The file content is:
PhraseTABIPATABSoundsLikeTABDisplayAsCR
HeyTABhe:ɪTABTABCR
The list format works without issues, but I can't create a vocabulary in table format that includes the IPA field. I have another vocabulary that contains only Phrase and DisplayAs, and it works fine.
Does this feature work at all?
Edited by: jzgoda on Sep 3, 2020 8:48 AM
WTF formatting
It is now in the documentation. https://docs.aws.amazon.com/transcribe/latest/dg/how-vocabulary.html
We have reviewed the documentation in detail and keep getting the same error. We use LF line endings, as indicated in the docs, and UTF-8 encoding.
With UTF-8 we get the "IPA invalid characters" error, but if we switch to UTF-16 big endian so that each character fits in 2 bytes, we get an "invalid header names" error.
We are still investigating. Any clues?
I wasn't able to fix this issue after trying various line endings. I started with their example download file which works.
I think this is related to what they consider to be valid IPA characters. I can't find any docs from AWS on valid IPA other than industry info.
Example file:
Phrase[TAB]IPA[TAB]SoundsLike[TAB]DisplayAs
Los-Angeles[TAB][TAB][TAB]Los Angeles
F.B.I.[TAB]ɛ f b i aɪ[TAB][TAB]FBI
Etienne[TAB][TAB]eh-tee-en[TAB]
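In case it helps with reproducing this, here is a minimal Python sketch that builds such a table with real tab characters and an explicit line ending, so the editor cannot silently change either. The helper `build_table` is hypothetical, and the column order is simply the one from the example file above. Whether the service accepts the result is exactly what this thread is trying to find out.

```python
HEADER = ["Phrase", "IPA", "SoundsLike", "DisplayAs"]


def build_table(rows, eol="\n"):
    """Join fields with tabs and lines with the chosen EOL (LF by default)."""
    lines = ["\t".join(HEADER)]
    for row in rows:
        if len(row) != len(HEADER):
            raise ValueError(f"expected {len(HEADER)} fields, got {len(row)}: {row!r}")
        lines.append("\t".join(row))
    return eol.join(lines) + eol


rows = [
    ["Los-Angeles", "", "", "Los Angeles"],
    ["F.B.I.", "ɛ f b i aɪ", "", "FBI"],
    ["Etienne", "", "eh-tee-en", ""],
]

# Encode explicitly; write these bytes with open(path, "wb") to avoid EOL translation
data = build_table(rows).encode("utf-8")
print(data.count(b"\t"), data.count(b"\n"))  # 12 4
```

Passing `eol="\r"` or `eol="\r\n"` produces the other line-ending variants discussed in this thread.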
Using their own example from the guide at https://docs.aws.amazon.com/transcribe/latest/dg/charsets.html#char-french:
Many Unicode characters can appear identical in popular fonts, even if they use different code points. Only the code points listed in this guide are supported. For example, the French word déjà can be rendered using precomposed characters (where one Unicode value represents an accented character) or decomposed characters (where two Unicode values represent an accented character, one value for the base character and another for the accent).
Precomposed version: 0064 00E9 006A 00E0 (renders as déjà)
Decomposed version: 0064 0065 0301 006A 0061 0300 (renders as déjà)
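The difference between the two forms is easy to check with Python's standard `unicodedata` module. This is only a sketch for inspecting and normalizing a string locally; the helper name `code_points` is hypothetical, and nothing here says which form the service expects beyond what the guide states.

```python
import unicodedata


def code_points(text):
    """Render a string as space-separated 4-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


# Start from the decomposed spelling (base letter + combining accent)
decomposed = "de\u0301ja\u0300"
precomposed = unicodedata.normalize("NFC", decomposed)

print(code_points(precomposed))  # 0064 00E9 006A 00E0
print(code_points(unicodedata.normalize("NFD", precomposed)))  # 0064 0065 0301 006A 0061 0300
```

Normalizing to NFC before writing the file gives the precomposed code points listed in the guide.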
I tried to create a very simple French vocabulary for testing:
Phrase[TAB]SoundsLike[TAB]IPA[TAB]DisplayAs[EOL]
Test[TAB][TAB]déjà[TAB]Test[EOL]
I used [LF], [CR] or [CRLF] as EOL and UTF-8, UTF-16 BE or UTF-16 LE as the encoding, and always got one of the errors already described in this thread, such as:
"Validation error: Your custom vocabulary file contains one or more unsupported characters ("é", "à") in the IPA column on line 2."
I made sure to use exactly the characters from the charset guide (copying them from the guide and checking my vocabulary in a hex editor), and it does not work.
If I create a vocabulary in which only one of those characters is used (and no other character, not even a single-byte one), then it works:
Phrase[TAB]SoundsLike[TAB]IPA[TAB]DisplayAs[EOL]
Test[TAB][TAB]é[TAB]Test[EOL]
It looks like an internal conversion breaks because the second byte of the character confuses it.
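For reference, this is roughly how the hex-editor check can be done in Python. The function name `describe_non_ascii` is hypothetical; it only shows that each of these characters becomes two bytes in UTF-8 and does not confirm what the service does internally.

```python
def describe_non_ascii(text):
    """List (character, code point, UTF-8 bytes as hex) for every non-ASCII character."""
    result = []
    for ch in text:
        if ord(ch) > 127:
            utf8_hex = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
            result.append((ch, f"U+{ord(ch):04X}", utf8_hex))
    return result


# Precomposed "déjà", written with escapes so the code points are unambiguous
print(describe_non_ascii("d\u00e9j\u00e0"))
# [('é', 'U+00E9', 'c3 a9'), ('à', 'U+00E0', 'c3 a0')]
```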