1 Answer
Hi.
Unfortunately, asynchronous batch processing only accepts UTF-8-formatted text files. From the documentation:
"Documents must be in UTF-8-formatted text files."
https://docs.aws.amazon.com/comprehend/latest/dg/concepts-processing-modes.html#how-async
Alternatively, you can use Amazon Textract to extract the text from the PDF and then send that text to Amazon Comprehend.
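A minimal sketch of the Textract step, assuming the PDF is already in S3 and that you pass in boto3 clients. The function names, bucket names, and keys are hypothetical and only for illustration; the Textract calls used are `start_document_text_detection` and `get_document_text_detection`. It collects the detected lines and writes them back to S3 as a UTF-8 text file that an asynchronous Comprehend job could read.

```python
import time


def lines_from_blocks(blocks):
    """Join the text of LINE blocks from a Textract response, one line per row."""
    return "\n".join(b["Text"] for b in blocks if b.get("BlockType") == "LINE")


def extract_pdf_text(textract, bucket, key, poll_seconds=5):
    """Run an asynchronous Textract text-detection job on a PDF in S3.

    `textract` is expected to be a boto3 Textract client.
    """
    job_id = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )["JobId"]

    # Poll until the job is no longer in progress.
    while True:
        page = textract.get_document_text_detection(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    if page["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job ended with status {page['JobStatus']}")

    # Results are paginated; follow NextToken to collect every block.
    blocks = list(page.get("Blocks", []))
    while page.get("NextToken"):
        page = textract.get_document_text_detection(
            JobId=job_id, NextToken=page["NextToken"]
        )
        blocks.extend(page.get("Blocks", []))
    return lines_from_blocks(blocks)


def save_as_utf8(s3, text, bucket, key):
    """Store the extracted text as a UTF-8 object. `s3` is a boto3 S3 client."""
    s3.put_object(Bucket=bucket, Key=key, Body=text.encode("utf-8"))
```

You would call it with something like `text = extract_pdf_text(boto3.client("textract"), "my-bucket", "input/doc.pdf")` followed by `save_as_utf8(boto3.client("s3"), text, "my-bucket", "extracted/doc.txt")`, and then point the input of your Comprehend asynchronous job at the `extracted/` prefix. Joining lines with newlines is a simplification; it drops layout information such as tables and columns, so check the output against a few of your PDFs.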
Hi, thanks for your quick response! I'm just wondering whether it is okay that I trained my model on PDF annotations rather than extracting the text first and annotating that. Would it still be accurate on the text extracted from the PDFs?
As you say, it's a good idea to thoroughly verify the accuracy of Amazon Textract first.