Want Real-Time Custom Entity Recognition for PDFs with Amazon Comprehend

Question

Hello, I have trained a Custom Entity Recognition model that took PDFs as input. I am creating a program that takes PDFs as input and then extracts specific entities from each PDF. In the future, this program would be used around 2 times per day for 20-30 one-page PDF documents. The problem is that I have been trying to do this with Python and boto3, using an asynchronous job. This takes around 1 minute per document, which is too slow. It is meant to be very fast - the user uploads their PDF documents and should then immediately receive the entities. I looked into batch jobs, but I don't know if they support PDF input; it looks like the format must be a text document. I also looked into endpoints, but I don't understand how to use them for PDFs. Can anyone tell me how I could go about doing this?

Answer

Hi.

Unfortunately, asynchronous batch processing requires UTF-8-formatted text files. The documentation states:

> Documents must be in UTF-8-formatted text files.

https://docs.aws.amazon.com/comprehend/latest/dg/concepts-processing-modes.html#how-async

Alternatively, you can use Amazon Textract to extract the text from the PDF and then send that text to Amazon Comprehend, as described in this blog post:

https://aws.amazon.com/jp/blogs/machine-learning/extracting-custom-entities-from-documents-with-amazon-textract-and-amazon-comprehend/
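
For the one-page PDFs described in the question, the Textract-then-Comprehend approach could look roughly like the sketch below. It is a minimal illustration, not the code from the blog post. It assumes you have already created a real-time Comprehend endpoint for your custom recognizer (the `endpoint_arn` argument), that each PDF is a single page so the synchronous Textract call accepts it, and that the extracted text fits within the size limit of the synchronous Comprehend call. The function names, the `run_on_file` helper, and the default region are hypothetical.

```python
from typing import Any, Dict, List


def extract_text(textract_client: Any, pdf_bytes: bytes) -> str:
    """Run synchronous Textract OCR on a single-page PDF and return its text."""
    response = textract_client.detect_document_text(Document={"Bytes": pdf_bytes})
    # Textract returns PAGE, LINE and WORD blocks; the LINE blocks are enough
    # to rebuild the plain text in reading order.
    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
    return "\n".join(lines)


def extract_entities(
    comprehend_client: Any, text: str, endpoint_arn: str
) -> List[Dict[str, Any]]:
    """Send plain text to a custom entity recognizer endpoint (real-time call)."""
    response = comprehend_client.detect_entities(Text=text, EndpointArn=endpoint_arn)
    return response["Entities"]


def entities_from_pdf(
    textract_client: Any, comprehend_client: Any, pdf_bytes: bytes, endpoint_arn: str
) -> List[Dict[str, Any]]:
    """Chain the two synchronous calls: PDF bytes -> text -> custom entities."""
    text = extract_text(textract_client, pdf_bytes)
    return extract_entities(comprehend_client, text, endpoint_arn)


def run_on_file(
    pdf_path: str, endpoint_arn: str, region: str = "us-east-1"
) -> List[Dict[str, Any]]:
    """Hypothetical convenience wrapper; the region default is an assumption."""
    import boto3  # imported lazily so the helpers above work with any client object

    textract = boto3.client("textract", region_name=region)
    comprehend = boto3.client("comprehend", region_name=region)
    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()
    return entities_from_pdf(textract, comprehend, pdf_bytes, endpoint_arn)
```

Because both calls are synchronous, each document is handled in a single request/response round trip rather than waiting on a job. Note that the endpoint must exist and be running before `detect_entities` is called with `EndpointArn`.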
