Want Real-Time Custom Entity Recognition for PDFs with Amazon Comprehend


Hello, I have trained a Custom Entity Recognition model that takes PDFs as input. I am building a program that takes PDFs as input and extracts specific entities from each one. In the future, this program would be used about twice a day on 20-30 one-page PDF documents. I have been trying to do this with Python and boto3, using an asynchronous job, but that takes around 1 minute per document, which is too slow. It is meant to be very fast: the user uploads their PDF documents and should receive the entities immediately. I looked into batch jobs, but I don't know whether they support PDF input; it looks like the input must be a text document. I also looked into endpoints, but I don't understand how to use them for PDFs. Can anyone tell me how I could go about doing this?

asked 2 years ago, 495 views
1 Answer

Hi.

Unfortunately, asynchronous batch jobs require UTF-8-formatted text files as input.

The documentation states: "Documents must be in UTF-8-formatted text files."

https://docs.aws.amazon.com/comprehend/latest/dg/concepts-processing-modes.html#how-async

Alternatively, you can use Amazon Textract to extract the text from each PDF and then send that text to Amazon Comprehend, as described here:

https://aws.amazon.com/jp/blogs/machine-learning/extracting-custom-entities-from-documents-with-amazon-textract-and-amazon-comprehend/
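
A minimal sketch of that flow in Python with boto3, in case it helps. It assumes each PDF is a single page (so the synchronous Textract call can be used) and that you have already created a real-time Comprehend endpoint for your custom recognizer. The helper names (`extract_text`, `detect_custom_entities`, `entities_from_pdf`, `make_clients`) are made up for illustration, and you would pass in your own endpoint ARN. Treat it as a starting point, not a tested production setup.

```python
def extract_text(textract_client, pdf_bytes):
    """Run synchronous Textract text detection and join the LINE blocks."""
    response = textract_client.detect_document_text(Document={"Bytes": pdf_bytes})
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
    return "\n".join(lines)


def detect_custom_entities(comprehend_client, text, endpoint_arn):
    """Send plain text to a custom entity recognizer endpoint (real-time)."""
    response = comprehend_client.detect_entities(Text=text, EndpointArn=endpoint_arn)
    return [
        {"type": e["Type"], "text": e["Text"], "score": e["Score"]}
        for e in response.get("Entities", [])
    ]


def entities_from_pdf(pdf_bytes, endpoint_arn, textract_client, comprehend_client):
    """PDF bytes -> extracted text -> custom entities."""
    text = extract_text(textract_client, pdf_bytes)
    if not text:
        return []  # nothing to analyze (e.g. blank page)
    return detect_custom_entities(comprehend_client, text, endpoint_arn)


def make_clients(region_name=None):
    """Create the two boto3 clients; imported lazily so the helpers stay testable."""
    import boto3

    return (
        boto3.client("textract", region_name=region_name),
        boto3.client("comprehend", region_name=region_name),
    )
```

To use it, read the uploaded file with `open(path, "rb").read()`, call `make_clients()` once, and pass the bytes and your endpoint ARN to `entities_from_pdf`. Both calls are synchronous, so there is no job to poll. Keep in mind that a Comprehend endpoint is billed while it is running.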

EXPERT
iwasa
answered 2 years ago
  • Hi, thanks for your quick response! I'm just wondering whether it is okay that I trained my model using PDF annotations rather than extracting the text first and then annotating that. Would it still be accurate on the text extracted from the PDFs?

  • As you say, it's a good idea to thoroughly verify the accuracy of Amazon Textract first.
