Want Real-Time Custom Entity Recognition for PDFs with Amazon Comprehend

Hello, I have trained a Custom Entity Recognition model that takes PDFs as input. I am building a program that accepts PDFs and extracts specific entities from each one. In the future, it would be used about twice a day on 20-30 one-page PDF documents. So far I have been using Python and boto3 with an asynchronous job, which takes around 1 minute per document. That is too slow: the user should upload their PDF documents and receive the entities immediately. I looked into batch jobs, but I don't know whether they support PDF input; it looks like the input must be a text document. I also looked into endpoints, but I don't understand how to use them with PDFs. Can anyone tell me how I could go about doing this?

asked 2 years ago · 506 views
1 Answer

Hi.

Unfortunately, asynchronous batch processing requires UTF-8-formatted text files as input. The documentation states:

Documents must be in UTF-8-formatted text files.

https://docs.aws.amazon.com/comprehend/latest/dg/concepts-processing-modes.html#how-async

Alternatively, you can use Amazon Textract to extract the text from the PDF and then send that text to Amazon Comprehend, as described in this post:

https://aws.amazon.com/jp/blogs/machine-learning/extracting-custom-entities-from-documents-with-amazon-textract-and-amazon-comprehend/
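As a rough illustration of that Textract-then-Comprehend flow, here is a minimal sketch in Python. It assumes the custom model has been deployed to a real-time Comprehend endpoint, that each PDF is a single page (which, as far as I know, is what the synchronous Textract call accepts), and that the extracted text fits within Comprehend's real-time size limit. The function names `text_from_textract` and `extract_entities` and the output dictionary shape are hypothetical, chosen only for this example; the clients are passed in so the logic can be tried without AWS credentials.

```python
from typing import Any, Dict, List


def text_from_textract(response: Dict[str, Any]) -> str:
    """Join the LINE blocks of a Textract response into plain text."""
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
    return "\n".join(lines)


def extract_entities(
    pdf_bytes: bytes,
    endpoint_arn: str,
    textract: Any,
    comprehend: Any,
) -> List[Dict[str, Any]]:
    """Extract text from a one-page PDF, then run custom entity recognition.

    `textract` and `comprehend` are expected to be boto3 clients, e.g.
    boto3.client("textract") and boto3.client("comprehend").
    `endpoint_arn` is the ARN of an endpoint created for the custom model
    (an assumption: the endpoint must already exist).
    """
    # Synchronous call, so there is no job to poll.
    response = textract.detect_document_text(Document={"Bytes": pdf_bytes})
    text = text_from_textract(response)
    if not text:
        return []  # nothing to analyze

    # With EndpointArn set, the custom model behind the endpoint is used.
    result = comprehend.detect_entities(Text=text, EndpointArn=endpoint_arn)
    return [
        {"type": e["Type"], "text": e["Text"], "score": e["Score"]}
        for e in result["Entities"]
    ]
```

To call it for an uploaded file, you would read the PDF as bytes and pass `boto3.client("textract")`, `boto3.client("comprehend")`, and your own endpoint ARN. Because both calls are synchronous, 20-30 one-page documents can be processed in a simple loop. Whether a model trained on PDF annotations performs well on Textract output is something to check on your own documents.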

EXPERT
iwasa
answered 2 years ago
  • Hi, thanks for your quick response! I'm just wondering whether it is okay that I trained my model on PDF annotations rather than extracting the text first and annotating that. Would it still be accurate on the text extracted from the PDFs?

  • As you say, it's a good idea to thoroughly verify the accuracy of Amazon Textract first.
