Real-Time Custom Entity Recognition for PDFs with Amazon Comprehend

Hello, I have trained a Custom Entity Recognition model that takes PDFs as input. I am creating a program that takes PDFs as input and then extracts specific entities from each PDF. In the future, this program would be used around 2 times per day for 20-30 one-page PDF documents. The problem is that I have been trying to do this with Python and boto3, using an asynchronous job. This takes around 1 minute per document, which is too slow. It is meant to be very fast: the user uploads their PDF documents and should receive the entities immediately. I looked into batch jobs, but I don't know if they support PDF input; it looks like the format must be a text document. I also looked into endpoints, but I don't understand how to use them for PDFs. Can anyone tell me how I could go about doing this?

asked 2 years ago · 506 views
1 Answer
Hi.

Unfortunately, asynchronous batch processing requires UTF-8-formatted text files. From the documentation:

"Documents must be in UTF-8-formatted text files."

https://docs.aws.amazon.com/comprehend/latest/dg/concepts-processing-modes.html#how-async

Alternatively, you can use Amazon Textract to extract the text from the PDF and then send that text to Amazon Comprehend, as described here:

https://aws.amazon.com/jp/blogs/machine-learning/extracting-custom-entities-from-documents-with-amazon-textract-and-amazon-comprehend/
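The Textract-then-Comprehend flow could be sketched roughly as below, assuming one-page PDFs (as in the question) and a real-time Comprehend endpoint already created for the custom model. The function names, the `path` argument, and the `endpoint_arn` value are illustrative assumptions, not taken from the blog post; the boto3 calls used are `detect_document_text` (Textract) and `detect_entities` with `EndpointArn` (Comprehend).

```python
def extract_text(textract_client, pdf_bytes):
    """Run synchronous Textract OCR on a single-page document and return its text."""
    response = textract_client.detect_document_text(Document={"Bytes": pdf_bytes})
    # Keep only LINE blocks; PAGE and WORD blocks would duplicate or omit text.
    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
    return "\n".join(lines)


def detect_custom_entities(comprehend_client, text, endpoint_arn):
    """Send plain text to a custom entity recognizer endpoint (real-time call)."""
    response = comprehend_client.detect_entities(Text=text, EndpointArn=endpoint_arn)
    return [(e["Type"], e["Text"], e["Score"]) for e in response["Entities"]]


def entities_from_pdf(path, endpoint_arn):
    """Hypothetical helper: read a one-page PDF from disk and return its entities."""
    import boto3  # imported here so the helpers above can be tested without AWS

    textract = boto3.client("textract")
    comprehend = boto3.client("comprehend")
    with open(path, "rb") as f:
        text = extract_text(textract, f.read())
    return detect_custom_entities(comprehend, text, endpoint_arn)
```

Because both calls are synchronous, each document is processed as soon as it is uploaded rather than waiting for an asynchronous job to finish. A call might look like `entities_from_pdf("invoice.pdf", endpoint_arn)`, where `endpoint_arn` is the ARN of your own endpoint. Since the text the model sees now comes from Textract, it is worth checking the results on a few known documents before relying on them.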

EXPERT
iwasa
answered 2 years ago
  • Hi, thanks for your quick response! I'm just wondering whether it is okay that I trained my model using PDF annotations, rather than extracting the text first and then annotating that. Would it still be accurate on the text extracted from the PDFs?

  • As you say, it's a good idea to thoroughly verify the accuracy of Amazon Textract first.
