Want Real-Time Custom Entity Recognition for PDFs with Amazon Comprehend

Hello, I have trained a Custom Entity Recognition model that takes PDFs as input. I am creating a program that takes PDFs as input and extracts specific entities from each one. In the future, this program would be used around 2 times per day for 20-30 one-page PDF documents. The problem is that I have been trying to do this with Python and boto3 using an asynchronous job, which takes around 1 minute per document. That is too slow: the user uploads their PDF documents and should receive the entities immediately. I looked into batch jobs, but I don't know if they support PDF input; it looks like the input must be a text document. I also looked into endpoints, but I don't understand how to use them for PDFs. Can anyone tell me how I could go about doing this?

Asked 2 years ago, 506 views
1 Answer

Hi.

Unfortunately, asynchronous batch processing requires UTF-8-formatted text files. The documentation states:

"Documents must be in UTF-8-formatted text files."

https://docs.aws.amazon.com/comprehend/latest/dg/concepts-processing-modes.html#how-async

Alternatively, you can use Amazon Textract to extract the text from the PDF and then send that text to Amazon Comprehend, as described in this blog post:

https://aws.amazon.com/jp/blogs/machine-learning/extracting-custom-entities-from-documents-with-amazon-textract-and-amazon-comprehend/
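
For one-page PDFs, the Textract-then-Comprehend flow described above could be sketched in Python with boto3 roughly as follows. This is a minimal sketch under several assumptions, not a tested implementation: it assumes the custom model has already been deployed to a real-time Comprehend endpoint (`endpoint_arn` is a placeholder you would supply), that each PDF is a single page so Textract's synchronous `detect_document_text` call applies, and that the extracted text fits within the real-time request size limits. The function names `lines_from_textract` and `extract_entities_from_pdf` are made up for illustration.

```python
def lines_from_textract(response):
    """Join the LINE blocks of a Textract response into plain text."""
    return "\n".join(
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    )


def extract_entities_from_pdf(pdf_bytes, endpoint_arn, textract=None, comprehend=None):
    """Extract text from a one-page PDF, then run custom entity recognition.

    `endpoint_arn` is assumed to be the ARN of a real-time endpoint that
    hosts the custom entity recognizer. Clients can be passed in (for
    example in tests); otherwise they are created with boto3.
    """
    if textract is None or comprehend is None:
        import boto3  # imported lazily so the helpers can be tested offline

        textract = textract or boto3.client("textract")
        comprehend = comprehend or boto3.client("comprehend")

    # Synchronous text detection on the raw document bytes
    # (assumption: the PDF has exactly one page).
    response = textract.detect_document_text(Document={"Bytes": pdf_bytes})
    text = lines_from_textract(response)

    # Real-time inference against the custom model's endpoint.
    result = comprehend.detect_entities(Text=text, EndpointArn=endpoint_arn)
    return result["Entities"]


if __name__ == "__main__":
    import sys

    # Usage: python extract.py document.pdf <endpoint-arn>
    with open(sys.argv[1], "rb") as f:
        for entity in extract_entities_from_pdf(f.read(), sys.argv[2]):
            print(entity["Type"], entity["Text"], entity["Score"])
```

Both calls here are synchronous, so each document is handled in a single request/response round trip rather than waiting for an asynchronous job to be scheduled and completed. With 20-30 documents per upload, the function could simply be called in a loop.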

iwasa (Expert)
Answered 2 years ago
  • Hi, thanks for your quick response! I'm just wondering whether it is okay that I trained my model using PDF annotations rather than extracting the text first and then annotating that. Would it still be accurate on the text extracted from the PDFs?

  • As you say, it's a good idea to thoroughly verify the accuracy of Amazon Textract first.
