Hello, I have trained a Custom Entity Recognition model that takes PDFs as input. I am creating a program that takes PDFs as input and then extracts specific entities from each PDF. In the future, this program would be used around 2 times per day for 20-30 one-page PDF documents. The problem is that I have been trying to do this with Python and boto3, using an asynchronous job. This takes around 1 minute per document, which is too slow. It is meant to be very fast: the user uploads their PDF documents and should immediately receive the entities. I looked into batch jobs, but I don't know if they support PDF input; it looks like the format must be a text document. I also looked into endpoints, but I don't understand how to use them for PDFs. Can anyone tell me how I could go about doing this?
Hi, thanks for your quick response! I'm just wondering whether it is okay that I trained my model using PDF annotations rather than extracting the text first and then annotating that. Would it still be accurate on the text extracted from the PDFs?
As you say, it's a good idea to thoroughly verify the accuracy of Amazon Textract first.
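For anyone landing on this thread, here is a rough sketch of the flow being discussed: pull the text out of a one-page PDF with Amazon Textract's synchronous `detect_document_text` call, then send that text to a Comprehend real-time endpoint with `detect_entities`. The function names, the file path and the endpoint ARN are placeholders I made up for illustration, and the sketch assumes your custom model gives usable results on plain extracted text, which is exactly the thing worth verifying first.

```python
def extract_text(textract_client, pdf_bytes):
    """Join the LINE blocks Textract returns for a single-page PDF."""
    response = textract_client.detect_document_text(Document={"Bytes": pdf_bytes})
    lines = [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]
    return "\n".join(lines)


def detect_custom_entities(comprehend_client, text, endpoint_arn):
    """Call a custom entity recognizer endpoint and return (type, text, score) tuples."""
    response = comprehend_client.detect_entities(Text=text, EndpointArn=endpoint_arn)
    return [(e["Type"], e["Text"], e["Score"]) for e in response["Entities"]]


def process_pdf(textract_client, comprehend_client, pdf_bytes, endpoint_arn):
    """Textract first, then Comprehend, for one document."""
    text = extract_text(textract_client, pdf_bytes)
    return detect_custom_entities(comprehend_client, text, endpoint_arn)


def run(pdf_path, endpoint_arn):
    """Hypothetical entry point: needs boto3 installed and AWS credentials configured."""
    import boto3  # imported here so the helpers above can be tested without AWS

    textract = boto3.client("textract")
    comprehend = boto3.client("comprehend")
    with open(pdf_path, "rb") as f:
        return process_pdf(textract, comprehend, f.read(), endpoint_arn)
```

Calling `run("invoice.pdf", "<your endpoint ARN>")` would return a list of entity tuples for that document. Passing the clients in as arguments also makes it easy to compare the results on a handful of annotated PDFs before relying on this in production.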