Can Amazon Comprehend extract data from documents?


Hi! My team and I have the following scenario: we want to extract some fields from several PDF documents that may or may not follow the same layout. For example, let's say we want to extract these 3 fields from these documents:

Enter image description here

So, we have a Name, a Code (called CNPJ) identifying that entity, and its Address. Obviously, the values of these fields vary between documents, but the CNPJ always keeps the same format; only the sequence of digits changes. During our research to solve this challenge, we came across Amazon Comprehend and its Custom Named Entity Recognition. Our idea was to create these three entities (Name, CNPJ and Address) using a Ground Truth Labeling Job.
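Because the CNPJ keeps a fixed format, that one field can probably be pulled out with a plain pattern match, without any trained model. A minimal sketch, assuming the usual numeric layout of 14 digits written as `NN.NNN.NNN/NNNN-NN`, with the punctuation treated as optional; `find_cnpjs` is a hypothetical helper name, and it does not validate the check digits:

```python
import re
from typing import List

# Assumption: a CNPJ is 14 digits, usually punctuated as NN.NNN.NNN/NNNN-NN.
# The punctuation is optional here so that unformatted numbers also match.
CNPJ_PATTERN = re.compile(r"\b\d{2}\.?\d{3}\.?\d{3}/?\d{4}-?\d{2}\b")


def find_cnpjs(text: str) -> List[str]:
    """Return every CNPJ-shaped substring, normalized to digits only."""
    return [re.sub(r"\D", "", match.group()) for match in CNPJ_PATTERN.finditer(text)]
```

This only checks the shape of the number, so any other 14-digit value in a document would match as well.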

To do this, we ran some of our PDFs through Textract, generating a .txt file for each one, and then uploaded these files to an S3 bucket. After that, we created the Labeling Job, using the automated data setup to generate the input manifest file so the labeling could start. Because I supplied many .txt files, each line in these files was treated as a separate object, resulting in more than 7700 objects to be labeled. Approximately 90% of these objects had nothing to label, so I had to keep skipping lines until I reached one that did, and the large number of objects also made the job very expensive.
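One way to avoid the one-object-per-line behaviour might be to write the input manifest by hand instead of using the automated data setup. A minimal sketch, assuming the Ground Truth input manifest is a JSON Lines file in which each line carries one text object under the `source` key; `build_manifest_lines` is a hypothetical helper, and very long documents may still need to be split into smaller chunks:

```python
import json
from typing import Dict, List


def build_manifest_lines(documents: Dict[str, str]) -> List[str]:
    """Build one manifest line per document instead of one per text line.

    `documents` maps a file name to the full text extracted from that file.
    """
    lines = []
    for name in sorted(documents):
        # Collapse line breaks and repeated spaces so the whole document
        # becomes a single labeling object.
        text = " ".join(documents[name].split())
        if text:  # skip documents with no text at all
            lines.append(json.dumps({"source": text}, ensure_ascii=False))
    return lines
```

Writing the returned lines to a file, one per line, and uploading it to S3 would give the labeling job one object per document rather than thousands of mostly empty ones.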

So, I have a few questions. For starters, was Amazon Comprehend a good choice for this job? If it wasn't, what would be the best solution? If it was a good choice, what could I have done to optimize the labeling job? Were the "useless" objects really necessary?

asked 2 years ago · 544 views
1 Answer

I'm not sure you need Amazon Comprehend to achieve your goals. Amazon Textract supports 'Form Extraction' which is designed to find key-value pairs as in your example. Take a look at the docs: https://docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html
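As a rough illustration of what that looks like with boto3, here is a minimal sketch that calls `analyze_document` with the `FORMS` feature and flattens the returned `KEY_VALUE_SET` blocks into a dictionary. The helper names (`key_value_pairs`, `analyze_forms`) and the bucket and key arguments are hypothetical. The synchronous call is assumed here to be handling a single-page document; multi-page PDFs would likely need the asynchronous `start_document_analysis` API instead.

```python
from typing import Dict, List


def _text_of(block: dict, blocks_by_id: Dict[str, dict]) -> str:
    """Join the WORD children of a block into one string."""
    words: List[str] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(response: dict) -> Dict[str, str]:
    """Turn the KEY_VALUE_SET blocks of a Textract response into {key: value}."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs: Dict[str, str] = {}
    for block in response["Blocks"]:
        is_key = block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", [])
        if not is_key:
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":  # a KEY block points at its VALUE block(s)
                for value_id in rel["Ids"]:
                    value_parts.append(_text_of(blocks_by_id[value_id], blocks_by_id))
        pairs[_text_of(block, blocks_by_id)] = " ".join(p for p in value_parts if p)
    return pairs


def analyze_forms(bucket: str, key: str) -> Dict[str, str]:
    """Run form extraction on a document stored in S3 (requires AWS credentials)."""
    import boto3  # imported here so the parser above can be used without boto3

    client = boto3.client("textract")
    response = client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS"],
    )
    return key_value_pairs(response)
```

Whether the Name, CNPJ and Address come back as clean pairs depends on how the documents are laid out, so it is worth checking the output on a few samples first.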

I hope this helps!

Alex_K (AWS)
answered 2 years ago
Chris_G (AWS, EXPERT)
reviewed 2 years ago
