Can Amazon Comprehend extract data from documents?


Hi! My team and I have the following scenario: we want to extract some fields from several PDF documents that may or may not follow the same pattern. As an example, let's say we want to extract these 3 fields from these documents:

[Image: example document with the three fields to extract]

So, we have a Name, a Code (called CNPJ) for this person, and its Address. Obviously, these fields would vary between documents, but the CNPJ would always keep its format, only changing the sequence of numbers. During our research to solve this challenge, we came across Amazon Comprehend and its Custom Named Entity Recognition. Our idea was to create these three entities - Name, CNPJ and Address - using a Ground Truth Labeling Job.

To do this, we ran Amazon Textract on some of our PDFs, generating a .txt file for each one, and uploaded these files to an S3 bucket. We then created the labeling job, using the automated data setup to generate the input manifest file so that labeling could start. Because I supplied many .txt files and each line in them was treated as a separate object, the job ended up with more than 7700 objects to label. Approximately 90% of these objects had nothing to label, so I had to keep skipping lines until I reached one that did, and the high object count also made the job very expensive.

So, I have a few questions. For starters, was Amazon Comprehend a good choice for this job? If it wasn't, what would be the best solution? If it was a good choice, what could I have done to optimize the labeling job? Were the "useless" objects really necessary?

asked 2 years ago · 539 views
1 Answer

I'm not sure you need Amazon Comprehend to achieve your goals. Amazon Textract supports 'Form Extraction' which is designed to find key-value pairs as in your example. Take a look at the docs: https://docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html
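As a rough illustration of what form extraction could look like from Python, here is a minimal sketch using boto3. The helper names (`key_value_pairs`, `analyze_forms`) and the bucket/object arguments are made up for this example. It assumes the response follows the documented block structure, where `KEY_VALUE_SET` blocks link to their `WORD` children and to their value blocks. The synchronous `analyze_document` call is used for brevity; multi-page PDFs would need the asynchronous `start_document_analysis` / `get_document_analysis` calls instead.

```python
from typing import Dict, List


def _text_of(block: dict, by_id: Dict[str, dict]) -> str:
    """Join the text of a block's WORD children into one string."""
    words: List[str] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(response: dict) -> Dict[str, str]:
    """Map each detected form key to the text of its value (hypothetical helper)."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs: Dict[str, str] = {}
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_parts: List[str] = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(_text_of(by_id[value_id], by_id))
        pairs[_text_of(block, by_id)] = " ".join(p for p in value_parts if p)
    return pairs


def analyze_forms(bucket: str, name: str) -> Dict[str, str]:
    """Run form extraction on one S3 object; bucket and name are placeholders."""
    import boto3  # imported lazily so the parsing helpers work without AWS access

    client = boto3.client("textract")
    response = client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": name}},
        FeatureTypes=["FORMS"],
    )
    return key_value_pairs(response)
```

The returned dictionary uses whatever label text appears on the page as the key, so fields such as Name, CNPJ and Address would still need to be matched against the labels your documents actually use.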

I hope this helps!

AWS
Alex_K
answered 2 years ago
AWS
EXPERT
Chris_G
reviewed 2 years ago
