Can Amazon Comprehend extract data from documents?


Hi! My team and I have the following scenario: we want to extract some fields from several PDF documents that may or may not follow the same pattern. For example, let's say we want to extract these 3 fields from these documents:

[Image: example of the three fields in a document]

So, we have a Name, a Code (called CNPJ) for this person, and their Address. The values of these fields vary between documents, but the CNPJ always keeps its format; only the sequence of digits changes. During our research to solve this challenge, we came across Amazon Comprehend and its Custom Named Entity Recognition. Our idea was to create these three entities (Name, CNPJ and Address) using a Ground Truth Labeling Job.
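Because the CNPJ layout is fixed, that one field can be described by a pattern. Here is a minimal sketch, assuming the CNPJ appears in its usual punctuated form `00.000.000/0000-00`. The function name `find_cnpjs` is hypothetical, and the check digits are not validated:

```python
import re

# Assumption: CNPJ is written in the punctuated form 00.000.000/0000-00.
# This only matches the layout; it does not verify the two check digits.
CNPJ_PATTERN = re.compile(r"\b\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}\b")


def find_cnpjs(text):
    """Return every substring of `text` that has the CNPJ layout."""
    return CNPJ_PATTERN.findall(text)
```

Name and Address have no such fixed layout, which is why they would still need some other extraction approach.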

To do this, we ran some of our PDFs through Textract, generating a .txt file for each one, and then uploaded these files to an S3 bucket. After that, we created the Labeling Job, using the Automated data setup to generate the input manifest file so the labeling could start. Because I supplied many .txt files, each line in these files was recognized as a separate object, resulting in more than 7700 objects to be labeled. Approximately 90% of these objects didn't have anything to label, so I had to keep skipping lines until I reached one that did, and the high number of objects also made the job very expensive.

So, I have a few questions. For starters, was Amazon Comprehend a good choice for this job? If it wasn't, what would be the best solution? If it was a good choice, what could I have done to optimize the labeling job? Were the "useless" objects really necessary?

Asked 2 years ago, 544 views
1 Answer

I'm not sure you need Amazon Comprehend to achieve your goals. Amazon Textract supports 'Form Extraction' which is designed to find key-value pairs as in your example. Take a look at the docs: https://docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html
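As a rough sketch of what that could look like with boto3: `analyze_document` with `FeatureTypes=["FORMS"]` returns a list of blocks, and the key-value pairs have to be assembled by following the `VALUE` and `CHILD` relationships. The helper names (`analyze_forms`, `key_value_pairs`, `block_text`) are hypothetical, and the bucket and object names are whatever you use. Note that `analyze_document` is the synchronous call; multi-page PDFs need the asynchronous `start_document_analysis` instead.

```python
def block_text(block, blocks_by_id):
    """Join the WORD children of a KEY or VALUE block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(response):
    """Build a {key text: value text} dict from a Textract FORMS response."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        # A KEY block points at its VALUE block(s) via a VALUE relationship.
        values = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    values.append(block_text(blocks_by_id[value_id], blocks_by_id))
        pairs[block_text(block, blocks_by_id)] = " ".join(values)
    return pairs


def analyze_forms(bucket, name):
    """Call Textract on one S3 object (hypothetical wrapper; needs AWS credentials)."""
    import boto3  # imported here so the parsing above can be tried without boto3

    client = boto3.client("textract")
    return client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": name}},
        FeatureTypes=["FORMS"],
    )
```

Calling `key_value_pairs(analyze_forms(bucket, name))` would then give you a dictionary you can search for the Name, CNPJ and Address labels, assuming they are printed as labeled fields in your documents.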

I hope this helps!

Alex_K (AWS)
Answered 2 years ago
Chris_G (AWS, Expert)
Reviewed 2 years ago
