Can Amazon Comprehend extract data from documents?

Hi! My team and I have the following scenario: we want to extract some fields from several PDF documents, that may or may not follow the same pattern. To exemplify, let's say we want to extract these 3 fields from these documents:

[Image: example of the three fields in a document]

So, we have a Name, a Code (called CNPJ) for this person, and its Address. Obviously, these fields would vary between documents, but the CNPJ would always keep its format, only changing the sequence of numbers. During our research to solve this challenge, we came across Amazon Comprehend and its Custom Named Entity Recognition. Our idea was to create these three entities - Name, CNPJ and Address - using a Ground Truth Labeling Job.

To do this, we ran Amazon Textract on some of our PDFs, generating a .txt file for each one, and then uploaded these files to an S3 bucket. After that, we created the labeling job, using the Automated data setup to generate the input manifest file so the labeling could start. However, because I supplied many .txt files, each line in these files was treated as a separate object, resulting in more than 7700 objects to label. Approximately 90% of these objects had nothing to label, so I had to keep skipping lines until I reached one that did, and the high number of objects also made the job very expensive.

So, I have a few questions. For starters, was Amazon Comprehend a good choice for this job? If it wasn't, what would be the best solution? If it was a good choice, what could I have done to optimize the labeling job? Were the "useless" objects really necessary?
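On the point that the CNPJ always keeps its format: a fixed-format field like this could probably be pulled from the extracted text with a regular expression, without any labeling. Below is a minimal sketch, assuming the usual `NN.NNN.NNN/NNNN-NN` layout (14 digits, punctuation optional). The names `CNPJ_PATTERN` and `find_cnpjs` are hypothetical, and the pattern checks only the shape, not the check digits.

```python
import re

# Assumption: a CNPJ is 14 digits, usually written NN.NNN.NNN/NNNN-NN.
# The separators are optional so that unformatted numbers also match.
CNPJ_PATTERN = re.compile(r"(?<!\d)\d{2}\.?\d{3}\.?\d{3}/?\d{4}-?\d{2}(?!\d)")


def find_cnpjs(text: str) -> list[str]:
    """Return every CNPJ-shaped string in `text`, normalised to digits only."""
    return [re.sub(r"\D", "", match.group()) for match in CNPJ_PATTERN.finditer(text)]
```

This would run over the .txt output of each PDF. Name and Address have no fixed shape, so they would still need another approach.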

Asked 2 years ago, 544 views
1 Answer

I'm not sure you need Amazon Comprehend to achieve your goals. Amazon Textract supports form extraction (the FORMS feature of its document analysis API), which is designed to find key-value pairs like the ones in your example. Take a look at the docs: https://docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html
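As a rough illustration of form extraction, here is a minimal sketch using boto3. It assumes a single-page document already stored in S3 (multi-page PDFs would need the asynchronous API instead), and the helper names `analyze_forms`, `key_value_pairs` and `_text_of` are hypothetical, not part of any AWS library. The parsing relies on Textract returning `KEY_VALUE_SET` blocks whose `VALUE` and `CHILD` relationships point to the value block and to the `WORD` blocks.

```python
def _text_of(block, blocks_by_id):
    """Join the WORD children of a block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(response):
    """Turn a Textract FORMS response into a {key text: value text} dict."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        values = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    values.append(_text_of(blocks_by_id[value_id], blocks_by_id))
        pairs[_text_of(block, blocks_by_id)] = " ".join(values)
    return pairs


def analyze_forms(bucket, name):
    """Call Textract on one S3 object (bucket and name are placeholders)."""
    import boto3  # imported here so the parser above works without AWS access

    client = boto3.client("textract")
    return client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": name}},
        FeatureTypes=["FORMS"],
    )
```

Calling `key_value_pairs(analyze_forms("my-bucket", "doc.png"))` would then give a dictionary to look up Name, CNPJ and Address in, provided those labels actually appear next to the values in the document.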

I hope this helps!

Alex_K (AWS)
answered 2 years ago
Chris_G (AWS, Expert)
reviewed 2 years ago
