Passing Ground Truth PDF labelling task output to Comprehend custom entity training


I want to parse PDFs and identify custom named entities. To do that, I set up Ground Truth to manually annotate a few PDF files, following this guide: https://docs.aws.amazon.com/comprehend/latest/dg/cer-annotation-pdf.html

It created an output folder containing a folder named <job-id>, which in turn contains manifests and annotations folders. When I tried to set up the Comprehend custom entity recognition training job, it asked for several input locations, and I don't know the right values for these fields. From the output in the S3 bucket I came up with the values below, but I don't know what to enter for attribute names.

If possible, can someone share a blog post or video tutorial on PDF annotation and on training a recognizer from PDF input?

  1. SageMaker Ground Truth augmented manifest file S3 location: s3://comprehend-semi-structured-docs-<region>-<id>/output/labeling-job-20221025T135656/manifests/output/output.manifest

  2. S3 prefix for Annotation data files: s3://comprehend-semi-structured-docs-<region>-<id>/output/labeling-job-20221025T135656/annotations

  3. S3 prefix for Source documents: s3://comprehend-semi-structured-docs-<region>-<id>/src

  4. Attribute names: ?

asked a year ago, 381 views
1 Answer
Accepted Answer

Some of the answers are in these two blog posts:

  1. https://aws.amazon.com/blogs/machine-learning/custom-document-annotation-for-extracting-named-entities-in-documents-using-amazon-comprehend/

After you label all the pages, you can find annotations in JSON format in the Amazon S3 location s3://comprehend-semi-structured-docs-<AWS Region>-<AWS Account number>/output/<your labeling job>/annotations/consolidated-annotation/consolidation-response/iteration-1/annotations/

You can find your output manifest file in the Amazon S3 location s3://comprehend-semi-structured-docs-<AWS Region>-<AWS Account number>/output/<your labeling job>/manifests/output/

  2. https://aws.amazon.com/pt/blogs/machine-learning/extract-custom-entities-from-documents-in-their-native-format-with-amazon-comprehend/

The attribute name is the labeling job name, for example labeling-job-name-123.
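
One way to double-check the attribute name is to look at the top-level keys in a line of output.manifest. The following is a minimal sketch, assuming the manifest is in JSON Lines format and that the labeling attribute appears next to a companion key with a `-metadata` suffix, as Ground Truth output usually does. The sample line and the helper name `candidate_attribute_names` are made up for illustration and are not copied from a real job.

```python
import json


def candidate_attribute_names(manifest_text):
    """Return top-level keys that have a matching '<key>-metadata' companion key."""
    names = set()
    for line in manifest_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        for key in record:
            if f"{key}-metadata" in record:
                names.add(key)
    return sorted(names)


# Hypothetical manifest line (assumption: real lines carry more fields than this)
sample = json.dumps({
    "source-ref": "s3://example-bucket/src/doc.pdf",
    "labeling-job-name-123": {
        "annotation-ref": "s3://example-bucket/annotations/doc.json"
    },
    "labeling-job-name-123-metadata": {"job-name": "labeling-job-name-123"},
})

print(candidate_attribute_names(sample))
```

If the key found this way differs from the job name, the key in the manifest is probably the safer value to enter.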

Comprehend Training Input Location Fields

SageMaker Ground Truth augmented manifest file S3 location

s3://comprehend-semi-structured-docs-<region>-<id>/output/resume-labeling-job-20221025T135656/manifests/output/output.manifest

S3 prefix for Annotation data files

s3://comprehend-semi-structured-docs-<region>-<id>/output/resume-labeling-job-20221025T135656/annotations

S3 prefix for Source documents

s3://comprehend-semi-structured-docs-<region>-<id>/src

Attribute names

The Ground Truth labeling job name.
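
The same four fields can also be supplied programmatically. Here is a rough sketch of how they might map onto the `InputDataConfig` argument of boto3's `create_entity_recognizer`, assuming the folder layout described above. The helper names, the entity type `SKILL`, the recognizer name and the role ARN are placeholders for illustration, not values from the original job.

```python
def build_input_data_config(bucket, job_name, entity_types):
    """Build InputDataConfig for a PDF (semi-structured) augmented manifest."""
    base = f"s3://{bucket}/output/{job_name}"
    return {
        "DataFormat": "AUGMENTED_MANIFEST",
        "EntityTypes": [{"Type": t} for t in entity_types],
        "AugmentedManifests": [
            {
                # Ground Truth augmented manifest file S3 location
                "S3Uri": f"{base}/manifests/output/output.manifest",
                # S3 prefix for annotation data files
                "AnnotationDataS3Uri": f"{base}/annotations",
                # S3 prefix for source documents
                "SourceDocumentsS3Uri": f"s3://{bucket}/src",
                # Attribute name: assumed equal to the labeling job name
                "AttributeNames": [job_name],
                "DocumentType": "SEMI_STRUCTURED_DOCUMENT",
                "Split": "TRAIN",
            }
        ],
    }


def create_recognizer(config, role_arn, recognizer_name):
    """Start training; needs boto3, AWS credentials and a suitable IAM role."""
    import boto3  # imported here so the config helper works without boto3

    client = boto3.client("comprehend")
    return client.create_entity_recognizer(
        RecognizerName=recognizer_name,
        DataAccessRoleArn=role_arn,
        InputDataConfig=config,
        LanguageCode="en",
    )


config = build_input_data_config(
    bucket="comprehend-semi-structured-docs-<region>-<id>",  # fill in your bucket
    job_name="resume-labeling-job-20221025T135656",
    entity_types=["SKILL"],  # hypothetical entity type
)
print(config["AugmentedManifests"][0]["AttributeNames"])
```

Calling `create_recognizer(config, role_arn, recognizer_name)` with your own role ARN and recognizer name would then submit the training job; the entity types must match the labels used during annotation.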

answered a year ago
