Passing Ground Truth PDF labelling task output to Comprehend custom entity training


I want to parse PDFs and identify custom named entities. To do that, I set up Ground Truth to manually annotate a few PDF files, following this guide: https://docs.aws.amazon.com/comprehend/latest/dg/cer-annotation-pdf.html

The labeling job created an output folder containing a folder named <job-id>, which in turn contains manifests and annotations folders. When I tried to set up the Comprehend custom entity recognition training job, it asked for several input locations, and I don't know the right values for these fields. From the output in the S3 bucket I came up with the values below, but I don't know what to enter for attribute names.

If possible, can someone share a blog post or video tutorial on PDF annotation and on training a model from PDF input?

  1. SageMaker Ground Truth augmented manifest file S3 location: s3://comprehend-semi-structured-docs-<region>-<id>/output/labeling-job-20221025T135656/manifests/output/output.manifest

  2. S3 prefix for Annotation data files: s3://comprehend-semi-structured-docs-<region>-<id>/output/labeling-job-20221025T135656/annotations

  3. S3 prefix for Source documents: s3://comprehend-semi-structured-docs-<region>-<id>/src

  4. Attribute names: ?

Accepted answer

I found most of the answers in these two blog posts:

  1. https://aws.amazon.com/blogs/machine-learning/custom-document-annotation-for-extracting-named-entities-in-documents-using-amazon-comprehend/

After you label all the pages, you can find annotations in JSON format in the Amazon S3 location s3://comprehend-semi-structured-docs-<AWS Region>-<AWS Account number>/output/<your labeling job>/annotations/consolidated-annotation/consolidation-response/iteration-1/annotations/

You can find your output manifest file in the Amazon S3 location s3://comprehend-semi-structured-docs-<AWS Region>-<AWS Account number>/output/<your labeling job>/manifests/output/

  2. https://aws.amazon.com/pt/blogs/machine-learning/extract-custom-entities-from-documents-in-their-native-format-with-amazon-comprehend/

The attribute name would be of the form labeling-job-name-123, that is, the name of the labeling job.
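
If you are unsure which attribute name your labeling job wrote, one way to check is to look at a single line of output.manifest. The following is a minimal sketch, assuming the usual Ground Truth convention that each label attribute `<name>` appears next to a `<name>-metadata` key. The helper name `candidate_attribute_names` and the sample line are made up for illustration and are not real job output.

```python
import json
from typing import List


def candidate_attribute_names(manifest_line: str) -> List[str]:
    """Return keys that have a matching '<key>-metadata' sibling in the record."""
    record = json.loads(manifest_line)
    return sorted(
        key
        for key in record
        if not key.endswith("-metadata") and f"{key}-metadata" in record
    )


# Hypothetical manifest line, shaped like Ground Truth output (not real data).
sample_line = json.dumps({
    "source-ref": "s3://example-bucket/src/doc.pdf",
    "labeling-job-name-123": {
        "annotation-ref": "s3://example-bucket/annotations/doc.json"
    },
    "labeling-job-name-123-metadata": {
        "job-name": "labeling-job/labeling-job-name-123"
    },
})

print(candidate_attribute_names(sample_line))
```

To try it on your own job, download output.manifest and pass its first line to the function instead of `sample_line`.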

Comprehend Training Input Location Fields

SageMaker Ground Truth augmented manifest file S3 location

s3://comprehend-semi-structured-docs-<region>-<id>/output/resume-labeling-job-20221025T135656/manifests/output/output.manifest

S3 prefix for Annotation data files

s3://comprehend-semi-structured-docs-<region>-<id>/output/resume-labeling-job-20221025T135656/annotations

S3 prefix for Source documents

s3://comprehend-semi-structured-docs-<region>-<id>/src

Attribute names

The Ground Truth labeling job name (resume-labeling-job-20221025T135656 for the paths above).
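
For anyone starting the training job from code rather than the console, the same four fields map onto the `InputDataConfig` argument of the Comprehend `CreateEntityRecognizer` API. The sketch below only builds that dictionary; the bucket name, entity type and helper name `build_input_data_config` are placeholders I chose, and the actual boto3 call is left commented out because it needs credentials and an IAM role.

```python
from typing import Dict, List


def build_input_data_config(bucket: str, job_name: str, entity_types: List[str]) -> Dict:
    """Build InputDataConfig for a recognizer trained on PDF annotations."""
    base = f"s3://{bucket}/output/{job_name}"
    return {
        "DataFormat": "AUGMENTED_MANIFEST",
        "EntityTypes": [{"Type": t} for t in entity_types],
        "AugmentedManifests": [
            {
                # 1. Augmented manifest file S3 location
                "S3Uri": f"{base}/manifests/output/output.manifest",
                # 2. S3 prefix for annotation data files
                "AnnotationDataS3Uri": f"{base}/annotations",
                # 3. S3 prefix for source documents
                "SourceDocumentsS3Uri": f"s3://{bucket}/src",
                # 4. Attribute names: the Ground Truth labeling job name
                "AttributeNames": [job_name],
                "DocumentType": "SEMI_STRUCTURED_DOCUMENT",
                "Split": "TRAIN",
            }
        ],
    }


config = build_input_data_config(
    bucket="example-bucket",  # placeholder, use your own bucket
    job_name="resume-labeling-job-20221025T135656",
    entity_types=["SKILL"],  # hypothetical entity type
)

# Not run here: requires boto3, AWS credentials and a data access role.
# import boto3
# boto3.client("comprehend").create_entity_recognizer(
#     RecognizerName="my-recognizer",         # hypothetical name
#     DataAccessRoleArn="<your-role-arn>",    # role Comprehend can assume
#     LanguageCode="en",
#     InputDataConfig=config,
# )
```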
