Passing Ground Truth PDF labeling task output to Comprehend custom entity training

I want to parse PDFs and identify custom named entities. To do that, I set up Ground Truth to manually annotate a few PDF files, following this guide: https://docs.aws.amazon.com/comprehend/latest/dg/cer-annotation-pdf.html

The labeling job created an output folder, and inside it another folder named <job-id> that contains manifests and annotations folders. When I tried to set up the Comprehend custom entity recognition training job, it asked for several input locations, and I don't know what values these fields expect. Based on the output in the S3 bucket, I came up with the values below, but I don't know what to enter for attribute names.

If possible, can someone share a blog post or video tutorial on PDF annotation and on training a recognizer from PDF input?

  1. SageMaker Ground Truth augmented manifest file S3 location: s3://comprehend-semi-structured-docs-<region>-<id>/output/labeling-job-20221025T135656/manifests/output/output.manifest

  2. S3 prefix for Annotation data files: s3://comprehend-semi-structured-docs-<region>-<id>/output/labeling-job-20221025T135656/annotations

  3. S3 prefix for Source documents: s3://comprehend-semi-structured-docs-<region>-<id>/src

  4. Attribute names: ?

asked 2 years ago, 391 views
1 Answer
Accepted Answer

I found most of the answers in these two blog posts:

  1. https://aws.amazon.com/blogs/machine-learning/custom-document-annotation-for-extracting-named-entities-in-documents-using-amazon-comprehend/

After you label all the pages, you can find annotations in JSON format in the Amazon S3 location s3://comprehend-semi-structured-docs-<AWS Region>-<AWS Account number>/output/<your labeling job>/annotations/consolidated-annotation/consolidation-response/iteration-1/annotations/

You can find your output manifest file in the Amazon S3 location s3://comprehend-semi-structured-docs-<AWS Region>-<AWS Account number>/output/<your labeling job>/manifests/output/

  2. https://aws.amazon.com/pt/blogs/machine-learning/extract-custom-entities-from-documents-in-their-native-format-with-amazon-comprehend/

In this post, the attribute name is the labeling job name, for example labeling-job-name-123.
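If you would rather confirm the attribute name from the output manifest than guess it, a small script can help. The sketch below assumes the usual Ground Truth convention that the label attribute appears in each manifest record next to a sibling key with the same name plus "-metadata". The function name candidate_attribute_names is made up for illustration, and it expects the lines of output.manifest that you have already downloaded.

```python
import json


def candidate_attribute_names(manifest_lines):
    """Return keys that have a '<key>-metadata' sibling in every record.

    Assumption: the manifest is JSON Lines and the label attribute is
    stored next to a '<attribute>-metadata' key in each record.
    """
    candidates = None
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        keys = {k for k in record if f"{k}-metadata" in record}
        # Keep only the names that appear in every record seen so far.
        candidates = keys if candidates is None else candidates & keys
    return sorted(candidates or [])


# Example usage with a local copy of the manifest (path is hypothetical):
# with open("output.manifest") as f:
#     print(candidate_attribute_names(f))
```

If this prints a single name that matches your labeling job name, that is the value to enter under Attribute names.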

Comprehend Training Input Location Fields

SageMaker Ground Truth augmented manifest file S3 location

s3://comprehend-semi-structured-docs-<region>-<id>/output/resume-labeling-job-20221025T135656/manifests/output/output.manifest

S3 prefix for Annotation data files

s3://comprehend-semi-structured-docs-<region>-<id>/output/resume-labeling-job-20221025T135656/annotations

S3 prefix for Source documents

s3://comprehend-semi-structured-docs-<region>-<id>/src

Attribute names

This is the Ground Truth labeling job name. For the paths above, that would be resume-labeling-job-20221025T135656.
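The same four fields can also be supplied through the API instead of the console. The following is a minimal sketch, assuming boto3 and the create_entity_recognizer call with an augmented manifest. The helper names, the bucket name, the entity types, the recognizer name, the role ARN, and the "en" language code are placeholders I made up, not values from the posts above.

```python
def build_input_data_config(bucket, job_name, entity_types):
    """Build InputDataConfig from the S3 layout described above.

    Assumption: the labeling job wrote to s3://<bucket>/output/<job_name>/
    and the source PDFs live under s3://<bucket>/src.
    """
    base = f"s3://{bucket}/output/{job_name}"
    return {
        "DataFormat": "AUGMENTED_MANIFEST",
        "EntityTypes": [{"Type": t} for t in entity_types],
        "AugmentedManifests": [
            {
                "S3Uri": f"{base}/manifests/output/output.manifest",
                "Split": "TRAIN",
                # Attribute name = Ground Truth labeling job name.
                "AttributeNames": [job_name],
                "AnnotationDataS3Uri": f"{base}/annotations",
                "SourceDocumentsS3Uri": f"s3://{bucket}/src",
                "DocumentType": "SEMI_STRUCTURED_DOCUMENT",
            }
        ],
    }


def start_training(recognizer_name, role_arn, input_data_config, region):
    """Submit the training job; needs boto3 and valid AWS credentials."""
    import boto3  # imported here so the config helper works without it

    client = boto3.client("comprehend", region_name=region)
    response = client.create_entity_recognizer(
        RecognizerName=recognizer_name,
        DataAccessRoleArn=role_arn,
        InputDataConfig=input_data_config,
        LanguageCode="en",  # assumed; use your documents' language
    )
    return response["EntityRecognizerArn"]
```

Keeping the config builder separate from the API call lets you print and check the S3 paths before anything is submitted.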

answered 2 years ago
