Passing Ground Truth PDF labelling task output to Comprehend custom entity training


I want to parse PDFs and identify custom named entities. To do that, I set up Ground Truth to manually annotate a few PDF files, following this guide: https://docs.aws.amazon.com/comprehend/latest/dg/cer-annotation-pdf.html

It created an output folder, and inside it another folder named <job-id> that contains the manifests and annotations folders. When I tried to set up the Comprehend custom entity recognition job, it asked for several input locations, and I don't know the values for these fields. From the output in the S3 bucket, I came up with the values below, but I don't know what the attribute names should be.

If possible, can someone share a blog post or video tutorial on PDF annotation and on training a recognizer from PDF input?

  1. SageMaker Ground Truth augmented manifest file S3 location: s3://comprehend-semi-structured-docs-<region>-<id>/output/labeling-job-20221025T135656/manifests/output/output.manifest

  2. S3 prefix for Annotation data files: s3://comprehend-semi-structured-docs-<region>-<id>/output/labeling-job-20221025T135656/annotations

  3. S3 prefix for Source documents: s3://comprehend-semi-structured-docs-<region>-<id>/src

  4. Attribute names -- ?

1 Answer

Accepted Answer

Some of the answers are in these two blog posts:

  1. https://aws.amazon.com/blogs/machine-learning/custom-document-annotation-for-extracting-named-entities-in-documents-using-amazon-comprehend/

After you label all the pages, you can find annotations in JSON format in the Amazon S3 location s3://comprehend-semi-structured-docs-<AWS Region>-<AWS Account number>/output/<your labeling job>/annotations/consolidated-annotation/consolidation-response/iteration-1/annotations/

You can find your output manifest file in the Amazon S3 location s3://comprehend-semi-structured-docs-<AWS Region>-<AWS Account number>/output/<your labeling job>/manifests/output/

  2. https://aws.amazon.com/pt/blogs/machine-learning/extract-custom-entities-from-documents-in-their-native-format-with-amazon-comprehend/

The attribute name would be labeling-job-name-123.
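If you are unsure which attribute name your own job produced, one way to check is to look inside output.manifest. The sketch below is a heuristic, not an official method: it assumes each manifest line is a JSON object in which the label attribute appears as a key alongside a matching "<name>-metadata" key, which is how Ground Truth output manifests are usually laid out. The sample record and the example-bucket paths are made up for illustration.

```python
import json


def find_attribute_names(manifest_lines):
    """Return candidate label attribute names from augmented-manifest lines.

    Heuristic (an assumption): a key counts as a label attribute when the
    same record also contains "<key>-metadata".
    """
    names = []
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for key in record:
            if f"{key}-metadata" in record and key not in names:
                names.append(key)
    return names


# Hypothetical manifest line, shaped like a Ground Truth output record.
sample = [json.dumps({
    "source-ref": "s3://example-bucket/src/doc.pdf",
    "labeling-job-name-123": {
        "annotation-ref": "s3://example-bucket/annotations/doc.json"
    },
    "labeling-job-name-123-metadata": {
        "job-name": "labeling-job/labeling-job-name-123"
    },
})]

print(find_attribute_names(sample))
```

To run it against a real job, download output.manifest and pass the open file object (or its lines) to find_attribute_names.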

Comprehend Training Input Location Fields

SageMaker Ground Truth augmented manifest file S3 location

s3://comprehend-semi-structured-docs-<region>-<id>/output/resume-labeling-job-20221025T135656/manifests/output/output.manifest

S3 prefix for Annotation data files

s3://comprehend-semi-structured-docs-<region>-<id>/output/resume-labeling-job-20221025T135656/annotations

S3 prefix for Source documents

s3://comprehend-semi-structured-docs-<region>-<id>/src

Attribute names

This is the Ground Truth labeling job name.
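The same four fields can also be supplied through the API instead of the console. Here is a minimal sketch using boto3's create_entity_recognizer with an augmented manifest. The bucket name, the entity type SKILL, the recognizer name, and the role ARN are placeholders I made up, and the helper names build_input_data_config and start_training are hypothetical; the folder layout follows the locations listed above.

```python
def build_input_data_config(bucket, job_name, entity_types):
    """Map the four console fields onto Comprehend's InputDataConfig."""
    base = f"s3://{bucket}/output/{job_name}"
    return {
        "DataFormat": "AUGMENTED_MANIFEST",
        "EntityTypes": [{"Type": t} for t in entity_types],
        "AugmentedManifests": [{
            # SageMaker Ground Truth augmented manifest file S3 location
            "S3Uri": f"{base}/manifests/output/output.manifest",
            "Split": "TRAIN",
            # Attribute names: the Ground Truth labeling job name
            "AttributeNames": [job_name],
            # S3 prefix for Annotation data files
            "AnnotationDataS3Uri": f"{base}/annotations",
            # S3 prefix for Source documents
            "SourceDocumentsS3Uri": f"s3://{bucket}/src",
            "DocumentType": "SEMI_STRUCTURED_DOCUMENT",
        }],
    }


def start_training(config, recognizer_name, role_arn):
    """Submit the training job (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the config builder works without it

    client = boto3.client("comprehend")
    return client.create_entity_recognizer(
        RecognizerName=recognizer_name,
        DataAccessRoleArn=role_arn,
        InputDataConfig=config,
        LanguageCode="en",
    )


# Placeholder values; replace with your own bucket and entity types.
config = build_input_data_config(
    "example-bucket",
    "resume-labeling-job-20221025T135656",
    ["SKILL"],
)
print(config["AugmentedManifests"][0]["AttributeNames"])
```

Calling start_training(config, "my-recognizer", "<your IAM role ARN>") would then submit the job; the role must be able to read all three S3 locations.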

answered 2 years ago
