Passing Ground Truth PDF labelling task output to Comprehend custom entity training


I want to parse PDFs and identify custom named entities. To do that, I set up Ground Truth to manually annotate a few PDF files, following this guide: https://docs.aws.amazon.com/comprehend/latest/dg/cer-annotation-pdf.html

It created an output folder, and within it another folder named <job-id> that contains manifests and annotations folders. When I tried to set up the Comprehend custom entity recognition training job, it asked for a number of input locations, and I don't know the values for these fields. From the output in the S3 bucket, I came up with the values below, but I don't know what the attribute names should be.

If possible, can someone share a blog post or video tutorial on PDF annotation and on training a model from PDF input?

  1. SageMaker Ground Truth augmented manifest file S3 location: s3://comprehend-semi-structured-docs-<region>-<id>/output/labeling-job-20221025T135656/manifests/output/output.manifest

  2. S3 prefix for Annotation data files: s3://comprehend-semi-structured-docs-<region>-<id>/output/labeling-job-20221025T135656/annotations

  3. S3 prefix for Source documents: s3://comprehend-semi-structured-docs-<region>-<id>/src

  4. Attribute names: ?

asked 2 years ago, 391 views
1 Answer
Accepted Answer

Some of the answers are in these two blog posts:

  1. https://aws.amazon.com/blogs/machine-learning/custom-document-annotation-for-extracting-named-entities-in-documents-using-amazon-comprehend/

After you label all the pages, you can find annotations in JSON format in the Amazon S3 location s3://comprehend-semi-structured-docs-<AWS Region>-<AWS Account number>/output/<your labeling job>/annotations/consolidated-annotation/consolidation-response/iteration-1/annotations/

You can find your output manifest file in the Amazon S3 location s3://comprehend-semi-structured-docs-<AWS Region>-<AWS Account number>/output/<your labeling job>/manifests/output/

  2. https://aws.amazon.com/pt/blogs/machine-learning/extract-custom-entities-from-documents-in-their-native-format-with-amazon-comprehend/

The attribute name would be the labeling job name, for example labeling-job-name-123.
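One way to double-check the attribute name is to inspect a line of output.manifest. The following is a minimal sketch, assuming the usual Ground Truth convention that each label attribute key sits next to a matching `<key>-metadata` key on the same line. The helper name `candidate_attribute_names` and the sample line are hypothetical and are not taken from the blog posts.

```python
import json


def candidate_attribute_names(manifest_line: str) -> list[str]:
    """Return keys that have a sibling '<key>-metadata' key.

    Heuristic only: assumes Ground Truth wrote the label attribute next to a
    matching '<attribute>-metadata' entry on the same manifest line.
    """
    record = json.loads(manifest_line)
    return [key for key in record if f"{key}-metadata" in record]


# Hypothetical manifest line, shortened for illustration.
sample_line = json.dumps({
    "source-ref": "s3://example-bucket/src/doc.pdf",
    "labeling-job-name-123": {
        "annotation-ref": "s3://example-bucket/annotations/doc.json"
    },
    "labeling-job-name-123-metadata": {
        "job-name": "labeling-job/labeling-job-name-123"
    },
})

print(candidate_attribute_names(sample_line))
```

If the real manifest follows the same layout, the value printed for one of its lines is what goes into the Attribute names field.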

Comprehend Training Input Location Fields

SageMaker Ground Truth augmented manifest file S3 location

s3://comprehend-semi-structured-docs-<region>-<id>/output/resume-labeling-job-20221025T135656/manifests/output/output.manifest

S3 prefix for Annotation data files

s3://comprehend-semi-structured-docs-<region>-<id>/output/resume-labeling-job-20221025T135656/annotations

S3 prefix for Source documents

s3://comprehend-semi-structured-docs-<region>-<id>/src

Attribute names

The Ground Truth labeling job name.
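The same four fields can also be supplied programmatically. The sketch below assumes the folder layout shown above and the boto3 `create_entity_recognizer` call with an augmented manifest. The bucket name, entity types, recognizer name, and role ARN are placeholders, and the helper names `build_input_data_config` and `start_training` are hypothetical.

```python
def build_input_data_config(bucket, job_name, entity_types):
    """Map the four console fields onto Comprehend's InputDataConfig."""
    base = f"s3://{bucket}/output/{job_name}"
    return {
        "DataFormat": "AUGMENTED_MANIFEST",
        "EntityTypes": [{"Type": t} for t in entity_types],
        "AugmentedManifests": [{
            # 1. Augmented manifest file S3 location
            "S3Uri": f"{base}/manifests/output/output.manifest",
            "Split": "TRAIN",
            # 4. Attribute names: assumed to be the labeling job name
            "AttributeNames": [job_name],
            # 2. S3 prefix for annotation data files
            "AnnotationDataS3Uri": f"{base}/annotations",
            # 3. S3 prefix for source documents
            "SourceDocumentsS3Uri": f"s3://{bucket}/src",
            "DocumentType": "SEMI_STRUCTURED_DOCUMENT",
        }],
    }


def start_training(recognizer_name, role_arn, input_data_config):
    """Submit the training job; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without AWS access

    client = boto3.client("comprehend")
    return client.create_entity_recognizer(
        RecognizerName=recognizer_name,
        DataAccessRoleArn=role_arn,
        InputDataConfig=input_data_config,
        LanguageCode="en",
    )


# Placeholder values for illustration only.
config = build_input_data_config(
    bucket="example-bucket",
    job_name="resume-labeling-job-20221025T135656",
    entity_types=["EXAMPLE_ENTITY"],
)
```

Keeping the dictionary builder separate from the API call makes it possible to print and review the configuration before any training job is submitted.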

answered 2 years ago
