Input Manifest Errors in SageMaker Ground Truth for Custom Labeling Job


I am attempting to create a native PDF annotation labeling job for use with Comprehend to identify entities within similar documents. I have around 20 PDF files, each around 100-300 pages long.

I used the tools and followed the directions from this blog post. I've struggled a little with the tools but ultimately got everything working.

My problem comes from the labeling job itself. When I open the labeling job in a private workforce that I've created, I find only a blank page. I did some research and found that something appears to be wrong with the input manifest, as AWS doesn't seem to be able to parse it.

I checked my manifest and found that it was generated as multiple objects. Each object was a single page from my PDFs. This seems normal, however the objects were not put into a list or 'top level' object, which does not fit JSON Lines guidelines. I attempted a quick fix of placing these objects all within a list (which satisfies JSON Lines) but it does not seem to help.

Any suggestions or advice would be greatly appreciated.

  • I've done some further research into the matter as well as some experimentation.

    {"source-ref": "s3://comprehend-semi-structured-docs-us-east-2-451256804668/src/Fiserv_10-K_2017.pdf", "page": "1", "metadata": {"pages": "82", "use-textract-only": false, "labels": ["Legal", "Contact", "Service", "Technology", "Risk", "Money", "VIP", "Ethic", "Health"]}, "annotator-metadata": {"info": "sample", "Due Date": "12/12/2023"}, "primary-annotation-ref": null, "secondary-annotation-ref": null}

    This is an example of one line in my manifest.

  • I've been using JSONLint to verify the JSON structure of my document, and the initial manifest does not meet the criteria. It reads the first object, then throws an EOF error, as there is no comma or top-level object to contain all the other objects.

    Yet, as referenced here, this is the proper way to construct the manifest document. I've attempted creating a list for the objects to reside in, as well as a top-level object called 'Sources' to hold all of the other objects. Neither solution has worked so far.

  • I have done some more research after some extensive testing.

    It seems that there may be something wrong with my pre-processing Lambda. When investigating CloudWatch, I see that my Poppler layer has read and execute access, but not write access.

    This makes sense, as the Poppler layer is responsible for PDFPlumber, which is responsible for interacting with the actual PDF page. If the Poppler layer doesn't have write access, it would make sense that the labeling jobs would be submitted without any content.

    My challenge now is to figure out how to change these permissions.

  • In addition to this, I've found that my JSON file is being 'read' when I submit an API call to SageMaker for a labeling job. It seems that when I submit a job through the console, the initial JSON parse always fails. I've tested this with other JSON files formatted in several different ways (according to JSON Lines and AWS examples) and they never seem to work.

    I am unsure if this is a bug or if there is something wrong with my setup, but it seems to function as intended when submitted through the CLI as an API-Request.
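
The manifest questions raised in the comments above can be checked locally before submitting a job. What follows is a minimal sketch, not an official AWS tool: it assumes the manifest is meant to hold one standalone JSON object per line, as in the example line quoted above, and that each object carries an S3 "source-ref". The function names are hypothetical, and skipping blank lines is a choice made for this sketch rather than documented Ground Truth behavior.

```python
import json


def validate_manifest_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # assumption for this sketch: blank lines are ignored
        try:
            record = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            # e.g. the whole manifest was wrapped in a list
            problems.append((number, "line is not a JSON object"))
            continue
        source = record.get("source-ref")
        if not isinstance(source, str) or not source.startswith("s3://"):
            problems.append((number, "missing or non-S3 'source-ref'"))
    return problems


def validate_manifest_file(path):
    """Validate a manifest file on disk, one line at a time."""
    with open(path, encoding="utf-8") as handle:
        return validate_manifest_lines(handle)
```

A general-purpose JSON validator such as JSONLint parses the whole file as a single JSON document, so it reports an error after the first object. Checking each line on its own, as above, matches the one-object-per-line layout, and it flags a manifest that has been wrapped in a list or in a top-level object.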

1 Answer

Dear Customer,

Thank you so much for reaching out to us. I understand that you followed our Amazon Comprehend documentation for annotating PDFs in order to annotate your training PDFs in SageMaker Ground Truth. However, the tool showed a blank page in the UI, and you suspect the input manifest may be the cause. You are therefore looking for guidance in resolving this issue.

Thank you so much for providing the details.

To assist you further with this issue, could you please create a support ticket with AWS? The link below will help you create the support ticket.


While creating the support ticket, we kindly request that you provide the following information:

Use-case description.
Ground Truth job ARN details.
Screenshots of the issue you are facing.
Log files for the Ground Truth job (the logs from your labeling jobs appear in Amazon CloudWatch under the /aws/sagemaker/LabelingJobs group).

We ask for this because it helps us understand your use case better. If we need to dive deeper and access the job with our internal tools, the support ticket also gives us more visibility.

Rest assured, we will do everything we can to assist you with this issue.


answered a month ago
