Dear Customer,
Thank you for reaching out to us. I understand that you followed our AWS Comprehend documentation for annotating PDFs in order to annotate your training PDFs in SageMaker Ground Truth. However, the tool shows a blank page in the UI, and you suspect the input manifest may be the cause, so you are looking for guidance on resolving this issue.
Thank you for providing these details.
To better assist you, please create a support ticket with AWS. The links below will help you create one:
- https://docs.aws.amazon.com/awssupport/latest/user/case-management.html
- https://console.aws.amazon.com/support/home#/case/create
While creating the support ticket, we kindly request that you provide the following information:
- Use-case description
- Ground Truth job ARN
- Screenshots of the issue you are facing
- Log files for the Ground Truth job (logs from your labeling jobs appear in Amazon CloudWatch under the /aws/sagemaker/LabelingJobs log group)
This information helps us understand your use case better, and if we need to dig deeper and inspect the job with our internal tools, the support ticket gives us the necessary visibility.
Rest assured, we will do everything in our abilities to assist you with this issue. Thanks.
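If it helps, the job ARN and CloudWatch log details requested above can be gathered programmatically. Below is a minimal sketch using boto3; the clients are passed in as parameters so the function can be tested, and the job name in the example is a placeholder:

```python
LOG_GROUP = "/aws/sagemaker/LabelingJobs"


def collect_support_info(sm, logs, job_name):
    """Return the labeling job ARN and recent CloudWatch log messages.

    `sm` and `logs` are boto3 clients for SageMaker and CloudWatch Logs.
    """
    # Job ARN for the support ticket
    job = sm.describe_labeling_job(LabelingJobName=job_name)
    arn = job["LabelingJobArn"]

    # Recent log events from the job's log streams
    messages = []
    streams = logs.describe_log_streams(
        logGroupName=LOG_GROUP, logStreamNamePrefix=job_name
    )
    for stream in streams["logStreams"]:
        events = logs.get_log_events(
            logGroupName=LOG_GROUP,
            logStreamName=stream["logStreamName"],
            limit=50,
        )
        messages.extend(e["message"] for e in events["events"])
    return arn, messages


if __name__ == "__main__":
    import boto3  # requires AWS credentials when actually run

    # "my-labeling-job" is a placeholder: use your own job name.
    arn, messages = collect_support_info(
        boto3.client("sagemaker"), boto3.client("logs"), "my-labeling-job"
    )
    print("Job ARN:", arn)
    print("\n".join(messages))
```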
I have found that when this tutorial is carried out on Windows, you get an error after creating the job: the labeling portal UI is blank. This happens because in the S3 bucket created here, under comprehend-semi-structured-docs-ui-template/ and then inside the folder for your job, the S3 object keys have a formatting issue. Instead of '/' as the folder separator, the keys use the Windows path separator '\', which makes the request fail. Renaming or rearranging your S3 folders accordingly solves the issue.
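As a sketch of that fix (the bucket name is a placeholder, and this assumes the only problem is backslash separators in the object keys), each affected object can be copied to a corrected key with boto3:

```python
def normalize_key(key: str) -> str:
    """Replace Windows-style '\\' separators with the '/' S3 expects."""
    return key.replace("\\", "/")


if __name__ == "__main__":
    import boto3  # only needed for the live fix-up

    s3 = boto3.client("s3")
    bucket = "your-bucket"  # placeholder: the bucket the tutorial created
    prefix = "comprehend-semi-structured-docs-ui-template/"

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            fixed = normalize_key(key)
            if fixed != key:
                # Copy to the corrected key, then delete the broken one.
                s3.copy_object(
                    Bucket=bucket,
                    CopySource={"Bucket": bucket, "Key": key},
                    Key=fixed,
                )
                s3.delete_object(Bucket=bucket, Key=key)
```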
I've done some further research into the matter as well as some experimentation.
{"source-ref": "s3://comprehend-semi-structured-docs-us-east-2-451256804668/src/Fiserv_10-K_2017.pdf", "page": "1", "metadata": {"pages": "82", "use-textract-only": false, "labels": ["Legal", "Contact", "Service", "Technology", "Risk", "Money", "VIP", "Ethic", "Health"]}, "annotator-metadata": {"info": "sample", "Due Date": "12/12/2023"}, "primary-annotation-ref": null, "secondary-annotation-ref": null}
This is an example of one line in my manifest.
I've been using JSONLint to verify the JSON structure of my document, and the initial manifest does not meet its criteria. It reads the first object, then throws an EOF error, as there is no comma and no top-level object containing the other objects.
Yet, as referenced here: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-input-data-input-manifest.html this is the proper way to construct the manifest document. I've attempted putting the objects into a list, as well as wrapping them in a top-level object called 'Sources'. Neither solution has worked so far.
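Worth noting: an input manifest is a JSON Lines file, where each line is a complete, standalone JSON object. A validator that expects the whole file to be one JSON document (like JSONLint) will therefore report an EOF error even on a correct manifest, and wrapping the objects in a list or a top-level object actually breaks the format. A minimal sketch that validates a manifest line by line instead (the 'source-ref' check mirrors the required key from the docs):

```python
import json


def validate_manifest(text: str) -> list:
    """Return a list of per-line error strings for a JSON Lines manifest."""
    errors = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: {exc}")
            continue
        if "source-ref" not in obj:
            errors.append(f"line {lineno}: missing 'source-ref' key")
    return errors


manifest = (
    '{"source-ref": "s3://my-bucket/doc.pdf", "page": "1"}\n'
    '{"source-ref": "s3://my-bucket/doc.pdf", "page": "2"}\n'
)
print(validate_manifest(manifest))  # → [] (a valid manifest yields no errors)
```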
I have done some more research after extensive testing.
It seems that there may be something wrong with my pre-processing Lambda. Investigating CloudWatch, I can see that my Poppler layer has read and execute access, but not write access.
This makes sense, as the Poppler layer backs pdfplumber, which interacts with the actual PDF pages. If the Poppler layer doesn't have write access, it would explain why the labeling jobs are submitted without any content.
My challenge now is to figure out how to change these permissions.
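One thing worth checking before trying to change permissions: in the Lambda runtime, layer contents are mounted read-only under /opt regardless of how the layer is packaged, and /tmp is the only writable path. So rather than granting the layer write access, any intermediate files (e.g. rendered page images) should be written to /tmp. A small sketch (the filename is a placeholder):

```python
import os
import tempfile


def writable_scratch_path(filename: str) -> str:
    """Return a path for intermediate files that is writable inside Lambda."""
    # tempfile honours TMPDIR; in the Lambda runtime this resolves to /tmp.
    scratch = os.environ.get("TMPDIR", tempfile.gettempdir())
    return os.path.join(scratch, filename)


# e.g. direct a PDF-to-image conversion's output to scratch space
path = writable_scratch_path("page-1.png")
with open(path, "wb") as fh:
    fh.write(b"placeholder bytes, not a real image")
```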
In addition, I've found that my JSON file is read correctly when I submit an API call to SageMaker for a labeling job. When I submit a job through the console, however, the initial JSON parse always fails. I've tested this with other JSON files formatted in several different ways (according to the JSON Lines and AWS examples) and they never seem to work.
I am unsure whether this is a bug or something wrong with my setup, but it functions as intended when submitted through the CLI as an API request.
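For anyone hitting the same console behaviour, the working SDK path is SageMaker's CreateLabelingJob API. Below is a hedged sketch of the request; every ARN, bucket name, and job name is a placeholder you must replace with your own, and the exact Lambda ARNs depend on what the Comprehend semi-structured setup created in your account:

```python
# All values below are illustrative placeholders; substitute your own.
params = {
    "LabelingJobName": "my-pdf-annotation-job",
    "LabelAttributeName": "my-pdf-annotation-job-ref",
    "RoleArn": "arn:aws:iam::111111111111:role/MyGroundTruthRole",
    "InputConfig": {
        "DataSource": {
            "S3DataSource": {
                "ManifestS3Uri": "s3://my-bucket/input/input.manifest"
            }
        }
    },
    "OutputConfig": {"S3OutputPath": "s3://my-bucket/output/"},
    "HumanTaskConfig": {
        "WorkteamArn": "arn:aws:sagemaker:us-east-2:111111111111:workteam/private-crowd/my-team",
        "UiConfig": {
            "UiTemplateS3Uri": "s3://my-bucket/comprehend-semi-structured-docs-ui-template/template.liquid"
        },
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-2:111111111111:function:my-pre-labeling-lambda",
        "TaskTitle": "Annotate PDF entities",
        "TaskDescription": "Label entities in semi-structured PDFs",
        "NumberOfHumanWorkersPerDataObject": 1,
        "TaskTimeLimitInSeconds": 3600,
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-2:111111111111:function:my-consolidation-lambda"
        },
    },
}

if __name__ == "__main__":
    import boto3  # the live call requires AWS credentials

    sagemaker = boto3.client("sagemaker")
    response = sagemaker.create_labeling_job(**params)
    print(response["LabelingJobArn"])
```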