Unable to Annotate PDF files to train Comprehend model

0

I have spent the last couple of days trying to get Amazon's semi-structured annotation tool (found here: https://github.com/aws-samples/amazon-comprehend-semi-structured-documents-annotation-tools) to work, without success, because of repeated dependency errors. This seems to be the only way to annotate PDF files that Amazon suggests. Are there other methods or programs that can do this? In general, Amazon's documentation and guidelines for data annotation seem poor: they refer repeatedly to the ability to annotate PDFs in SageMaker Ground Truth, but there is no way to actually do this without first getting the tool above to work. Have I missed any other documentation? Thanks in advance for any help.

asked 8 months ago, 205 views
2 Answers
1

Hello John,

I understand your frustration with the challenges you're facing while trying to set up Amazon's semi-structured annotation tool. Annotating PDF files for training Amazon Comprehend Custom Entity Recognition models can be a complex process, and alternative solutions can be explored. Here are a few alternative methods to consider:

  1. Amazon SageMaker Ground Truth:

    While you mentioned issues with the annotation tool, Amazon SageMaker Ground Truth is a service specifically designed for data labeling and annotation. It can handle various types of data, including text data for custom entity recognition. You can create custom labeling jobs to annotate your PDF files. If you encountered difficulties with the tool you mentioned, you might find SageMaker Ground Truth more user-friendly and better documented.

  2. Custom Annotation Tools:

    If you have a team of annotators and a set of PDFs to annotate, you can consider developing your own annotation tool or using a third-party tool that provides PDF annotation capabilities. Tools like Labelbox, Prodigy, or Doccano offer customizable solutions for text annotation, including PDFs.

  3. Data Preprocessing:

    Before using Amazon Comprehend for training custom entity recognition models, you may need to preprocess your PDFs. Convert them into plain text or structured formats (e.g., JSON, CSV) and then annotate the structured data. Tools like PDFMiner can help extract text and structure from PDF files.

  4. Third-Party Services:

    Some third-party services specialize in data annotation and can help with annotating your PDF files. They often have annotation platforms that allow you to upload and annotate data, including text in PDFs. Examples include Appen and Scale.

  5. Consulting AWS Support:

    If you're determined to use the Amazon Comprehend semi-structured annotation tool or are facing specific issues, consider reaching out to AWS Support for assistance. They can provide guidance and help troubleshoot the dependency errors you encountered.
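To make option 3 more concrete, here is a minimal sketch of the plain-text route: extract the text, then record where each entity occurs as line and character offsets. It rests on several assumptions. `extract_pdf_text` relies on the `pdfminer.six` package being installed, the helper names and the sample text are hypothetical, and the CSV columns (`File`, `Line`, `Begin Offset`, `End Offset`, `Type`) reflect my understanding of Comprehend's plain-text annotation layout, so check the column names and the end-offset convention against the current Comprehend documentation before training. Note that this route yields text offsets only, not page coordinates.

```python
import csv
import io
import re


def extract_pdf_text(path):
    """Extract plain text from a PDF (assumes pdfminer.six is installed)."""
    # Imported lazily so the rest of the sketch runs without the package.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def find_annotations(text, entity_terms, file_name):
    """Return (file, line, begin, end, type) rows for exact term matches.

    Line numbers and offsets are zero-based; the end offset is exclusive
    (an assumption to verify against the Comprehend documentation).
    """
    rows = []
    for line_no, line in enumerate(text.splitlines()):
        for term, entity_type in entity_terms.items():
            for match in re.finditer(re.escape(term), line):
                rows.append(
                    (file_name, line_no, match.start(), match.end(), entity_type)
                )
    return rows


def to_csv(rows):
    """Serialize annotation rows to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["File", "Line", "Begin Offset", "End Offset", "Type"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample standing in for text extracted from a PDF.
sample_text = "Invoice issued by Acme Corp\nPayment due to Globex Ltd"
terms = {"Acme Corp": "VENDOR", "Globex Ltd": "VENDOR"}
rows = find_annotations(sample_text, terms, "documents.txt")
print(to_csv(rows))
```

Exact string matching is only a starting point for bootstrapping annotations from a known list of terms; real labels would still need human review.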

When exploring these alternatives, consider your specific requirements, budget, and the volume of data you need to annotate. Custom solutions may require more development effort but can be tailored to your exact needs.

Finally, keep an eye on updates and improvements to AWS services and documentation, as AWS continuously improves its offerings and may provide better annotation tools or guides in the future.

Please give a thumbs up if my suggestion helps

answered 8 months ago
  • Thanks for the response. As I understand it, correctly annotating PDF files requires the location of each entity as well as the entity itself. Ground Truth doesn't provide this option; it only finds entity pairs in the text extracted from the PDF. Have I understood this correctly, or do you only need the text to train Comprehend on PDFs? If the latter, then extracting the text for training isn't a problem.

0

Is the tool being set up on a Linux or Windows system? If you are still having problems setting up the tool, please open a support ticket.

AWS
answered 12 days ago
