Custom entity recognition output is incorrect


This is a sample PDF of my data, and all the PDFs have the same format: [image: sample PDF]. As can be seen, there is an issue date and an effective date. I first run Textract on the PDF and then annotate the data through the Amazon SageMaker platform. The text output from Textract is: [image: Textract output]. Here each line is considered an object, so I labeled the first date as publishdate and the second date as effectivedate and trained my Comprehend custom entity recognizer (CER) model.

After passing test text of a similar format into the Comprehend model, the output I got had both dates marked as effectivedate, like this: [image: Comprehend output]. But I want the first date to be publishdate and the second date to be effectivedate. I made sure to clean the data and verified that all the labeling was done correctly. How should I go about solving this problem?

My organization's architecture is that PDFs will come to Textract, and the extracted text will then be passed to Comprehend for entity recognition.
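For reference, our current pipeline is roughly the sketch below, assuming a real-time endpoint for the custom recognizer (bucket names, file name, and the endpoint ARN are placeholders):

```python
import boto3

textract = boto3.client("textract")
comprehend = boto3.client("comprehend")

# Synchronous OCR works for single-page documents; multi-page PDFs need
# the asynchronous StartDocumentTextDetection API instead.
ocr = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "my-bucket", "Name": "sample.pdf"}}
)
text = "\n".join(
    block["Text"] for block in ocr["Blocks"] if block["BlockType"] == "LINE"
)

# EndpointArn points at a real-time endpoint for the trained CER model.
entities = comprehend.detect_entities(
    Text=text,
    EndpointArn="arn:aws:comprehend:us-east-1:111122223333:entity-recognizer-endpoint/my-cer",
)
for ent in entities["Entities"]:
    print(ent["Type"], ent["Text"], ent["Score"])
```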

1 Answer

I think what you're looking for is layout-aware entity recognition, considering the actual structure of the document instead of extracting it to flat text first.

Comprehend can do this (since 2021) if you train your model using document annotations instead of plain-text annotations - as discussed in this two-part blog series: 1. annotate and 2. train. Using this approach, you can define separate entity types for your 'issue date' and 'effective date' and then train a model to extract each one by example. You'll need to annotate at least 250 documents to get started though.
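Once the annotation job has run, starting training via boto3 looks roughly like the sketch below. All S3 URIs, the role ARN, and the names are placeholders you'd swap for your own Ground Truth job outputs:

```python
import boto3

comprehend = boto3.client("comprehend")

response = comprehend.create_entity_recognizer(
    RecognizerName="date-extractor",
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::111122223333:role/ComprehendDataAccessRole",
    InputDataConfig={
        "DataFormat": "AUGMENTED_MANIFEST",
        "EntityTypes": [
            {"Type": "PUBLISHDATE"},
            {"Type": "EFFECTIVEDATE"},
        ],
        "AugmentedManifests": [
            {
                # output.manifest produced by the Ground Truth labeling job;
                # AttributeNames must match your labeling job's attribute name
                "S3Uri": "s3://my-bucket/groundtruth/output.manifest",
                "AttributeNames": ["my-labeling-job"],
                # SEMI_STRUCTURED_DOCUMENT is what makes training layout-aware
                "DocumentType": "SEMI_STRUCTURED_DOCUMENT",
                "SourceDocumentsS3Uri": "s3://my-bucket/source-pdfs/",
                "AnnotationDataS3Uri": "s3://my-bucket/groundtruth/annotations/",
            }
        ],
    },
)
print(response["EntityRecognizerArn"])
```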

However, if your documents are quite clearly formatted like the example you shared, you might find it easier to tackle this with Amazon Textract Queries instead. With Queries you don't need to train anything: just provide natural-language questions when you submit your document, e.g. 'What is the issue date?' and 'What is the effective date?', and the model will try to extract the relevant answers for you. You can try out Queries quickly and easily from the Amazon Textract console to get an idea of whether it can perform well on the kinds of extraction you want to do.
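As a rough sketch (the bucket, file name, and query wording are illustrative only), a Queries call via boto3 looks like this. Note the synchronous AnalyzeDocument API handles single-page documents; multi-page PDFs go through the asynchronous StartDocumentAnalysis API:

```python
import boto3

textract = boto3.client("textract")

response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-bucket", "Name": "sample.pdf"}},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        "Queries": [
            {"Text": "What is the issue date?", "Alias": "issue_date"},
            {"Text": "What is the effective date?", "Alias": "effective_date"},
        ]
    },
)

# Each QUERY block links to a QUERY_RESULT block holding the extracted answer.
for block in response["Blocks"]:
    if block["BlockType"] == "QUERY_RESULT":
        print(block["Text"], block["Confidence"])
```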

Queries performs best when your question clearly correlates to the phrasing of the document, so if you have documents in many different formats it might be hard to come up with queries that perform well across all of them. That's the point where entity recognition might perform better, because you can annotate the docs and fine-tune a model yourself to reliably pick up the fields across the whole set.

Alex_T (AWS EXPERT), answered a year ago
  • Thank you for your response. My initial approach was to follow the two-part blog series you mentioned, but I am stuck on a Ground Truth labeling job creation issue similar to https://repost.aws/questions/QUn7gIM_MkSHmd9IzuV4_pmw/input-manifest-errors-in-sagemaker-ground-truth-for-custom-labeling-job. This was the most basic PDF; there are more complicated ones, and for each I will be making a separate CER model. I am thinking that instead of Textract I will use the PyPDF2 Python library for text extraction, which gives 'issue date: dd/mm/yyyy' (and the same for the effective date) as a single line. Will that work? A minimal sketch of what I mean is below.
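(In this sketch the file name is a placeholder and the regex just assumes the 'issue date: dd/mm/yyyy' line format:)

```python
import re
from PyPDF2 import PdfReader

# File name is a placeholder; assumes the PDF has a digital text layer,
# since PyPDF2 cannot OCR scanned images.
reader = PdfReader("sample.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Assumed line formats: 'issue date: dd/mm/yyyy', 'effective date: dd/mm/yyyy'
patterns = {
    "publishdate": r"issue date:\s*(\d{2}/\d{2}/\d{4})",
    "effectivedate": r"effective date:\s*(\d{2}/\d{2}/\d{4})",
}
for label, pattern in patterns.items():
    match = re.search(pattern, text, re.IGNORECASE)
    print(label, match.group(1) if match else None)
```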

  • Hmm, it's difficult to diagnose what's up with the labeling job without example entries from the manifest (which I guess may also be why that other question wasn't answered?). Yes, you could go with text-only models or non-OCR methods as you say, but the main drawback is that they'd be more fragile: e.g. PyPDF2 relies on digital text, so it wouldn't work with scans, and rule-based extraction logic would fail if you need to process docs in slightly different formats. For an alternative ML-based method, you could also check out https://github.com/aws-samples/amazon-textract-transformer-pipeline
