I think what you're looking for is layout-aware entity recognition, i.e. models that consider the actual structure of the document instead of first extracting it to flat text.
Comprehend can do this (since 2021) if you train your model using document annotations instead of plain-text annotations - as discussed in this two-part blog series: 1. annotate and 2. train. Using this approach, you can define separate entity types for your 'issue date' and 'effective date' and then train a model to extract each one by example. You'll need to annotate at least 250 documents to get started though.
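Once the annotation job has produced its output, kicking off training looks roughly like the sketch below. It uses the real `CreateEntityRecognizer` API via boto3, but the bucket paths, role ARN, recognizer name, and entity type names are placeholders you'd substitute with your own:

```python
# Sketch of starting a Comprehend custom entity recognizer training job
# using (layout-aware) document annotations. All names/paths are examples.

def build_input_data_config(bucket: str) -> dict:
    """Build the InputDataConfig for a semi-structured (PDF) training job."""
    return {
        # One separate entity type per field you want to extract:
        "EntityTypes": [{"Type": "ISSUE_DATE"}, {"Type": "EFFECTIVE_DATE"}],
        # Raw documents, one document per file:
        "Documents": {
            "S3Uri": f"s3://{bucket}/docs/",
            "InputFormat": "ONE_DOC_PER_FILE",
        },
        # Annotation files produced by the SageMaker Ground Truth job:
        "Annotations": {"S3Uri": f"s3://{bucket}/annotations/"},
    }


def start_training(bucket: str, role_arn: str) -> str:
    """Submit the training job and return the recognizer ARN."""
    import boto3  # imported here so the config builder stays dependency-free

    comprehend = boto3.client("comprehend")
    resp = comprehend.create_entity_recognizer(
        RecognizerName="date-fields-recognizer",  # placeholder name
        LanguageCode="en",
        DataAccessRoleArn=role_arn,
        InputDataConfig=build_input_data_config(bucket),
    )
    return resp["EntityRecognizerArn"]
```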
However, if your documents are quite clearly formatted like the example you shared, you might find it easier to tackle this with Amazon Textract Queries instead. With Queries you don't need to train anything; you just provide natural-language questions when you submit your document, e.g. "What is the issue date?" and "What is the effective date?", and the model will try to extract the relevant answers for you. You can try out Queries quickly and easily from the Amazon Textract console to get an idea of whether it can perform well on the kinds of extraction you want to do.
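From code, submitting queries and pairing each one with its answer could look like the sketch below. The `AnalyzeDocument` call with `FeatureTypes=["QUERIES"]` and the `QUERY`/`QUERY_RESULT` block types are the real Textract API; the file path and query texts are just illustrative:

```python
# Sketch: run Textract Queries on a single-page document and map each
# question to the answer text Textract found.

def parse_query_answers(blocks: list) -> dict:
    """Pair QUERY blocks with their QUERY_RESULT via ANSWER relationships."""
    results = {b["Id"]: b["Text"] for b in blocks if b["BlockType"] == "QUERY_RESULT"}
    answers = {}
    for b in blocks:
        if b["BlockType"] != "QUERY":
            continue
        answer = None
        for rel in b.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                answer = results.get(rel["Ids"][0])
        answers[b["Query"]["Text"]] = answer
    return answers


def run_queries(path: str) -> dict:
    import boto3

    textract = boto3.client("textract")
    with open(path, "rb") as f:
        resp = textract.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["QUERIES"],
            QueriesConfig={"Queries": [
                {"Text": "What is the issue date?"},
                {"Text": "What is the effective date?"},
            ]},
        )
    return parse_query_answers(resp["Blocks"])
```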
Queries performs best when your question clearly correlates with the phrasing of the document, so if you have documents in many different formats it might be hard to come up with queries that perform well across all of them. That's where entity recognition might do better, because you can annotate the docs and fine-tune a model yourself to reliably pick up the fields across the whole set.
Thank you for your response. My initial approach followed the two-part blog series you mentioned, but I am stuck on a Ground Truth labeling job creation issue similar to https://repost.aws/questions/QUn7gIM_MkSHmd9IzuV4_pmw/input-manifest-errors-in-sagemaker-ground-truth-for-custom-labeling-job . This was the most basic PDF, and there are more complicated ones, for each of which I will be making a separate CER model. I am thinking that instead of using Textract I will use the PyPDF2 Python library for text extraction, which gives 'issue date: dd/mm/yyyy' (and the same for the effective date) as a single line. Will that work?
Hmm, it's difficult to diagnose what's wrong with the labelling job without example entries from the manifest (which I guess may also be why that other question wasn't answered). Yes, you could go with text-only models or non-OCR methods as you say, but the main drawback is that they'd be more fragile: for example, PyPDF2 relies on digital text so it wouldn't work with scans, and rule-based extraction logic would fail if you need to process docs in slightly different formats. For an alternative ML-based method you could also check out: https://github.com/aws-samples/amazon-textract-transformer-pipeline
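For completeness, the rule-based alternative could be sketched as below: extract the text layer with PyPDF2, then pick up the `issue date: dd/mm/yyyy` lines with a regex. The caveats above apply: this only works on PDFs with an embedded text layer (not scans), and the pattern is specific to this one document format:

```python
import re

# Hypothetical pattern matching lines like "issue date: 01/02/2023"
# and "effective date: 15/03/2023". Format-specific by design.
DATE_FIELD = re.compile(
    r"(issue|effective)\s+date\s*:\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE
)


def extract_dates(text: str) -> dict:
    """Return e.g. {'issue': '01/02/2023', 'effective': '15/03/2023'}."""
    return {field.lower(): value for field, value in DATE_FIELD.findall(text)}


def extract_dates_from_pdf(path: str) -> dict:
    from PyPDF2 import PdfReader  # requires a digital (non-scanned) PDF

    text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    return extract_dates(text)
```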