Custom entity recognition output is incorrect

This is a sample PDF of my data, and all the PDFs have the same format: [image: sample PDF]

I am first running Textract on this, and then annotating the data through the Amazon SageMaker platform. As can be seen, there is an issue date and an effective date. The text output from Textract is: [image: Textract output]

Here each line is treated as a separate object, so I labeled the first date as publishdate and the second date as effectivedate and trained my Comprehend custom entity recognizer (CER) model. After passing test text of a similar format into the Comprehend model, the output had both dates marked as effectivedate, like this: [image: Comprehend output]

But I want the first date to be publishdate and the second date to be effectivedate. I made sure to clean the data and verified that all the labeling was done correctly. How should I go about solving this problem?

My organization's architecture is that PDFs come into Textract, and the extracted text is then passed to Comprehend for entity recognition.
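
For reference, a minimal boto3 sketch of this pipeline (the bucket, key, and endpoint ARN are placeholders; a multi-page PDF would need the asynchronous Textract API instead):

```python
import boto3

textract = boto3.client("textract")
comprehend = boto3.client("comprehend")

# OCR a single-page document in S3 (multi-page PDFs need the async
# StartDocumentTextDetection API instead of this synchronous call).
ocr = textract.detect_document_text(
    Document={"S3Object": {"Bucket": "my-bucket", "Name": "sample.pdf"}}
)
text = "\n".join(b["Text"] for b in ocr["Blocks"] if b["BlockType"] == "LINE")

# Run the trained CER model via a real-time Comprehend endpoint.
entities = comprehend.detect_entities(
    Text=text,
    EndpointArn="arn:aws:comprehend:us-east-1:111122223333:entity-recognizer-endpoint/my-endpoint",
)
for e in entities["Entities"]:
    print(e["Type"], e["Text"], e["Score"])
```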

1 Answer

I think what you're looking for is layout-aware entity recognition, considering the actual structure of the document instead of extracting it to flat text first.

Comprehend can do this (since 2021) if you train your model using document annotations instead of plain-text annotations - as discussed in this two-part blog series: 1. annotate and 2. train. Using this approach, you can define separate entity types for your 'issue date' and 'effective date' and then train a model to extract each one by example. You'll need to annotate at least 250 documents to get started though.
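
For reference, a minimal boto3 sketch of starting such a training job once your document-annotation labeling job has produced its augmented manifest (the bucket paths, role ARN, job name, and recognizer name below are placeholders; the entity types match your labels):

```python
import boto3

comprehend = boto3.client("comprehend")

# Train a custom entity recognizer on SageMaker Ground Truth *document*
# annotations, so the model sees layout rather than flat text.
comprehend.create_entity_recognizer(
    RecognizerName="date-extractor",  # placeholder
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::111122223333:role/ComprehendDataAccess",  # placeholder
    InputDataConfig={
        "DataFormat": "AUGMENTED_MANIFEST",
        "EntityTypes": [
            {"Type": "publishdate"},
            {"Type": "effectivedate"},
        ],
        "AugmentedManifests": [
            {
                "S3Uri": "s3://my-bucket/labeling-job/output/output.manifest",  # placeholder
                "AttributeNames": ["my-labeling-job"],  # the Ground Truth job name
                "AnnotationDataS3Uri": "s3://my-bucket/labeling-job/annotations/",  # placeholder
                "SourceDocumentsS3Uri": "s3://my-bucket/source-pdfs/",  # placeholder
                "DocumentType": "SEMI_STRUCTURED_DOCUMENT",  # document annotations, not plain text
            }
        ],
    },
)
```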

However, if your documents are as clearly formatted as the example you shared, you might find it easier to tackle this with Amazon Textract Queries instead. With Queries you don't need to train anything; you just provide natural-language questions when you submit your document, e.g. "What is the issue date?" and "What is the effective date?", and the model will try to extract the relevant answers for you. You can try out Queries quickly and easily from the Amazon Textract Console to get an idea of whether it can perform well on the kinds of extraction you want to do.
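
As a rough sketch of what that looks like through the API (the document location and aliases are assumptions; synchronous AnalyzeDocument handles single-page PDFs, multi-page ones need the async variant):

```python
import boto3

textract = boto3.client("textract")

# Ask Textract Queries for the two dates directly -- no model training needed.
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-bucket", "Name": "sample.pdf"}},  # placeholder
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        "Queries": [
            {"Text": "What is the issue date?", "Alias": "ISSUE_DATE"},
            {"Text": "What is the effective date?", "Alias": "EFFECTIVE_DATE"},
        ]
    },
)

# Each QUERY block links to its QUERY_RESULT block(s) via an ANSWER relationship.
blocks = {b["Id"]: b for b in response["Blocks"]}
for block in response["Blocks"]:
    if block["BlockType"] == "QUERY":
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                for answer_id in rel["Ids"]:
                    print(block["Query"]["Alias"], "->", blocks[answer_id]["Text"])
```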

Queries performs best when your question correlates clearly with the phrasing of the document, so if you have documents in many different formats it might be hard to come up with queries that perform well across all of them. That's the point where custom entity recognition might perform better, because you can annotate the docs and fine-tune a model yourself to reliably pick up the fields across the whole set.

AWS
Expert
Alex_T
Answered a year ago
  • Thank you for your response. My initial approach was to follow the two-part blog series you mentioned, but I am stuck on a Ground Truth labeling job creation issue similar to https://repost.aws/questions/QUn7gIM_MkSHmd9IzuV4_pmw/input-manifest-errors-in-sagemaker-ground-truth-for-custom-labeling-job . This was the most basic PDF; there are more complicated ones, and for each I will be making a separate CER model. I am thinking that instead of using Textract I will use the PyPDF2 Python library for text extraction, which gives 'issue date: dd/mm/yyyy' (and the same for the effective date) as a single line. Will that work?

  • Hmm, it's difficult to diagnose what's up with the labeling job without example entries from the manifest (which I guess may also be why that other question wasn't answered?). Yes, you could go with text-only models or non-OCR methods as you say, but the main drawback is that they'd be more fragile: for example, PyPDF2 relies on digital text, so it wouldn't work with scans, and rule-based extraction logic would fail if you need to process docs of a slightly different format. For an alternative ML-based method you could also check out: https://github.com/aws-samples/amazon-textract-transformer-pipeline
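
To illustrate the fragility of that text-only route (the file name and regex patterns here are assumptions, and a scanned PDF would yield no text at all):

```python
import re
from PyPDF2 import PdfReader

# Text-only extraction: works for digital PDFs, returns nothing for scans.
reader = PdfReader("sample.pdf")  # placeholder file
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Rule-based parsing is brittle: any change in wording or layout breaks it.
issue = re.search(r"issue date:\s*(\d{2}/\d{2}/\d{4})", text, re.IGNORECASE)
effective = re.search(r"effective date:\s*(\d{2}/\d{2}/\d{4})", text, re.IGNORECASE)
print(issue and issue.group(1), effective and effective.group(1))
```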
