AnalyzeDocument for Malaysian identity cards


Hi, I'm currently extracting data from Malaysian identity cards using the synchronous AnalyzeDocument operation. Amazon Textract isn't able to extract meaningful results with the FORMS feature type, so I'm using the QUERIES feature to extract the person's name, identity number, and address. Even so, accuracy is low: more than 50% of the results are unusable. Please advise which extraction type I can use to increase the accuracy. Thanks.

asked 2 months ago, 65 views
2 Answers

There are two different kinds of accuracy issues you might be struggling with here:

  1. Failure to identify the required items/entities: for example, "this is the date of birth", "this is the name", etc. The expected key is not detected, or the Query doesn't pull back the correct part of the document.
  2. Failure to correctly recognise the text: for example, substituting 0 for O in the ID number. The key-value pair or Query pulls back the correct part of the document, but the OCR is incorrect.
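
One way to tell the two cases apart is to look at what each Query actually returned and how confident Textract was. The sketch below walks an AnalyzeDocument response (as returned by the boto3 `analyze_document` call with the QUERIES feature) and pairs each QUERY block with its best QUERY_RESULT. The function name `query_answers`, the status labels, and the 80.0 threshold are illustrative assumptions, not Textract recommendations.

```python
def query_answers(response, min_confidence=80.0):
    """Map each query alias (or query text) to its best answer.

    `response` is the dict returned by Textract AnalyzeDocument.
    `min_confidence` is a hypothetical cut-off; tune it on your own data.
    """
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    answers = {}
    for block in blocks.values():
        if block.get("BlockType") != "QUERY":
            continue
        query = block["Query"]
        key = query.get("Alias") or query["Text"]
        # QUERY blocks point at their QUERY_RESULT blocks via ANSWER relationships.
        results = [
            blocks[block_id]
            for rel in block.get("Relationships", [])
            if rel["Type"] == "ANSWER"
            for block_id in rel["Ids"]
            if block_id in blocks
        ]
        best = max(results, key=lambda r: r.get("Confidence", 0.0), default=None)
        if best is None:
            # Nothing came back: likely an entity-identification problem (case 1).
            answers[key] = {"text": None, "confidence": 0.0, "status": "missing"}
            continue
        confidence = best.get("Confidence", 0.0)
        status = "ok" if confidence >= min_confidence else "low_confidence"
        answers[key] = {"text": best.get("Text"), "confidence": confidence, "status": status}
    return answers
```

Queries that come back "missing" or pointing at the wrong field suggest problem (1); answers from the right field with wrong characters suggest problem (2).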

Identifying entities

If you're struggling with (1) and have already tried Queries, I'd suggest training an Amazon Comprehend native document NER model. The linked blog posts cover collecting training data and building the model. When trained with actual document annotations (instead of plain text), Comprehend entity recognition can account for both the page position and the text content to detect entities in documents, which works much better for these kinds of tasks than a plain-text entity recognizer model.

Technically, this feature currently only supports English - but:

  • I've heard anecdotes of customers getting good results in other languages, especially with documents (like this use case) where the format is very consistent so the model can rely heavily on position cues
  • If you'd like to explore a self-managed, multi-lingual, layout-aware model on Amazon SageMaker, there's an end-to-end sample here that combines Amazon Textract with Hugging Face LayoutXLM. This sample, and the general value of using layout-aware models for document processing, is discussed further in this AWS ML Blog post.

Correcting OCR

Entity recognition models like those mentioned above just "tag" text, so they generally cannot "fix" OCR error patterns such as substituted look-alike characters or omitted characters. If you're struggling mainly with (2), there are multiple approaches you could consider:

  • Pre-processing (or validating and rejecting) images to improve image quality: For example boosting contrast, detecting and removing skew, detecting blur or glare, or validating/normalizing pixel resolution of the area of interest (the identity card). This could range from using standard image processing libraries like Pillow, to using ML models to localize the region of interest (identity card) in the image.
  • Post-processing OCR results to fix common problems:
    • For example using rules and regular expressions to validate things like dates, or ID numbers that include checksums
    • Alternatively, using sequence-to-sequence text models to "translate" raw OCR text into cleaned text and fix common patterns
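
As a rough illustration of the rule-based option, the sketch below cleans up an ID number returned by OCR. It assumes the 12-digit YYMMDD-PB-#### layout (birth date, place code, serial) used on Malaysian identity cards, and a small, non-exhaustive table of characters that OCR tends to confuse with digits. The function name `clean_id_number` and the look-alike table are illustrative choices; it checks only the layout and the birth date, not any checksum.

```python
import re
from datetime import datetime

# Illustrative (not exhaustive) map of letters OCR often substitutes for digits.
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)

# Assumed layout: 6-digit birth date, 2-digit place code, 4-digit serial.
ID_PATTERN = re.compile(r"(\d{6})(\d{2})(\d{4})")


def clean_id_number(raw):
    """Return the ID as 'YYMMDD-PB-####', or None if it fails validation."""
    # Drop spaces and hyphens, then swap look-alike letters for digits.
    compact = re.sub(r"[\s\-]", "", raw).translate(DIGIT_LOOKALIKES)
    match = ID_PATTERN.fullmatch(compact)
    if match is None:
        return None
    birth, place, serial = match.groups()
    try:
        # Reject impossible dates such as month 13 or 30 February.
        datetime.strptime(birth, "%y%m%d")
    except ValueError:
        return None
    return f"{birth}-{place}-{serial}"
```

Values that come back as None are candidates for rejection or manual review rather than silent correction, since a substitution table like this can also turn a genuinely wrong read into a plausible-looking number.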

I've worked with at least one customer in Southeast Asia who found success using a seq2seq BERT model to clean identity card data from OCR errors; and several who've applied rule-based post-processing to improve result accuracy. You could probably extend layout-aware language models to perform this kind of sequence-to-sequence/translation task too, but there's less pre-existing content around that.

answered 2 months ago

Thank you for using Textract. As a machine learning service, Amazon Textract may not be able to achieve the desired accuracy on certain documents. Given this, the generic models behind AnalyzeDocument may not work well for your use case. Additionally, our AnalyzeID API currently supports United States ID documents only. However, we are continuously improving the quality of our models. To help us improve the models for your documents, please open a customer support ticket and share your documents so we can analyze them further. Please also look out for model quality updates, which are announced on the Amazon Textract public release channel.

answered 2 months ago
