Amazon Textract OCR Issues with Mixed Language Documents


Hi Team,

I'm experiencing an issue with Amazon Textract when processing documents that contain both vernacular (non-English) and English handwritten text. The OCR results for the English handwritten text are often incorrect, and it appears that the OCR might be performed on a rotated image.

Here are the details:

When the document contains mixed languages (vernacular and English), the OCR results for English handwritten text are inaccurate. However, if I remove the vernacular language from the document and only submit the English text, the OCR results are accurate. This issue leads me to believe that the presence of the vernacular language might be causing Textract to incorrectly interpret the orientation of the English text. Is there any way to address this issue, or are there specific settings that I should adjust in Textract to improve the accuracy of OCR results for mixed-language documents?

Please find attached a word for which I am getting the OCR output "Shio". For reference I am attaching only a single word, but the entire document behaves the same way.

Reference Word

Any advice or solutions would be greatly appreciated.

Thank you!

DeepS
asked 2 months ago, 67 views
1 Answer

Preprocess the Document:

Separate the Text Blocks: If possible, preprocess the document to separate vernacular and English text into different images or sections before sending them to Textract. You could then run the OCR on each section individually and combine the results.

Manual Orientation Correction: Ensure that the document is correctly oriented before sending it to Textract. You can use image processing tools to detect and correct any misalignment or rotation.
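To check whether Textract actually read a region as rotated, one option is to inspect the geometry it returns. The following is a minimal sketch, not an official method: it assumes each WORD block's Geometry.Polygon lists its points starting at the text's own top-left corner and continuing clockwise, so the edge from point 0 to point 1 follows the reading direction. The helper names (word_angle_degrees, snap_to_quadrant, rotated_words) are hypothetical.

```python
import math


def word_angle_degrees(block, page_width=1.0, page_height=1.0):
    """Angle of the edge from polygon point 0 to point 1, in degrees.

    Textract coordinates are normalized to the page, so pass the pixel
    width and height for an undistorted angle on non-square pages.
    Assumption: point 0 is the text's top-left corner, point 1 its top-right.
    """
    polygon = block["Geometry"]["Polygon"]
    dx = (polygon[1]["X"] - polygon[0]["X"]) * page_width
    dy = (polygon[1]["Y"] - polygon[0]["Y"]) * page_height
    return math.degrees(math.atan2(dy, dx))


def snap_to_quadrant(angle):
    """Round an angle to the nearest of 0, 90, 180 or 270 degrees."""
    return (round(angle / 90.0) * 90) % 360


def rotated_words(response, page_width=1.0, page_height=1.0):
    """Return (text, quadrant) for every WORD block not read as upright."""
    found = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue
        angle = word_angle_degrees(block, page_width, page_height)
        quadrant = snap_to_quadrant(angle)
        if quadrant != 0:
            found.append((block.get("Text", ""), quadrant))
    return found
```

If the English handwritten words come back with a quadrant of 180 while the page itself is upright, that would be consistent with the text being read flipped, and it is concrete evidence to include in a support case.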

Use Language-Specific OCR Models:

While Textract doesn't allow direct selection of language models, you can preprocess the text by using other OCR tools specifically tuned for the vernacular language and English separately. You can then merge the results manually.

Custom OCR Models:

If this is a recurring issue, you might want to consider training a custom OCR model that is specifically tuned for your use case, handling mixed languages and handwritten text better than the general-purpose model in Textract.

Post-processing:

Implement a post-processing step that checks the OCR output for common errors, especially when mixing languages, and corrects them based on the expected language or context.
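As one possible starting point for such a post-processing step, the sketch below collects handwritten words whose confidence falls under a threshold so they can be reviewed or re-checked. It relies on the BlockType, TextType, Text and Confidence fields that Textract returns for WORD blocks; the function name and the default threshold of 80 are illustrative assumptions, not recommended values.

```python
def flag_suspect_words(response, min_confidence=80.0):
    """Return (text, confidence) for handwritten words below the threshold.

    `response` is the dict returned by a Textract text detection or
    analysis call. The threshold is an arbitrary example value.
    """
    suspects = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue
        if block.get("TextType") != "HANDWRITING":
            continue
        confidence = block.get("Confidence", 0.0)
        if confidence < min_confidence:
            suspects.append((block.get("Text", ""), confidence))
    return suspects
```

The flagged words could then be compared against a dictionary or a list of expected field values before any automatic correction is applied.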

Isolate the Word: Test OCR on the word in isolation, which you mentioned works well. This further supports the hypothesis that the mixed language is causing the issue.

Test with Different Configurations: Experiment with different Textract features, such as calling AnalyzeDocument with specific FeatureTypes values (like "FORMS" or "TABLES") instead of plain text detection, to see if it affects the accuracy.
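A rough way to run that comparison is sketched below. It takes an already-created boto3 Textract client and calls detect_document_text when no features are requested, or analyze_document with the FeatureTypes parameter otherwise. The helper names and the result keys are hypothetical, and whether the feature choice changes the recognized text at all is something to verify on your own documents.

```python
def call_textract(client, image_bytes, feature_types=None):
    """Call plain text detection, or AnalyzeDocument when features are given."""
    document = {"Bytes": image_bytes}
    if feature_types:
        return client.analyze_document(
            Document=document, FeatureTypes=list(feature_types)
        )
    return client.detect_document_text(Document=document)


def compare_configurations(client, image_bytes,
                           configs=(None, ("FORMS",), ("TABLES",))):
    """Return the detected words for each configuration, keyed by a label."""
    results = {}
    for features in configs:
        response = call_textract(client, image_bytes, features)
        words = [
            block["Text"]
            for block in response.get("Blocks", [])
            if block.get("BlockType") == "WORD"
        ]
        label = "+".join(features) if features else "DetectDocumentText"
        results[label] = words
    return results
```

In practice the client would come from boto3.client("textract") and image_bytes from reading the scanned page. Each configuration is a separate billed request, so it makes sense to try this on a single sample page first.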

Unfortunately, Textract doesn't allow much fine-tuning of OCR settings directly through the API, but these workarounds might help improve accuracy for your use case.

AWS
EXPERT
Deeksha
answered 2 months ago
  • I am sending a correctly oriented document, but unfortunately Textract itself appears to rotate the document internally and returns incorrect text. I could pass the words separately, but that would multiply the number of requests and the billing.

    I am using the Textract service to detect the words in the first place, which is why I pass the entire document, so it is not possible to isolate the word beforehand. I also noticed that if I white out most of the vernacular words and keep only a couple, it still returns wrong results (such as flipped text) for the English handwriting.
