Bug regarding Amazon Textract


Preamble: We need to extract a unique ID from an image. The ID consists of 8 digits followed by a letter, like so: 99999999C. It appears in a Spanish document stored as a PNG file. The API used is detect-document-text.
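For readers who want to reproduce the setup, here is a minimal sketch of such a call with boto3, listing each detected word with its confidence score. The helper names `detect_words` and `word_confidences` are illustrative assumptions and not the poster's code; AWS credentials and region are assumed to be configured already.

```python
def detect_words(image_bytes, region_name=None):
    """Send raw PNG bytes to Textract DetectDocumentText and return the response."""
    import boto3  # imported lazily so the parsing helper works without AWS access

    client = boto3.client("textract", region_name=region_name)
    return client.detect_document_text(Document={"Bytes": image_bytes})


def word_confidences(response):
    """Return (text, confidence) for every WORD block in a Textract response."""
    words = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "WORD":
            words.append((block.get("Text", ""), block.get("Confidence")))
    return words


# Hypothetical usage (the file name is a placeholder):
# with open("document.png", "rb") as f:
#     for text, confidence in word_confidences(detect_words(f.read())):
#         print(text, confidence)
```

Looking at the confidence reported for the ID token may show whether Textract itself was unsure about the last character.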

The problem is as follows: when we extract the text from the original PNG, we obtain 999999990; in other words, the final letter is read as a zero. We then took a snapshot of the original image and cut 2 cm from the right side, and that snapshot produced the correct output.

Afterward, when we cut those 2 cm from the original PNG document, we still get the same inaccurate 999999990 response.
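Since the misread value has a recognizable shape (nine digits instead of eight digits plus a letter), a format check can at least flag it for a retry or manual review. This is a minimal sketch, assuming the ID is always 8 digits followed by one uppercase letter; the function name `check_id` and the three labels are illustrative assumptions.

```python
import re

# Assumed format, taken from the description: 8 digits followed by one letter.
ID_PATTERN = re.compile(r"\d{8}[A-Z]")
# Shape of the reported misread: the final letter comes back as a digit.
MISREAD_PATTERN = re.compile(r"\d{9}")


def check_id(text):
    """Classify an extracted token as 'ok', 'suspect', or 'invalid'."""
    cleaned = text.strip().upper()
    if ID_PATTERN.fullmatch(cleaned):
        return "ok"
    if MISREAD_PATTERN.fullmatch(cleaned):
        # Right length, but the last character is a digit instead of a letter.
        return "suspect"
    return "invalid"


print(check_id("99999999C"))
print(check_id("999999990"))
```

This does not fix the recognition itself; it only makes the failure detectable so that the affected documents can be handled separately.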

What can we do to get a more accurate result? We have tried cutting the sides programmatically, adding DPI metadata, increasing contrast by 20-40%, converting the document to grayscale, and programmatically cropping the document to eliminate margins, but the same faulty data is still being extracted.

Edit: the document is perfectly human-readable.

Edit 2: Document with error (the real one has no censorship, but the error persists): [image with error]

Image with no error (the real one also has no censorship; both have the same quality): [image with no error]

asked 14 days ago, 73 views
1 Answer

Hi, thank you for using Textract. We are sorry that you're facing issues with detection accuracy. Would you be able to share the document so we can help further?

AWS
answered 7 days ago
  • I have updated the question with the censored images.

  • Also, note that the documents we tested contain more information and are not censored.
