Amazon Textract


We are using the Amazon Textract AnalyzeDocument API to extract data from enrollment forms in PDF and JPEG format. For PDF forms that were filled in online, Textract returns incorrect results for single-select radio buttons and checkboxes, and it fails to recognize a few text fields.

The screenshot below shows an online-filled PDF form where a single-select checkbox is selected, but Textract returns "Not Selected" in the response.

[Screenshot: online-filled PDF form and the Textract result]

If we upload the same form in JPEG format, Textract correctly recognizes the single-select radio buttons and checkboxes, as well as the first name and last name fields.

[Screenshot: the same form in JPEG format and the Textract result]

How can we fix this? Any help resolving these issues would be appreciated.

Asked 2 months ago · 151 views
1 Answer

To troubleshoot this issue with Textract, check the following:

  • Make sure the document uses a language supported by Textract (English, Spanish, Italian, Portuguese, French, German). Accuracy may be lower for other languages.
  • Provide high-quality input (150 DPI or higher) in a format such as PDF, JPEG or PNG. Converting or downsampling the image before analysis can affect the results.
  • Single-select radio buttons and checkboxes can be difficult for Textract to interpret correctly. You may need to post-process the results to determine which option was actually selected.
  • If certain text fields are not being recognized, the font, size or layout of those fields may be making them harder to extract. Try preprocessing the document, or that portion of it, to clean it up before analysis.
  • The Textract console shows bounding box information that can help you validate extractions. You can also download the full JSON response for deeper analysis.
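The post-processing step mentioned above could look roughly like the following. This is a minimal sketch, not an official recipe: it assumes you already have the AnalyzeDocument JSON response (with the FORMS feature) loaded as a Python dict, and the helper name `selection_statuses` is made up for illustration. It pairs each form key with the status, confidence and bounding box of its selection element so that low-confidence or suspicious values can be reviewed.

```python
from typing import Any


def selection_statuses(response: dict[str, Any]) -> list[dict[str, Any]]:
    """List every checkbox/radio button found under a form key.

    `response` is assumed to be the dict returned by AnalyzeDocument
    (or the JSON downloaded from the console) with FORMS enabled.
    """
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}

    def related(block: dict[str, Any], rel_type: str) -> list[dict[str, Any]]:
        # Follow relationships of one type (CHILD or VALUE) to their blocks.
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == rel_type:
                ids.extend(rel["Ids"])
        return [blocks[i] for i in ids if i in blocks]

    results = []
    for block in blocks.values():
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        # The key's label is the text of its WORD children.
        label = " ".join(c["Text"] for c in related(block, "CHILD")
                         if c["BlockType"] == "WORD")
        for value in related(block, "VALUE"):
            for child in related(value, "CHILD"):
                if child["BlockType"] == "SELECTION_ELEMENT":
                    results.append({
                        "label": label,
                        "status": child["SelectionStatus"],
                        "confidence": child.get("Confidence"),
                        "box": child.get("Geometry", {}).get("BoundingBox"),
                    })
    return results
```

Comparing the output for the PDF and JPEG versions of the same form shows which fields differ and how confident Textract was in each case, which is useful detail to include if you open a support case.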
Expert
Answered 2 months ago
