Textract: Identify checkbox Selection/deselection accurately

Hi, I am facing the following issue when running OCR on checkboxes in a table in a PDF using StartDocumentAnalysis (async OCR) in Textract. My PDF has 3 checkboxes in the last 3 columns of a 6-column table (see the attached image). In the JSON response from Textract, the confidence of the containing cell is higher (> 90%) than that of the checkbox itself (< 75%). This occurs for both selected and non-selected checkboxes. For example, the JSON of the containing cell has confidence > 90%:

{
      "BlockType": "CELL",
      "Confidence": 90.72265625,
      "RowIndex": 2,
      "ColumnIndex": 4,
      "RowSpan": 1,
      "ColumnSpan": 1,
      "Geometry": {
        "BoundingBox": {
          "Width": 0.19133399426937103,
          "Height": 0.03532093018293381,
          "Left": 0.7058268189430237,
          "Top": 0.3528282940387726
        },
        "Polygon": [
          {
            "X": 0.7058268189430237,
            "Y": 0.3528819978237152
          },
          {
            "X": 0.8971458673477173,
            "Y": 0.3528282940387726
          },
          {
            "X": 0.8971608281135559,
            "Y": 0.38809624314308167
          },
          {
            "X": 0.705838501453399,
            "Y": 0.388149201869964
          }
        ]
      },
      "Id": "6107a30f-2b80-45ec-9dc4-202de522bfb4",
      "Relationships": [
        {
          "Type": "CHILD",
          "Ids": [
            "a2971151-e92f-4f2f-b7b2-bf2cc1b3946a",
            "2432b2e4-3ab4-4f9e-ad5b-04da0369b7a8"
          ]
        }
      ],
      "Page": 1
    }

But the checkbox inside the cell has lower confidence (< 75%):

{
      "BlockType": "SELECTION_ELEMENT",
      "Confidence": 73.828125,
      "Geometry": {
        "BoundingBox": {
          "Width": 0.02931755594909191,
          "Height": 0.019549237564206123,
          "Left": 0.7678554058074951,
          "Top": 0.36045578122138977
        },
        "Polygon": [
          {
            "X": 0.7678554058074951,
            "Y": 0.3604639768600464
          },
          {
            "X": 0.7971656322479248,
            "Y": 0.36045578122138977
          },
          {
            "X": 0.7971729636192322,
            "Y": 0.3799968659877777
          },
          {
            "X": 0.7678624391555786,
            "Y": 0.38000503182411194
          }
        ]
      },
      "Id": "2432b2e4-3ab4-4f9e-ad5b-04da0368b7a8",
      "SelectionStatus": "SELECTED",
      "Page": 1
    }

Please advise how we can improve the confidence level of the checkboxes, given that the checkboxes in the attached image of the PDF (above) are visually fairly clear. Thanks.

asked a month ago, 33 views
1 Answer

To improve the confidence level of checkbox detection and selection status in Amazon Textract, you can try several approaches:

  1. Image Quality: Ensure that the input document has high image quality. Higher resolution and clearer images generally lead to better recognition accuracy. If possible, try to improve the quality of the source document.

  2. Contrast: Enhance the contrast between the checkbox and its background. This can make it easier for Textract to identify the checkbox boundaries and selection status.

  3. Checkbox Size: Make sure the checkboxes are of a reasonable size. If they are too small, it may be difficult for Textract to accurately detect them.

  4. Consistent Formatting: Use consistent checkbox styles throughout the document. Textract does not learn from an individual document, but a uniform style tends to make detection results more predictable from one checkbox to the next.

  5. Clear Markings: Ensure that the markings inside the checkboxes (for selected boxes) are clear and distinct. Avoid faint or partial markings.

  6. Avoid Overlapping: Make sure the checkboxes don't overlap with other elements in the document, such as text or lines.

  7. Pre-processing: Consider pre-processing the image before submitting it to Textract. This could involve techniques like noise reduction or sharpening to make the checkboxes more distinct.

  8. Custom Queries: The Queries feature of Amazon Textract lets you ask natural-language questions about a document, for example a question that names the field a checkbox belongs to. For some layouts this can yield better results than relying on automatic detection alone.

  9. Multiple Attempts: If possible, reprocess the document at a later date and compare the results. AWS updates Textract's ML models over time, so results for the same document may change between runs made at different times.

  10. Feedback to AWS: If you consistently encounter low confidence scores for clear checkboxes, consider providing feedback to the AWS Textract team. They may be able to offer more specific advice or use your feedback to improve the service.
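
Items 2 and 7 above can be illustrated with a simple linear contrast stretch. The sketch below only shows the arithmetic, assuming the page has already been rasterized into rows of 8-bit grayscale values; `stretch_contrast` and its percentile parameters are hypothetical names and not part of Textract, and in practice an imaging library would do this on the rendered page. Whether it actually raises the checkbox confidence for a given document has to be checked empirically.

```python
def stretch_contrast(pixels, low_pct=2.0, high_pct=98.0):
    """Linearly rescale 8-bit grayscale values.

    `pixels` is a list of rows, each row a list of ints in 0..255.
    The values found at the low/high percentiles are mapped to 0 and 255,
    and everything outside that range is clipped.
    """
    flat = sorted(value for row in pixels for value in row)
    if not flat:
        return []

    # Pick the percentile values by index into the sorted pixel list.
    lo = flat[int((len(flat) - 1) * low_pct / 100)]
    hi = flat[int((len(flat) - 1) * high_pct / 100)]
    if hi <= lo:
        # Flat image: nothing to stretch, return an unchanged copy.
        return [list(row) for row in pixels]

    scale = 255.0 / (hi - lo)
    return [
        [min(255, max(0, round((value - lo) * scale))) for value in row]
        for row in pixels
    ]
```

A faint gray checkbox outline on a light gray background ends up closer to black on white after this step, which is the effect items 2 and 7 are aiming for.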

Remember that while these steps can help improve recognition accuracy, some variation in confidence scores is normal due to the nature of machine learning-based recognition systems.
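
Because some variation is expected, it can help to triage checkboxes by their own confidence rather than by the confidence of the containing cell. The following is a minimal sketch, assuming `blocks` is the combined `Blocks` list from the GetDocumentAnalysis responses, with the same fields as the JSON in the question. The function name `checkbox_report` and the default threshold of 80 are illustrative assumptions, not Textract recommendations.

```python
def checkbox_report(blocks, min_confidence=80.0):
    """List every SELECTION_ELEMENT found inside a table CELL.

    Each CELL points to its children through a CHILD relationship, so the
    checkbox is looked up by Id and reported next to the cell's position.
    `min_confidence` is an arbitrary cut-off for flagging manual review.
    """
    by_id = {block["Id"]: block for block in blocks}
    rows = []
    for cell in blocks:
        if cell.get("BlockType") != "CELL":
            continue
        for relationship in cell.get("Relationships") or []:
            if relationship.get("Type") != "CHILD":
                continue
            for child_id in relationship.get("Ids", []):
                child = by_id.get(child_id)
                if child is None or child.get("BlockType") != "SELECTION_ELEMENT":
                    continue
                confidence = child.get("Confidence", 0.0)
                rows.append({
                    "row": cell.get("RowIndex"),
                    "column": cell.get("ColumnIndex"),
                    "status": child.get("SelectionStatus"),
                    "cell_confidence": cell.get("Confidence"),
                    "checkbox_confidence": confidence,
                    "needs_review": confidence < min_confidence,
                })
    return rows
```

Rows with `needs_review` set can be routed to a manual check, while the `SelectionStatus` of the remaining rows is used directly.
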
Sources
Recognition of '1' from table filled in by hand | AWS re:Post
Amazon Textract Features | AWS

answered a month ago