Textract table extraction splits one table into a left part and a right part. How to get past this?


Since the March 2022 release update of AWS Textract (you can find the announcement here: https://aws.amazon.com/about-aws/whats-new/2021/04/amazon-textract-announces-quality-update-table-extraction-feature/), the AnalyzeDocument API has shown unexpected behavior.
I have been trying to process a single-page PDF document through Textract. The document contains just one table of 40 rows and 25 columns. In the output, Textract breaks the table at the 4th column, dividing it into 2 tables (a left table with 3 columns and a right table with the remaining columns). This break reduces the accuracy of my output. This behavior did not occur before the latest release. I can say this because I still have the outputs from the previous release, and there the output was one single table, just like the source document. Can someone from the community help me with this? Is there an extra parameter I have to pass when calling Textract to avoid this break? Thanks in advance.

2 Answers

I am not aware of any parameter that prevents table splitting. The new release improves the detection of table row and column boundaries, and you should see these improvements for many documents. However, I suggest having a look at this blog post, which describes how to merge tables with postprocessing. The solution described in the blog post merges tables from different pages, but you can use the same idea and adapt the code to solve your issue.
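As a rough illustration of that idea adapted to this case, here is a minimal sketch (not the code from the blog post) that stitches side-by-side tables back together from an AnalyzeDocument response. It relies only on the documented block fields (BlockType, RowIndex, ColumnIndex, Relationships, Geometry.BoundingBox). The function names table_to_grid and merge_side_by_side are my own, and it assumes that every TABLE block on the page is a fragment of the same table, that all fragments have the same number of rows, and that no cells span multiple rows or columns.

```python
from typing import Dict, List


def table_to_grid(table: dict, blocks_by_id: Dict[str, dict]) -> List[List[str]]:
    """Convert one TABLE block into a list of rows, each a list of cell texts."""
    cells = [
        blocks_by_id[child_id]
        for rel in table.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for child_id in rel["Ids"]
        if blocks_by_id[child_id]["BlockType"] == "CELL"
    ]
    if not cells:
        return []
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    grid = [[""] * n_cols for _ in range(n_rows)]
    for cell in cells:
        # Cell text is the concatenation of the cell's WORD children.
        words = [
            blocks_by_id[word_id]["Text"]
            for rel in cell.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for word_id in rel["Ids"]
            if blocks_by_id[word_id]["BlockType"] == "WORD"
        ]
        # RowIndex and ColumnIndex are 1-based in the response.
        grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
    return grid


def merge_side_by_side(response: dict) -> List[List[str]]:
    """Join all TABLE blocks left to right into a single grid.

    Assumption: every table in the response is a horizontal fragment of the
    same source table and all fragments have the same number of rows.
    """
    blocks = response["Blocks"]
    blocks_by_id = {b["Id"]: b for b in blocks}
    tables = [b for b in blocks if b["BlockType"] == "TABLE"]
    # Order the fragments by the left edge of their bounding box.
    tables.sort(key=lambda t: t["Geometry"]["BoundingBox"]["Left"])
    grids = [table_to_grid(t, blocks_by_id) for t in tables]
    if len({len(g) for g in grids}) > 1:
        raise ValueError("table fragments have different row counts")
    # Concatenate the matching rows of each fragment.
    return [sum(row_parts, []) for row_parts in zip(*grids)]
```

You would pass in the dictionary returned by AnalyzeDocument with the TABLES feature enabled. If the fragments do not have matching row counts, you would need to align rows by the Top coordinate of each cell's bounding box instead of by RowIndex.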

answered 2 years ago

Hi, AWS-User-9613621

Thank you for using Amazon Textract. I am sorry to hear that you are seeing issues with our latest table model. It would be helpful if you could provide us with a sample file that shows the issue, your AWS account ID, and the region you are operating in. You can contact AWS Support to share these details. This will help us identify the issue.

answered 2 years ago
