Boto3 Textract start_document_analysis response changes breaking existing implementation


Even after pinning boto3 to 1.19.5 in our Lambda function, we still get the latest response structure from the start_document_analysis method. Is there a way to get the old response structure for start_document_analysis?

Earlier we got only one table per page. Since the latest change (https://github.com/boto/boto3/blob/develop/CHANGELOG.rst#1216), we get multiple tables for the same page, even with the older boto3 version.

Please let us know how to get the older response structure.

asked 2 years ago, 265 views
1 Answer

Textract did update the table model to support merged cells and table headers: https://aws.amazon.com/about-aws/whats-new/2022/03/amazon-textract-updates-tables-check-detection/

The update adds a new BlockType called "MERGED_CELL", a Relationship Type "MERGED_CELL", and an EntityType "COLUMN_HEADER". If you don't need those, you can ignore them.

Outside of those additions, the response is the same as the "older" one: every CELL block of a TABLE is still listed in that TABLE's CHILD Relationship. See: https://docs.aws.amazon.com/textract/latest/dg/how-it-works-tables.html
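
To illustrate ignoring the additions, here is a minimal sketch that rebuilds each table from the TABLE -> CHILD -> CELL links only and skips MERGED_CELL blocks and relationships. The function name tables_as_grids and the sample_blocks data are made up for this example. It assumes you have already collected the "Blocks" list from get_document_analysis, and it only joins WORD children, so selection elements inside cells are not handled.

```python
def tables_as_grids(blocks):
    """Return {table_id: {(row, col): text}} using only CHILD links to CELL blocks."""
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        # Follow CHILD relationships only; MERGED_CELL relationships are skipped.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    def cell_text(cell):
        # Join the WORD blocks referenced by the cell.
        words = []
        for cid in child_ids(cell):
            child = by_id.get(cid)
            if child and child["BlockType"] == "WORD":
                words.append(child["Text"])
        return " ".join(words)

    grids = {}
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        grid = {}
        for cid in child_ids(block):
            cell = by_id.get(cid)
            # Keep plain CELL blocks; anything else is ignored.
            if cell and cell["BlockType"] == "CELL":
                grid[(cell["RowIndex"], cell["ColumnIndex"])] = cell_text(cell)
        grids[block["Id"]] = grid
    return grids


# Hand-written sample shaped like a Textract response (not real output).
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE", "Relationships": [
        {"Type": "CHILD", "Ids": ["c1", "c2"]},
        {"Type": "MERGED_CELL", "Ids": ["m1"]},
    ]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "EntityTypes": ["COLUMN_HEADER"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "m1", "BlockType": "MERGED_CELL", "RowIndex": 1, "ColumnIndex": 1,
     "RowSpan": 1, "ColumnSpan": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Total"},
]

grids = tables_as_grids(sample_blocks)
print(grids)
```

If your existing code assumed exactly one TABLE block per page, iterating over every key in the returned dictionary is one way to cope with pages that now yield several tables.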

I recommend using https://pypi.org/project/amazon-textract-response-parser/ for parsing the response in Python.

AWS
answered 2 years ago
