Textract/Textractor - Separating table and non-table data


I am extracting data from documents that contain both tables and text that is not in table format (the documents do not include figures). I would like to separate table data from non-table data because my postprocessing differs for the two. I am working in Python and calling AnalyzeDocument with the TABLES and LAYOUT FeatureTypes. However, the LAYOUT output also includes the text from the tables, which makes it difficult to isolate the non-table data. Can you suggest a way to separate the table text from the non-table text? Can it be done using FeatureTypes, or does it need to be done at the Block level? Can you point me to any sample code?

asked 23 days ago, 106 views
1 Answer

Hello, good afternoon,

Thank you for your question. The Amazon Textract Textractor library, published under AWS Samples, may help. Link: https://github.com/aws-samples/amazon-textract-textractor?tab=readme-ov-file

Its README also points to these related packages:

  • amazon-textract-caller (to simplify calling Amazon Textract without additional dependencies)
  • amazon-textract-response-parser (to parse the JSON response returned by Textract APIs)
  • amazon-textract-overlayer (to draw bounding boxes around the document entities on the document image)
  • amazon-textract-prettyprinter (to convert the Amazon Textract response to CSV, text, markdown, ...)
  • amazon-textract-geofinder (to extract specific information from a document with methods that help navigate it using geometry and relations, e.g. hierarchical key/value pairs)

You can probably use amazon-textract-response-parser to parse the response and then separate out the non-table data. See: https://pypi.org/project/amazon-textract-response-parser/
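
If you would rather work at the Block level without extra dependencies, here is a minimal sketch that operates directly on the "Blocks" list of an AnalyzeDocument response. It assumes the documented relationship structure (TABLE blocks point to CELL blocks through CHILD relationships, and CELL and LINE blocks point to WORD blocks the same way). The function names split_lines_by_table and _child_ids are hypothetical helpers made up for this example, not part of any AWS library. Treat it as a starting point, not a definitive implementation.

```python
def _child_ids(block):
    """Return every Id listed in the block's CHILD relationships."""
    ids = []
    for rel in block.get("Relationships") or []:
        if rel["Type"] == "CHILD":
            ids.extend(rel["Ids"])
    return ids


def split_lines_by_table(blocks):
    """Split LINE text into (table_lines, non_table_lines).

    Assumption: a LINE counts as table text when at least one of its
    WORD children is also a child of a CELL that belongs to a TABLE.
    """
    by_id = {block["Id"]: block for block in blocks}

    # Collect the Ids of all WORD blocks that sit inside table cells.
    table_word_ids = set()
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        for cell_id in _child_ids(block):
            cell = by_id.get(cell_id)
            if cell is not None and cell["BlockType"] == "CELL":
                table_word_ids.update(_child_ids(cell))

    # Classify each LINE by whether it shares a WORD with any table cell.
    table_lines, non_table_lines = [], []
    for block in blocks:
        if block["BlockType"] != "LINE":
            continue
        if table_word_ids.intersection(_child_ids(block)):
            table_lines.append(block["Text"])
        else:
            non_table_lines.append(block["Text"])
    return table_lines, non_table_lines
```

With boto3 you would pass response["Blocks"] from analyze_document to split_lines_by_table. Note that this sketch only follows CHILD relationships, so table titles and footers, which Textract links to the table through other relationship types, end up in the non-table list. A line that straddles a cell boundary is counted as table text.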

Let me know if it helps.

Thank you.

AWS
answered 23 days ago
EXPERT
reviewed 23 days ago
  • Yes, thank you. I appreciate the help. I am familiar with the documentation and code samples. I have not come across anything yet that I recognized as a possible solution.
