Textract JSON extracts


Hello. Using Textract, I am able to OCR-extract the data in tables and forms to JSON downloads. However, I have observed that in the JSON file, every word and statement in a line has a unique ID. If I need to read the JSON data into my application with fields like address, how can I read data that is spread across different lines? Is there a way to connect the IDs of a specific section, such as an address, to one ID and read that ID into my application?

Rajesh
asked 10 months ago · 1222 views
3 Answers

This may help you.

https://aws.amazon.com/about-aws/whats-new/2022/03/amazon-comprehend-custom-entities-images-json/

Extract custom entities from images and Textract JSON files with Amazon Comprehend

Posted On: Mar 24, 2022

Amazon Comprehend now supports documents in image formats in addition to text, PDFs, and Word. Customers can now use Comprehend custom entity recognition to extract entities from image files (JPG, PNG, TIFF) and can also use Comprehend directly on Amazon Textract JSON outputs to extract custom entities from documents. With this launch customers can simplify their intelligent document processing (IDP) workflows, taking advantage of an out-of-the-box integration between Comprehend and Textract to extract entities from documents. Below is a detailed description of these features:

Custom NER on image files - Amazon Comprehend previously launched custom entity recognition support for PDF and Word documents (see announcement for details). Starting today, customers can use Comprehend to also extract information from documents in image files (JPG, PNG, TIFF) to further support diverse document processing workflows. This feature removes the need to post-process OCR output before completing entity extraction with Comprehend. Customers first annotate and train a custom entity recognition model on PDF documents. The trained custom entity recognition model leverages both the natural language and positional information (e.g., coordinates) of the text to accurately extract custom entities from PDF, Word, plain text, and now image formats during inference. See documentation for more details.

Custom NER on Textract JSON outputs - Starting today, customers can use their Textract DetectDocumentText or AnalyzeDocument JSON outputs as an input during Comprehend custom NER inference. By leveraging an existing Textract output, customers can further simplify their document processing workflows (saving time and money), and extend their workflows to extract custom entities from a broader set of documents. See documentation for more details.
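As a rough sketch of what kicking off such a job looks like with boto3, the request below points Comprehend's `start_entities_detection_job` at a prefix of Textract JSON files. All ARNs, bucket names, and the job name are placeholders, and this assumes a custom entity recognizer has already been trained:

```python
# Sketch only: ARNs and S3 URIs are placeholders, not real resources.
params = {
    "JobName": "extract-custom-entities",  # hypothetical job name
    "EntityRecognizerArn": "arn:aws:comprehend:us-east-1:111122223333:entity-recognizer/my-recognizer",
    "DataAccessRoleArn": "arn:aws:iam::111122223333:role/ComprehendDataAccessRole",
    "LanguageCode": "en",
    "InputDataConfig": {
        # Prefix holding the Textract DetectDocumentText / AnalyzeDocument
        # JSON outputs that Comprehend should consume as input.
        "S3Uri": "s3://my-bucket/textract-json/",
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    "OutputDataConfig": {"S3Uri": "s3://my-bucket/ner-output/"},
}

# With AWS credentials configured, the job would be started like this:
# import boto3
# comprehend = boto3.client("comprehend")
# job = comprehend.start_entities_detection_job(**params)
print(sorted(params.keys()))
```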

To learn more and get started, visit the Amazon Comprehend product page.

AWS
EXPERT
iBehr
answered 10 months ago

Hi,

You may also take a look at the PyPI implementation of the Amazon Textract response parser. Below is brief context on the Amazon Textract Response Parser PyPI library.

You can use the Textract response parser library to easily parse the JSON returned by Amazon Textract. The library parses the JSON and provides programming-language-specific constructs to work with different parts of the document. textractor is an example of a PoC batch-processing tool that takes advantage of the Textract response parser library and generates output in multiple formats.

Link: https://pypi.org/project/amazon-textract-response-parser/
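To see the idea behind what the library automates (and what the original question asks about, i.e., connecting the word IDs of a section like "Address" back to one key), here is a hand-rolled sketch that resolves KEY_VALUE_SET relationships in an AnalyzeDocument-style response. The response dict is synthetic sample data, not real Textract output:

```python
# Resolve Textract's block-ID graph by hand: each KEY block points to a
# VALUE block (Relationships Type "VALUE") and to its WORD children
# (Relationships Type "CHILD").

def text_of(block, blocks_by_id):
    """Concatenate the WORD children of a block via its CHILD relationships."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for cid in rel["Ids"]:
                child = blocks_by_id[cid]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)

def form_fields(response):
    """Return {key_text: value_text} from the KEY_VALUE_SET blocks."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    fields = {}
    for block in response["Blocks"]:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            key_text = text_of(block, blocks_by_id)
            value_text = ""
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for vid in rel["Ids"]:
                        value_text = text_of(blocks_by_id[vid], blocks_by_id)
            fields[key_text] = value_text
    return fields

# Synthetic response: one "Address" key whose value spans two WORD blocks.
response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Address"},
        {"Id": "w2", "BlockType": "WORD", "Text": "123"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Main St"},
    ]
}

print(form_fields(response))  # {'Address': '123 Main St'}
```

The response parser library does this traversal (and much more) for you, so in practice you would feed it the real Textract JSON instead of walking the IDs yourself.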

AWS
answered 10 months ago

In this article:

[TLDR]

You can easily take advantage of Amazon Textract API operations using the AWS SDK to build powerful applications. We also use Amazon Textract Helper, Amazon Textract Caller, Amazon Textract PrettyPrinter [1], and Amazon Textract Response Parser [2] for some of the following use cases. These packages are published to PyPI to speed up development and integration even further.

[1] https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter

[2] https://github.com/aws-samples/amazon-textract-response-parser
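To give a rough, hand-rolled idea of the kind of flattening these helper packages perform (this is not the PrettyPrinter API itself, just an illustration with made-up sample data), joining Textract's LINE blocks back into readable text looks like this:

```python
# Flatten a Textract-style response into plain text by joining LINE blocks
# in the order they are returned. The response dict is synthetic.

def lines_to_text(response):
    """Join the Text of every LINE block, one line per row."""
    return "\n".join(
        b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"
    )

response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice #42"},
        {"BlockType": "LINE", "Text": "Total: $19.99"},
        {"BlockType": "WORD", "Text": "Invoice"},  # WORDs are children of LINEs
    ]
}

print(lines_to_text(response))
```

The real packages add table- and form-aware formatting (CSV, Markdown, etc.) on top of this basic idea.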

AWS
answered 10 months ago
