Textract JSON extracts


Hello. Using Textract, I am able to OCR a document and download the data from its tables and forms as JSON. However, I have observed that in the JSON file every word and every line gets its own unique ID. If my application needs to read a field such as an address, whose value spans several lines, how can I read that data from the JSON? Is there a way to connect the IDs of a specific section, such as the address, to a single ID and read that ID in my application?

Rajesh
Asked 10 months ago · 1274 views
3 answers

This may help you.

https://aws.amazon.com/about-aws/whats-new/2022/03/amazon-comprehend-custom-entities-images-json/

Extract custom entities from images and Textract JSON files with Amazon Comprehend

Posted On: Mar 24, 2022

Amazon Comprehend now supports documents in image formats in addition to text, PDFs, and Word. Customers can now use Comprehend custom entity recognition to extract entities from image files (JPG, PNG, TIFF) and can also use Comprehend directly on Amazon Textract JSON outputs to extract custom entities from documents. With this launch customers can simplify their intelligent document processing (IDP) workflows, taking advantage of an out-of-the-box integration between Comprehend and Textract to extract entities from documents. Below is a detailed description of these features:

Custom NER on image files - Amazon Comprehend previously launched custom entity recognition support for PDF and Word documents (see announcement for details). Starting today, customers can use Comprehend to also extract information from documents in image files (JPG, PNG, TIFF) to further support diverse document processing workflows. This feature removes the need of post-processing OCR output prior to completing entity extraction with Comprehend. Customers first annotate and train a custom entity recognition model on PDF documents. The trained custom entity recognition model leverages both the natural language and positional information (e.g. coordinates) of the text to accurately extract custom entities from PDF, Word, plain text, and now, image formats during inference. See documentation for more details.

Custom NER on Textract JSON outputs - Starting today, customers can use their Textract DetectDocumentText or AnalyzeDocument JSON outputs as an input during Comprehend custom NER inference. By leveraging an existing Textract output, customers can further simplify their document processing workflows (saving time and money), and extend their workflows to extract custom entities from a broader set of documents. See documentation for more details.

To learn more and get started, visit the Amazon Comprehend product page.

AWS
Expert
iBehr
Answered 10 months ago

Hi,

You may also want to look at the Amazon Textract Response Parser, which is published on PyPI. Some brief context on the library:

You can use the Textract response parser library to easily parse the JSON returned by Amazon Textract. The library parses the JSON and provides programming-language-specific constructs for working with different parts of the document. textractor is an example of a PoC batch processing tool that uses the Textract response parser library and generates output in multiple formats.

Link https://pypi.org/project/amazon-textract-response-parser/
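
For context on what a parser has to do with the raw JSON: each block carries an `Id` and a `Relationships` list, and for form data (AnalyzeDocument with the FORMS feature) a `KEY_VALUE_SET` block tagged `KEY` points to its `VALUE` block, which in turn points to its `WORD` children. Following those links collects a multi-line value such as an address under one key. The snippet below is only a rough sketch in plain Python, not the parser library's own code; the function names `block_text` and `form_fields` and the sample IDs are made up for illustration, and it ignores checkboxes (`SELECTION_ELEMENT`) and multi-page responses.

```python
def block_text(block, blocks_by_id):
    """Join the Text of a block's CHILD WORD blocks, in listed order."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def form_fields(response):
    """Map each form key's text to its value's text (hypothetical helper)."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    fields = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        key = block_text(block, blocks_by_id)
        parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                # Follow the link from the KEY block to its VALUE block(s).
                for value_id in rel["Ids"]:
                    parts.append(block_text(blocks_by_id[value_id], blocks_by_id))
        fields[key] = " ".join(parts)
    return fields


# Tiny hand-written sample in the shape of a Textract response (IDs are invented).
sample = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3", "w4", "w5"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Address:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "123"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Main"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Street"},
    {"Id": "w5", "BlockType": "WORD", "Text": "Springfield"},
]}

print(form_fields(sample))
```

In a real application you would load the downloaded file with `json.load` and pass the resulting dictionary to `form_fields`, then look up the key you care about (the exact key text, such as "Address:", depends on your document).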

AWS
Answered 10 months ago

In this article:

[TLDR]

You can easily take advantage of Amazon Textract API operations using the AWS SDK to build smart applications. We also use Amazon Textract Helper, Amazon Textract Caller, Amazon Textract PrettyPrinter[1], and Amazon Textract Response Parser[2] for some of the following use cases. These packages are published to PyPI to speed up development and integration even further.

[1] https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter

[2] https://github.com/aws-samples/amazon-textract-response-parser

AWS
Answered 10 months ago
