Amazon Textract Analyze Document API: using the AWS SDK


I am using the Amazon Textract Analyze Document API and need sample Java code, using the AWS SDK, that shows how to get the key and value for the FORMS FeatureType. I am able to get the GetDocumentAnalysisResponse and, from it, the List<Block>. Once I have the Block list, how do I get the KEY and VALUE for the form fields?

Thanks in advance

Asked 2 years ago · 619 views
1 answer

Hello and good evening,

The question is about how to extract form (key-value) information from the response of the asynchronous document analysis API.

Main documentation: Detecting or Analyzing Text in a Multipage Document

Change you need to make: in the sample code of the above documentation, the GetDocumentAnalysisResults method calls a DisplayBlockInfo method for every block. This is where you need to parse the Block object.

            //Show blocks information
            List<Block> blocks= response.getBlocks();
            for (Block block : blocks) {
                DisplayBlockInfo(block);
            }

Sample Java code showing how to tell KEY and VALUE blocks apart in the response is in the documentation for the synchronous Analyze Document API call. Specifically:

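            // Excerpt from the synchronous sample; ShowBoundingBox, height,
            // width and g2d are defined elsewhere in that sample.
            switch(block.getBlockType()) {
            case "KEY_VALUE_SET":
                if (block.getEntityTypes().contains("KEY")){
                    // KEY block: outline in red
                    ShowBoundingBox(height, width, block.getGeometry().getBoundingBox(), g2d, new Color(255,0,0));
                }
                else {  // VALUE block: outline in green
                    ShowBoundingBox(height, width, block.getGeometry().getBoundingBox(), g2d, new Color(0,255,0));
                }
                break;
            // ... cases for other block types omitted ...
            }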

Please note: a KEY_VALUE_SET block does not itself carry the text of the key or the text of the value in the key-value pair.

To find key-value pair information:

Whenever block.getEntityTypes().contains("KEY") is true, check the block's Relationships attribute. It holds the ID of the VALUE block and the IDs of the blocks that contain the key's own text. A sample Relationships object looks like this:

            "Relationships": [
                {
                    "Type": "VALUE",
                    "Ids": [
                        "b3d2c8bf-a705-4808-b497-a51426ff27eb"
                    ]
                },
                {
                    "Type": "CHILD",
                    "Ids": [
                        "63f53ef8-5466-44e1-bb75-a8dfb0387248",
                        "aa341b1b-362a-4f51-91ff-c725badcc2dc"
                    ]
                }
            ]

Here, b3d2c8bf-a705-4808-b497-a51426ff27eb is the ID of the VALUE block. It is also of block type KEY_VALUE_SET and has Relationships information such as:

            "Relationships": [
                {
                    "Type": "CHILD",
                    "Ids": [
                        "aee39cf4-2dbc-4b98-8415-2e5de67913ce",
                        "193a8aa9-9869-4945-8588-c22c60bdc2cd"
                    ]
                }
            ],

These CHILD IDs refer to the WORD blocks that hold the actual text of the key or value. Look up each child block by its ID, read its Text field, and join the words in order.
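The lookup described above can be sketched as follows. This is a minimal illustration and not the official sample: `SimpleBlock` and `Rel` are hypothetical stand-ins that mirror only the Block fields used here (Id, BlockType, EntityTypes, Text, Relationships), so the example runs without the SDK, and the sample data in `sampleBlocks()` is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FormKeyValueSketch {

    // Hypothetical stand-in for a Relationships entry (Type + Ids).
    static final class Rel {
        final String type;
        final List<String> ids;
        Rel(String type, String... ids) {
            this.type = type;
            this.ids = Arrays.asList(ids);
        }
    }

    // Hypothetical stand-in for a Block, reduced to the fields used here.
    static final class SimpleBlock {
        final String id;
        final String blockType;
        final List<String> entityTypes;
        final String text;
        final List<Rel> relationships;
        SimpleBlock(String id, String blockType, String entityType, String text, Rel... rels) {
            this.id = id;
            this.blockType = blockType;
            this.entityTypes = entityType == null
                    ? Collections.<String>emptyList()
                    : Collections.singletonList(entityType);
            this.text = text;
            this.relationships = Arrays.asList(rels);
        }
    }

    // Collects the IDs listed under all relationships of the given type.
    static List<String> relatedIds(SimpleBlock block, String type) {
        List<String> ids = new ArrayList<>();
        for (Rel rel : block.relationships) {
            if (type.equals(rel.type)) {
                ids.addAll(rel.ids);
            }
        }
        return ids;
    }

    // Joins the text of the CHILD WORD blocks of a KEY or VALUE block.
    // Other child types (e.g. SELECTION_ELEMENT for checkboxes) are ignored.
    static String childText(SimpleBlock block, Map<String, SimpleBlock> byId) {
        StringBuilder sb = new StringBuilder();
        for (String id : relatedIds(block, "CHILD")) {
            SimpleBlock child = byId.get(id);
            if (child != null && "WORD".equals(child.blockType)) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(child.text);
            }
        }
        return sb.toString();
    }

    // Maps each KEY block's text to the text of its linked VALUE block(s).
    static Map<String, String> extractKeyValues(List<SimpleBlock> blocks) {
        Map<String, SimpleBlock> byId = new HashMap<>();
        for (SimpleBlock b : blocks) {
            byId.put(b.id, b);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (SimpleBlock b : blocks) {
            if (!"KEY_VALUE_SET".equals(b.blockType) || !b.entityTypes.contains("KEY")) {
                continue;
            }
            StringBuilder value = new StringBuilder();
            for (String valueId : relatedIds(b, "VALUE")) {
                SimpleBlock valueBlock = byId.get(valueId);
                if (valueBlock == null) continue;
                String t = childText(valueBlock, byId);
                if (t.isEmpty()) continue;
                if (value.length() > 0) value.append(' ');
                value.append(t);
            }
            result.put(childText(b, byId), value.toString());
        }
        return result;
    }

    // Made-up data shaped like the Relationships JSON shown above.
    static List<SimpleBlock> sampleBlocks() {
        return Arrays.asList(
            new SimpleBlock("k1", "KEY_VALUE_SET", "KEY", null,
                    new Rel("VALUE", "v1"), new Rel("CHILD", "w1", "w2")),
            new SimpleBlock("v1", "KEY_VALUE_SET", "VALUE", null,
                    new Rel("CHILD", "w3", "w4")),
            new SimpleBlock("w1", "WORD", null, "Full"),
            new SimpleBlock("w2", "WORD", null, "Name:"),
            new SimpleBlock("w3", "WORD", null, "Jane"),
            new SimpleBlock("w4", "WORD", null, "Doe"));
    }

    public static void main(String[] args) {
        System.out.println(extractKeyValues(sampleBlocks())); // prints {Full Name:=Jane Doe}
    }
}
```

To apply this to the real response, the same walk should work on the SDK's Block objects by swapping the field reads for the corresponding getters (in the AWS SDK for Java v1 used in the snippets above: getId(), getBlockType(), getEntityTypes(), getRelationships(), getText(), and getType()/getIds() on each relationship). Building the ID-to-block map first matters because a relationship only stores IDs, not the blocks themselves.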


On a side note: the official documentation's sample Java implementation of the asynchronous API does not really cover extracting information from the result. The sample focuses on processing a multi-page document and is generic across API operations (it uses a switch case), which is why the block information is simply printed instead of being parsed into key-value pairs. That is possibly why it is unclear how to parse the Block objects.

Useful Information about the API Block model

KEY_VALUE_SET - Stores the KEY and VALUE Block objects for linked text that's detected on a document page. Use the EntityTypes field to determine whether a KEY_VALUE_SET object is a KEY Block object or a VALUE Block object.

AWS
Rohan_K
Answered 2 years ago
