Help with Textract using Java 1.x SDK

0

I'm looking to extract form data utilizing textract. I've tested with a PDF in the demo and results are great. Results using the SDK however are far from optimal, actually, completely inaccurate. If I use StartDocumentAnalysisRequest/StartDocumentAnalysisResult (asynchronous), I only get 1 block returned of type PAGE, never KEY_VALUE_SET. If I convert my PDF to an image and use the synchronous route, I do get KEY_VALUE_SET back but results are completely inaccurate.

Does anyone know how I can utilize the asynchronous analysis functionality to retrieve form values as the documentation indicates?

Sample Code below: ` StartDocumentAnalysisRequest req = new StartDocumentAnalysisRequest() .withFeatureTypes(FeatureType.FORMS) .withDocumentLocation(new DocumentLocation() .withS3Object(new com.amazonaws.services.textract.model.S3Object() .withName(objectName) .withBucket(awsHelper.getS3BucketName())));

        StartDocumentAnalysisResult startDocumentAnalysisResult = amazonTextract.startDocumentAnalysis(req);
        String startJobId = startDocumentAnalysisResult.getJobId();

        GetDocumentAnalysisResult documentAnalysisResult = null;

        String jobStatus = "IN_PROGRESS";

        while (jobStatus.equals("IN_PROGRESS")) {
            try {
                TimeUnit.SECONDS.sleep(10);
                GetDocumentAnalysisRequest documentAnalysisRequest = new GetDocumentAnalysisRequest()
                        .withJobId(startJobId)
                        .withMaxResults(1);

                documentAnalysisResult = amazonTextract.getDocumentAnalysis(documentAnalysisRequest);
                jobStatus = documentAnalysisResult.getJobStatus();
            } catch (Exception e) {
                logger.error(e);
            }
        }

        if (!jobStatus.equals("IN_PROGRESS")) {
                List<Block> blocks = documentAnalysisResult.getBlocks();
                logger.error("block list size " + blocks.size());

                Map<String, Map<String, Block>> keyValueBlockMap = new HashMap<>();
                Map<String, Block> keyMap = new HashMap<>();
                Map<String, Block> valueMap = new HashMap<>();
                Map<String, Block> blockMap = new HashMap<>();

                for (Block block : blocks) {
                    logger.error("Block Type:" + block.getBlockType());
                    String blockId = block.getId();
                    blockMap.put(blockId, block);
                    if (block.getBlockType().equals("KEY_VALUE_SET")) {
                        if (block.getEntityTypes().contains("KEY")) {
                            keyMap.put(blockId, block);
                        } else {
                            valueMap.put(blockId, block);
                        }
                    }
                }
                keyValueBlockMap.put("keyMap", keyMap);
                keyValueBlockMap.put("valueMap", valueMap);
                keyValueBlockMap.put("blockMap", blockMap);

                Map<String, String> keyValueRelationShip = getKeyValueRelationShip(keyValueBlockMap);
                for (String key : keyValueRelationShip.keySet()) {
                    logger.error("Key: " + key);
                    logger.error("Value: " + keyValueRelationShip.get(key));
                }
            }`

Synchronous path which results and completely horrible results

AnalyzeDocumentRequest request = new AnalyzeDocumentRequest() .withFeatureTypes(FeatureType.FORMS) .withDocument(new Document(). withS3Object(new com.amazonaws.services.textract.model.S3Object() .withName(objectName) .withBucket(awsHelper.getS3BucketName()))); AnalyzeDocumentResult result = amazonTextract.analyzeDocument(request);

已提问 2 年前338 查看次数
2 回答
0

Hi, it'd be great if you could you provide us with a sample file for which you are facing the issues and your AWS account id and the region you are operating in. You could contact AWS support to share the details. This will help us in identifying the issue.

已回答 2 年前
  • Thanks I've opened a support case and will update here if it ends up being something on my side.

0

Hi, Thank you for using Amazon Textract, and I am sorry to hear that you are facing some issues with our APIs. We are investigating this issue, and I'll update the thread once I have more details.

AWS
已回答 2 年前

您未登录。 登录 发布回答。

一个好的回答可以清楚地解答问题和提供建设性反馈,并能促进提问者的职业发展。

回答问题的准则