Help with Textract using Java 1.x SDK

I'm trying to extract form data with Amazon Textract. I tested a PDF in the demo and the results were great. The results from the SDK, however, are completely inaccurate. With StartDocumentAnalysisRequest/StartDocumentAnalysisResult (asynchronous), I only get one block back, of type PAGE, and never any KEY_VALUE_SET blocks. If I convert my PDF to an image and use the synchronous route, I do get KEY_VALUE_SET blocks back, but the extracted values are still completely inaccurate.

Does anyone know how I can utilize the asynchronous analysis functionality to retrieve form values as the documentation indicates?

Sample code below:

```java
StartDocumentAnalysisRequest req = new StartDocumentAnalysisRequest()
        .withFeatureTypes(FeatureType.FORMS)
        .withDocumentLocation(new DocumentLocation()
                .withS3Object(new com.amazonaws.services.textract.model.S3Object()
                        .withName(objectName)
                        .withBucket(awsHelper.getS3BucketName())));

StartDocumentAnalysisResult startDocumentAnalysisResult = amazonTextract.startDocumentAnalysis(req);
String startJobId = startDocumentAnalysisResult.getJobId();

GetDocumentAnalysisResult documentAnalysisResult = null;

String jobStatus = "IN_PROGRESS";

while (jobStatus.equals("IN_PROGRESS")) {
    try {
        TimeUnit.SECONDS.sleep(10);
        GetDocumentAnalysisRequest documentAnalysisRequest = new GetDocumentAnalysisRequest()
                .withJobId(startJobId)
                .withMaxResults(1);

        documentAnalysisResult = amazonTextract.getDocumentAnalysis(documentAnalysisRequest);
        jobStatus = documentAnalysisResult.getJobStatus();
    } catch (Exception e) {
        logger.error(e);
    }
}

if (!jobStatus.equals("IN_PROGRESS")) {
    List<Block> blocks = documentAnalysisResult.getBlocks();
    logger.error("block list size " + blocks.size());

    Map<String, Map<String, Block>> keyValueBlockMap = new HashMap<>();
    Map<String, Block> keyMap = new HashMap<>();
    Map<String, Block> valueMap = new HashMap<>();
    Map<String, Block> blockMap = new HashMap<>();

    for (Block block : blocks) {
        logger.error("Block Type:" + block.getBlockType());
        String blockId = block.getId();
        blockMap.put(blockId, block);
        if (block.getBlockType().equals("KEY_VALUE_SET")) {
            if (block.getEntityTypes().contains("KEY")) {
                keyMap.put(blockId, block);
            } else {
                valueMap.put(blockId, block);
            }
        }
    }
    keyValueBlockMap.put("keyMap", keyMap);
    keyValueBlockMap.put("valueMap", valueMap);
    keyValueBlockMap.put("blockMap", blockMap);

    Map<String, String> keyValueRelationShip = getKeyValueRelationShip(keyValueBlockMap);
    for (String key : keyValueRelationShip.keySet()) {
        logger.error("Key: " + key);
        logger.error("Value: " + keyValueRelationShip.get(key));
    }
}
```
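
One possible cause of the single PAGE block, offered as a guess rather than a confirmed diagnosis: `withMaxResults(1)` asks GetDocumentAnalysis to return at most one block per response, and the first block returned for a document is normally the PAGE block. The remaining blocks, including any KEY_VALUE_SET blocks, would only arrive on later pages, which are fetched by passing each response's NextToken into the next request. The loop can be sketched as below. `PaginationSketch`, `Page`, and `collectAll` are hypothetical names introduced for illustration, and `Page` is a stand-in for the SDK response so the example runs without the AWS SDK or credentials.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of the AWS SDK.
final class PaginationSketch {

    // Stand-in for one GetDocumentAnalysis response:
    // the blocks on this page plus the token for the next page.
    static final class Page<T> {
        final List<T> items;
        final String nextToken; // null when there are no more pages

        Page(List<T> items, String nextToken) {
            this.items = items;
            this.nextToken = nextToken;
        }
    }

    // Calls fetchPage repeatedly, feeding each page's nextToken into the
    // following call, until a page comes back with a null token.
    static <T> List<T> collectAll(Function<String, Page<T>> fetchPage) {
        List<T> all = new ArrayList<>();
        String token = null; // the first request carries no token
        do {
            Page<T> page = fetchPage.apply(token);
            all.addAll(page.items);
            token = page.nextToken;
        } while (token != null);
        return all;
    }

    // Assumed wiring with the AWS SDK for Java 1.x (untested here),
    // to be run only after the job status is no longer IN_PROGRESS:
    //
    //   List<Block> blocks = PaginationSketch.<Block>collectAll(token -> {
    //       GetDocumentAnalysisResult r = amazonTextract.getDocumentAnalysis(
    //               new GetDocumentAnalysisRequest()
    //                       .withJobId(startJobId)
    //                       .withNextToken(token)); // no withMaxResults(1)
    //       return new Page<>(r.getBlocks(), r.getNextToken());
    //   });
}
```

If this is the cause, dropping `withMaxResults(1)` (or raising it) and collecting every page before building `keyMap` and `valueMap` should make the KEY_VALUE_SET blocks show up. It would not by itself explain the inaccurate values on the synchronous path.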

The synchronous path, which returns KEY_VALUE_SET blocks but with completely inaccurate values:

```java
AnalyzeDocumentRequest request = new AnalyzeDocumentRequest()
        .withFeatureTypes(FeatureType.FORMS)
        .withDocument(new Document()
                .withS3Object(new com.amazonaws.services.textract.model.S3Object()
                        .withName(objectName)
                        .withBucket(awsHelper.getS3BucketName())));
AnalyzeDocumentResult result = amazonTextract.analyzeDocument(request);
```

Asked 2 years ago, viewed 338 times
2 Answers
Hi, it would be great if you could provide us with a sample file that reproduces the issue, along with your AWS account ID and the region you are operating in. You can contact AWS Support to share these details. This will help us identify the issue.

Answered 2 years ago
  • Thanks, I've opened a support case and will update here if it ends up being something on my side.

Hi, thank you for using Amazon Textract. I am sorry to hear that you are having issues with our APIs. We are investigating, and I'll update this thread once I have more details.

AWS
Answered 2 years ago
