Help with Textract using Java 1.x SDK

0

I'm looking to extract form data utilizing textract. I've tested with a PDF in the demo and results are great. Results using the SDK however are far from optimal, actually, completely inaccurate. If I use StartDocumentAnalysisRequest/StartDocumentAnalysisResult (asynchronous), I only get 1 block returned of type PAGE, never KEY_VALUE_SET. If I convert my PDF to an image and use the synchronous route, I do get KEY_VALUE_SET back but results are completely inaccurate.

Does anyone know how I can utilize the asynchronous analysis functionality to retrieve form values as the documentation indicates?

Sample Code below: ` StartDocumentAnalysisRequest req = new StartDocumentAnalysisRequest() .withFeatureTypes(FeatureType.FORMS) .withDocumentLocation(new DocumentLocation() .withS3Object(new com.amazonaws.services.textract.model.S3Object() .withName(objectName) .withBucket(awsHelper.getS3BucketName())));

        StartDocumentAnalysisResult startDocumentAnalysisResult = amazonTextract.startDocumentAnalysis(req);
        String startJobId = startDocumentAnalysisResult.getJobId();

        GetDocumentAnalysisResult documentAnalysisResult = null;

        String jobStatus = "IN_PROGRESS";

        while (jobStatus.equals("IN_PROGRESS")) {
            try {
                TimeUnit.SECONDS.sleep(10);
                GetDocumentAnalysisRequest documentAnalysisRequest = new GetDocumentAnalysisRequest()
                        .withJobId(startJobId)
                        .withMaxResults(1);

                documentAnalysisResult = amazonTextract.getDocumentAnalysis(documentAnalysisRequest);
                jobStatus = documentAnalysisResult.getJobStatus();
            } catch (Exception e) {
                logger.error(e);
            }
        }

        if (!jobStatus.equals("IN_PROGRESS")) {
                List<Block> blocks = documentAnalysisResult.getBlocks();
                logger.error("block list size " + blocks.size());

                Map<String, Map<String, Block>> keyValueBlockMap = new HashMap<>();
                Map<String, Block> keyMap = new HashMap<>();
                Map<String, Block> valueMap = new HashMap<>();
                Map<String, Block> blockMap = new HashMap<>();

                for (Block block : blocks) {
                    logger.error("Block Type:" + block.getBlockType());
                    String blockId = block.getId();
                    blockMap.put(blockId, block);
                    if (block.getBlockType().equals("KEY_VALUE_SET")) {
                        if (block.getEntityTypes().contains("KEY")) {
                            keyMap.put(blockId, block);
                        } else {
                            valueMap.put(blockId, block);
                        }
                    }
                }
                keyValueBlockMap.put("keyMap", keyMap);
                keyValueBlockMap.put("valueMap", valueMap);
                keyValueBlockMap.put("blockMap", blockMap);

                Map<String, String> keyValueRelationShip = getKeyValueRelationShip(keyValueBlockMap);
                for (String key : keyValueRelationShip.keySet()) {
                    logger.error("Key: " + key);
                    logger.error("Value: " + keyValueRelationShip.get(key));
                }
            }`

Synchronous path which results and completely horrible results

AnalyzeDocumentRequest request = new AnalyzeDocumentRequest() .withFeatureTypes(FeatureType.FORMS) .withDocument(new Document(). withS3Object(new com.amazonaws.services.textract.model.S3Object() .withName(objectName) .withBucket(awsHelper.getS3BucketName()))); AnalyzeDocumentResult result = amazonTextract.analyzeDocument(request);

질문됨 2년 전338회 조회
2개 답변
0

Hi, it'd be great if you could you provide us with a sample file for which you are facing the issues and your AWS account id and the region you are operating in. You could contact AWS support to share the details. This will help us in identifying the issue.

답변함 2년 전
  • Thanks I've opened a support case and will update here if it ends up being something on my side.

0

Hi, Thank you for using Amazon Textract, and I am sorry to hear that you are facing some issues with our APIs. We are investigating this issue, and I'll update the thread once I have more details.

AWS
답변함 2년 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠