Help with Textract using Java 1.x SDK


I'm trying to extract form data with Amazon Textract. I tested a PDF in the demo and the results were great. The results from the SDK, however, are far from that; they are completely inaccurate. With StartDocumentAnalysisRequest/StartDocumentAnalysisResult (asynchronous), I get only one block back, of type PAGE, and never a KEY_VALUE_SET. If I convert the PDF to an image and use the synchronous route, I do get KEY_VALUE_SET blocks back, but the results are completely inaccurate.

Does anyone know how I can utilize the asynchronous analysis functionality to retrieve form values as the documentation indicates?

Sample code below:

` StartDocumentAnalysisRequest req = new StartDocumentAnalysisRequest()
                .withFeatureTypes(FeatureType.FORMS)
                .withDocumentLocation(new DocumentLocation()
                        .withS3Object(new com.amazonaws.services.textract.model.S3Object()
                                .withName(objectName)
                                .withBucket(awsHelper.getS3BucketName())));

        StartDocumentAnalysisResult startDocumentAnalysisResult = amazonTextract.startDocumentAnalysis(req);
        String startJobId = startDocumentAnalysisResult.getJobId();

        GetDocumentAnalysisResult documentAnalysisResult = null;

        String jobStatus = "IN_PROGRESS";

        while (jobStatus.equals("IN_PROGRESS")) {
            try {
                TimeUnit.SECONDS.sleep(10);
                GetDocumentAnalysisRequest documentAnalysisRequest = new GetDocumentAnalysisRequest()
                        .withJobId(startJobId)
                        .withMaxResults(1);

                documentAnalysisResult = amazonTextract.getDocumentAnalysis(documentAnalysisRequest);
                jobStatus = documentAnalysisResult.getJobStatus();
            } catch (Exception e) {
                logger.error(e);
            }
        }

        if (!jobStatus.equals("IN_PROGRESS")) {
                List<Block> blocks = documentAnalysisResult.getBlocks();
                logger.error("block list size " + blocks.size());

                Map<String, Map<String, Block>> keyValueBlockMap = new HashMap<>();
                Map<String, Block> keyMap = new HashMap<>();
                Map<String, Block> valueMap = new HashMap<>();
                Map<String, Block> blockMap = new HashMap<>();

                for (Block block : blocks) {
                    logger.error("Block Type:" + block.getBlockType());
                    String blockId = block.getId();
                    blockMap.put(blockId, block);
                    if (block.getBlockType().equals("KEY_VALUE_SET")) {
                        if (block.getEntityTypes().contains("KEY")) {
                            keyMap.put(blockId, block);
                        } else {
                            valueMap.put(blockId, block);
                        }
                    }
                }
                keyValueBlockMap.put("keyMap", keyMap);
                keyValueBlockMap.put("valueMap", valueMap);
                keyValueBlockMap.put("blockMap", blockMap);

                Map<String, String> keyValueRelationShip = getKeyValueRelationShip(keyValueBlockMap);
                for (String key : keyValueRelationShip.keySet()) {
                    logger.error("Key: " + key);
                    logger.error("Value: " + keyValueRelationShip.get(key));
                }
            }`
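
One thing worth checking in the polling loop above: `.withMaxResults(1)` caps each `GetDocumentAnalysis` response at one block, and the loop never follows `getNextToken()`, so only the first page of results is ever read. That matches the single PAGE block reported, though it may not be the only factor. Below is a minimal, SDK-free sketch of the pagination pattern. `Page`, `PageFetcher`, `fakeJob` and `drain` are hypothetical names invented for illustration; `fetch(token)` stands in for a call such as `amazonTextract.getDocumentAnalysis(new GetDocumentAnalysisRequest().withJobId(startJobId).withNextToken(token))`, and block types are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// SDK-free sketch: Page and PageFetcher are hypothetical stand-ins for
// GetDocumentAnalysisResult and amazonTextract.getDocumentAnalysis(...).
final class PaginationSketch {

    /** One response page: some blocks plus an optional continuation token. */
    static final class Page {
        final List<String> blockTypes;
        final String nextToken; // null when there are no more pages

        Page(List<String> blockTypes, String nextToken) {
            this.blockTypes = blockTypes;
            this.nextToken = nextToken;
        }
    }

    /** Fetches the page identified by token (null means the first page). */
    interface PageFetcher {
        Page fetch(String token);
    }

    /** Follows nextToken until it is null, accumulating every block. */
    static List<String> drain(PageFetcher fetcher) {
        List<String> all = new ArrayList<>();
        String token = null;
        do {
            Page page = fetcher.fetch(token);
            all.addAll(page.blockTypes);
            token = page.nextToken;
        } while (token != null);
        return all;
    }

    /** In-memory fetcher that serves a fixed list pageSize items at a time. */
    static PageFetcher fakeJob(List<String> blocks, int pageSize) {
        return token -> {
            int start = (token == null) ? 0 : Integer.parseInt(token);
            int end = Math.min(start + pageSize, blocks.size());
            String next = (end < blocks.size()) ? String.valueOf(end) : null;
            return new Page(new ArrayList<>(blocks.subList(start, end)), next);
        };
    }

    public static void main(String[] args) {
        List<String> blocks = Arrays.asList(
                "PAGE", "LINE", "WORD", "KEY_VALUE_SET", "KEY_VALUE_SET");
        PageFetcher job = fakeJob(blocks, 1); // one block per page, like MaxResults(1)
        System.out.println("first page only: " + job.fetch(null).blockTypes);
        System.out.println("all pages:       " + drain(job));
    }
}
```

In the real loop, this would mean polling until the job status is no longer `IN_PROGRESS`, then calling `getDocumentAnalysis` repeatedly with the previous response's token and collecting `getBlocks()` from every page before building the key and value maps. Dropping `.withMaxResults(1)`, or raising it, reduces the number of round trips.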

Synchronous path, which produces very poor results:

        AnalyzeDocumentRequest request = new AnalyzeDocumentRequest()
                .withFeatureTypes(FeatureType.FORMS)
                .withDocument(new Document()
                        .withS3Object(new com.amazonaws.services.textract.model.S3Object()
                                .withName(objectName)
                                .withBucket(awsHelper.getS3BucketName())));
        AnalyzeDocumentResult result = amazonTextract.analyzeDocument(request);

asked 2 years ago · 338 views
2 answers

Hi, it would be great if you could provide us with a sample file that shows the issue, along with your AWS account ID and the region you are operating in. You can contact AWS Support to share these details. This will help us identify the issue.

answered 2 years ago
  • Thanks, I've opened a support case and will update here if it ends up being something on my side.


Hi, thank you for using Amazon Textract. I am sorry to hear that you are facing issues with our APIs. We are investigating, and I'll update the thread once I have more details.

AWS
answered 2 years ago
