UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format


Hi, I have a multi-page PDF document that I can process fine in the Amazon Textract web interface, which extracts its key-value pairs. However, when I try to extract the key-value pairs in my Python code, it returns the error below:

UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

Below is my code:

    response = textract.analyze_document(
        Document={
            "S3Object": {
                "Bucket": bucketname,
                "Name": filename,
            }
        },
        FeatureTypes=["FORMS"],
        HumanLoopConfig={
            "HumanLoopName": uuid.uuid4().hex,
            "FlowDefinitionArn": FLOW_ARN,
            "DataAttributes": {
                "ContentClassifiers": [
                    "FreeOfPersonallyIdentifiableInformation",
                    "FreeOfAdultContent",
                ]
            },
        },
    )
    print(json.dumps(response))

    return {
        "statusCode": 200,
        "body": json.dumps("Document processed successfully!"),
    }

return {"statusCode": 500, "body": json.dumps("Issue processing file!")}

I thought the problem might be that my PDF document is multi-page, so I tried to read the PDF page by page and modified my code as follows:

    # Start document text detection
    response = textract.start_document_text_detection(
        DocumentLocation={
            "S3Object": {
                "Bucket": bucketname,
                "Name": filename,
            }
        },
        ClientRequestToken=str(uuid.uuid4())  # Generate a unique client request token
    )
    
    # Retrieve the job ID from the response
    job_id = response["JobId"]
    
    # Poll for the completion of the job
    while True:
        job_status = textract.get_document_text_detection(JobId=job_id)['JobStatus']
        if job_status in ['SUCCEEDED', 'FAILED']:
            break
        time.sleep(5)  # Wait for 5 seconds before checking again
    
    # Get the results of the detection
    response = textract.get_document_text_detection(JobId=job_id)
    
    # Process each page of the document
    for page_result in response['Blocks']:
        if page_result['BlockType'] == 'PAGE':
            page_number = page_result['Page']
            response = textract.analyze_document(
                Document={
                    "S3Object": {
                        "Bucket": bucketname,
                        "Name": filename,
                    }
                },
                FeatureTypes=["FORMS"],
                HumanLoopConfig={
                    "HumanLoopName": uuid.uuid4().hex,
                    "FlowDefinitionArn": FLOW_ARN,
                    "DataAttributes": {
                        "ContentClassifiers": [
                            "FreeOfAdultContent",
                        ]
                    },
                },
            )
            print(json.dumps(response))

    return {
        "statusCode": 200,
        "body": json.dumps("Document processed successfully!"),
    }

return {"statusCode": 500, "body": json.dumps("Issue processing file!")}

However, I am still getting the same UnsupportedDocumentException error.

Any help or pointers would be appreciated.

Thanks
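
One possible direction, offered as a sketch rather than a confirmed fix: as far as I understand the Textract documentation, the synchronous AnalyzeDocument operation accepts only single-page documents, while multi-page PDFs go through the asynchronous StartDocumentAnalysis and GetDocumentAnalysis operations. Note also that the second snippet still passes the whole file to analyze_document on every loop iteration, so it never sends a single page. The helper name `analyze_multipage_forms` and its parameters are hypothetical. HumanLoopConfig is left out because, to my knowledge, StartDocumentAnalysis does not accept it.

```python
import time


def analyze_multipage_forms(textract, bucket, key, poll_seconds=5):
    """Run an asynchronous FORMS analysis and return all result blocks.

    `textract` is assumed to be a boto3 Textract client, for example
    boto3.client("textract"). This helper is illustrative only.
    """
    # Start the asynchronous job; this call accepts multi-page PDFs in S3.
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS"],
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        result = textract.get_document_analysis(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "PARTIAL_SUCCESS", "FAILED"):
            break
        time.sleep(poll_seconds)

    if result["JobStatus"] == "FAILED":
        raise RuntimeError(result.get("StatusMessage", "Textract job failed"))

    # Results are paginated; follow NextToken until it is absent.
    blocks = list(result["Blocks"])
    next_token = result.get("NextToken")
    while next_token:
        result = textract.get_document_analysis(JobId=job_id, NextToken=next_token)
        blocks.extend(result["Blocks"])
        next_token = result.get("NextToken")
    return blocks
```

With a real client this would be called as `analyze_multipage_forms(textract, bucketname, filename)`. The key-value pairs are then in the returned blocks whose `BlockType` is `KEY_VALUE_SET`. In a Lambda function, polling inside the handler may run into the function timeout, so an SNS notification on job completion could be a better fit.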
