UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

Hi, I have a multi-page PDF document that I can process without problems in the Amazon Textract web interface, including key-value pair extraction. However, when I try to extract the key-value pairs from my Python code, it returns the following error:

UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

Here is my code:

    response = textract.analyze_document(
        Document={
            "S3Object": {
                "Bucket": bucketname,
                "Name": filename,
            }
        },
        FeatureTypes=["FORMS"],
        HumanLoopConfig={
            "HumanLoopName": uuid.uuid4().hex,
            "FlowDefinitionArn": FLOW_ARN,
            "DataAttributes": {
                "ContentClassifiers": [
                    "FreeOfPersonallyIdentifiableInformation",
                    "FreeOfAdultContent",
                ]
            },
        },
    )
    print(json.dumps(response))

    return {
        "statusCode": 200,
        "body": json.dumps("Document processed successfully!"),
    }

    return {"statusCode": 500, "body": json.dumps("Issue processing file!")}

I thought the multi-page PDF might be the reason Textract cannot read the document, so I tried to process the PDF page by page and modified my code as follows:

    # Start document text detection
    response = textract.start_document_text_detection(
        DocumentLocation={
            "S3Object": {
                "Bucket": bucketname,
                "Name": filename,
            }
        },
        ClientRequestToken=str(uuid.uuid4())  # Generate a unique client request token
    )
    
    # Retrieve the job ID from the response
    job_id = response["JobId"]
    
    # Poll for the completion of the job
    while True:
        job_status = textract.get_document_text_detection(JobId=job_id)['JobStatus']
        if job_status in ['SUCCEEDED', 'FAILED']:
            break
        time.sleep(5)  # Wait for 5 seconds before checking again
    
    # Get the results of the detection
    response = textract.get_document_text_detection(JobId=job_id)
    
    # Process each page of the document
    for page_result in response['Blocks']:
        if page_result['BlockType'] == 'PAGE':
            page_number = page_result['Page']
            response = textract.analyze_document(
                Document={
                    "S3Object": {
                        "Bucket": bucketname,
                        "Name": filename,
                    }
                },
                FeatureTypes=["FORMS"],
                HumanLoopConfig={
                    "HumanLoopName": uuid.uuid4().hex,
                    "FlowDefinitionArn": FLOW_ARN,
                    "DataAttributes": {
                        "ContentClassifiers": [
                            "FreeOfAdultContent",
                        ]
                    },
                },
            )
            print(json.dumps(response))

    return {
        "statusCode": 200,
        "body": json.dumps("Document processed successfully!"),
    }

    return {"statusCode": 500, "body": json.dumps("Issue processing file!")}

However, I am still getting the same UnsupportedDocumentException error.
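
For reference, here is a minimal sketch of the asynchronous route, assuming the cause is that the synchronous AnalyzeDocument call only accepts single-page PDF or TIFF input and that multi-page files have to go through StartDocumentAnalysis and GetDocumentAnalysis. The function name `analyze_multipage_forms` and its parameters are hypothetical, and the `textract` argument is assumed to be a `boto3.client("textract")` instance. As far as I can tell, the asynchronous call does not take a `HumanLoopConfig`, so the human review loop is left out here.

```python
import time


def analyze_multipage_forms(textract, bucket, key, poll_seconds=5, max_polls=120):
    """Run asynchronous FORMS analysis on an S3 document and return all blocks.

    `textract` is assumed to be a boto3 Textract client (hypothetical wiring).
    """
    # Start the asynchronous job; this is the multi-page counterpart of analyze_document.
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS"],
    )
    job_id = job["JobId"]

    # Poll until the job leaves IN_PROGRESS, giving up after max_polls attempts.
    for _ in range(max_polls):
        result = textract.get_document_analysis(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    # Surface failures instead of silently reading an empty result.
    if result["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        message = result.get("StatusMessage", "")
        raise RuntimeError(f"Textract job {job_id} ended as {result['JobStatus']}: {message}")

    # Results are paginated; follow NextToken until every batch of blocks is read.
    blocks = list(result.get("Blocks", []))
    next_token = result.get("NextToken")
    while next_token:
        result = textract.get_document_analysis(JobId=job_id, NextToken=next_token)
        blocks.extend(result.get("Blocks", []))
        next_token = result.get("NextToken")
    return blocks
```

In a Lambda handler this would be called as `analyze_multipage_forms(textract, bucketname, filename)`. Polling inside a Lambda is only a sketch choice; for long documents an SNS notification channel is probably the more robust option.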

Any help or pointers would be appreciated.
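
To make the goal concrete, this is roughly how the key-value pairs could be read out of a FORMS response once the analysis call succeeds. It is a sketch that assumes `blocks` is the list of Block dicts from a Textract analysis response; the helper name `extract_key_values` is hypothetical, and rendering a selected checkbox as "X" is just an illustrative choice.

```python
def extract_key_values(blocks):
    """Map form keys to values from a list of Textract Block dicts."""
    by_id = {block["Id"]: block for block in blocks}

    def related_ids(block, rel_type):
        # Collect the IDs listed under one relationship type (CHILD or VALUE).
        ids = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == rel_type:
                ids.extend(rel["Ids"])
        return ids

    def text_of(block):
        # Join WORD children; render a selected checkbox as "X".
        parts = []
        for child_id in related_ids(block, "CHILD"):
            child = by_id.get(child_id, {})
            if child.get("BlockType") == "WORD":
                parts.append(child["Text"])
            elif (
                child.get("BlockType") == "SELECTION_ELEMENT"
                and child.get("SelectionStatus") == "SELECTED"
            ):
                parts.append("X")
        return " ".join(parts)

    pairs = {}
    for block in blocks:
        # KEY blocks point at their VALUE blocks through a VALUE relationship.
        if block.get("BlockType") == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            values = [
                text_of(by_id[value_id])
                for value_id in related_ids(block, "VALUE")
                if value_id in by_id
            ]
            pairs[text_of(block)] = " ".join(values)
    return pairs
```

Duplicate key texts on different pages overwrite each other in this dict; grouping by each block's `Page` field would avoid that if it matters.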

Thanks

sgral
asked 2 months ago, 140 views
No Answers
