UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

Hi, I have a multi-page PDF document that I can process without problems in the Amazon Textract web console, which extracts its key-value pairs. However, when I try to extract the key-value pairs from my Python code, I get the following error:

UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

Below is my code:

    response = textract.analyze_document(
        Document={
            "S3Object": {
                "Bucket": bucketname,
                "Name": filename,
            }
        },
        FeatureTypes=["FORMS"],
        HumanLoopConfig={
            "HumanLoopName": uuid.uuid4().hex,
            "FlowDefinitionArn": FLOW_ARN,
            "DataAttributes": {
                "ContentClassifiers": [
                    "FreeOfPersonallyIdentifiableInformation",
                    "FreeOfAdultContent",
                ]
            },
        },
    )
    print(json.dumps(response))

    return {
        "statusCode": 200,
        "body": json.dumps("Document processed successfully!"),
    }

return {"statusCode": 500, "body": json.dumps("Issue processing file!")}

I thought the problem might be that my PDF is multi-page, so I tried to read the PDF page by page and changed my code to the following:

    # Start document text detection
    response = textract.start_document_text_detection(
        DocumentLocation={
            "S3Object": {
                "Bucket": bucketname,
                "Name": filename,
            }
        },
        ClientRequestToken=str(uuid.uuid4())  # Generate a unique client request token
    )
    
    # Retrieve the job ID from the response
    job_id = response["JobId"]
    
    # Poll for the completion of the job
    while True:
        job_status = textract.get_document_text_detection(JobId=job_id)['JobStatus']
        if job_status in ['SUCCEEDED', 'FAILED']:
            break
        time.sleep(5)  # Wait for 5 seconds before checking again
    
    # Get the results of the detection
    response = textract.get_document_text_detection(JobId=job_id)
    
    # Process each page of the document
    for page_result in response['Blocks']:
        if page_result['BlockType'] == 'PAGE':
            page_number = page_result['Page']
            response = textract.analyze_document(
                Document={
                    "S3Object": {
                        "Bucket": bucketname,
                        "Name": filename,
                    }
                },
                FeatureTypes=["FORMS"],
                HumanLoopConfig={
                    "HumanLoopName": uuid.uuid4().hex,
                    "FlowDefinitionArn": FLOW_ARN,
                    "DataAttributes": {
                        "ContentClassifiers": [
                            "FreeOfAdultContent",
                        ]
                    },
                },
            )
            print(json.dumps(response))

    return {
        "statusCode": 200,
        "body": json.dumps("Document processed successfully!"),
    }

return {"statusCode": 500, "body": json.dumps("Issue processing file!")}

However, I am still getting the same UnsupportedDocumentException error.
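
A possible direction, offered as a sketch and not a confirmed fix: in both versions above, `analyze_document` is still called on the whole multi-page file, and the synchronous Textract operations may only accept single-page documents. The asynchronous pair `start_document_analysis` / `get_document_analysis` takes `FeatureTypes=["FORMS"]` and returns paginated results via `NextToken`. The helper name `analyze_multipage_forms` and its parameters are made up for illustration; the client is passed in so the function can be tried without AWS access. To my knowledge `start_document_analysis` does not take a `HumanLoopConfig`, so the human review step is left out here and would have to be handled separately.

```python
import time


def analyze_multipage_forms(textract, bucket, key, poll_seconds=5, sleep=time.sleep):
    """Run an asynchronous FORMS analysis and return the Blocks from all result pages.

    `textract` is assumed to be a boto3 Textract client,
    e.g. boto3.client("textract"). The function name is hypothetical.
    """
    # Start the asynchronous job; it takes DocumentLocation, not Document.
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS"],
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        result = textract.get_document_analysis(JobId=job_id)
        status = result["JobStatus"]
        if status != "IN_PROGRESS":
            break
        sleep(poll_seconds)

    if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")

    # Results are paginated: follow NextToken until it is no longer returned.
    blocks = list(result.get("Blocks", []))
    next_token = result.get("NextToken")
    while next_token:
        result = textract.get_document_analysis(JobId=job_id, NextToken=next_token)
        blocks.extend(result.get("Blocks", []))
        next_token = result.get("NextToken")
    return blocks
```

With a real client this might be called as `analyze_multipage_forms(boto3.client("textract"), bucketname, filename)`. The form fields are then the returned blocks whose `BlockType` is `KEY_VALUE_SET`.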

Any help or pointers would be appreciated.

Thanks

sgral
asked 2 months ago, 144 views
No answers
