UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

Hi, I have a multi-page PDF document that I can process fine in the Amazon Textract web interface, including extracting key-value pairs. However, when I try to extract key-value pairs from my Python code, the call fails with the error shown in the title.

Here is my code:

    response = textract.analyze_document(
        Document={
            "S3Object": {
                "Bucket": bucketname,
                "Name": filename,
            }
        },
        FeatureTypes=["FORMS"],
        HumanLoopConfig={
            "HumanLoopName": uuid.uuid4().hex,
            "FlowDefinitionArn": FLOW_ARN,
            "DataAttributes": {
                "ContentClassifiers": [
                    "FreeOfPersonallyIdentifiableInformation",
                    "FreeOfAdultContent",
                ]
            },
        },
    )
    print(json.dumps(response))

    return {
        "statusCode": 200,
        "body": json.dumps("Document processed successfully!"),
    }

return {"statusCode": 500, "body": json.dumps("Issue processing file!")}

I suspected that Textract could not read the document because it is multi-page, so I tried to process the PDF page by page and changed my code to the following:

    # Start document text detection
    response = textract.start_document_text_detection(
        DocumentLocation={
            "S3Object": {
                "Bucket": bucketname,
                "Name": filename,
            }
        },
        ClientRequestToken=str(uuid.uuid4())  # Generate a unique client request token
    )
    
    # Retrieve the job ID from the response
    job_id = response["JobId"]
    
    # Poll for the completion of the job
    while True:
        job_status = textract.get_document_text_detection(JobId=job_id)['JobStatus']
        if job_status in ['SUCCEEDED', 'FAILED']:
            break
        time.sleep(5)  # Wait for 5 seconds before checking again
    
    # Get the results of the detection
    response = textract.get_document_text_detection(JobId=job_id)
    
    # Process each page of the document
    for page_result in response['Blocks']:
        if page_result['BlockType'] == 'PAGE':
            page_number = page_result['Page']
            response = textract.analyze_document(
                Document={
                    "S3Object": {
                        "Bucket": bucketname,
                        "Name": filename,
                    }
                },
                FeatureTypes=["FORMS"],
                HumanLoopConfig={
                    "HumanLoopName": uuid.uuid4().hex,
                    "FlowDefinitionArn": FLOW_ARN,
                    "DataAttributes": {
                        "ContentClassifiers": [
                            "FreeOfAdultContent",
                        ]
                    },
                },
            )
            print(json.dumps(response))

    return {
        "statusCode": 200,
        "body": json.dumps("Document processed successfully!"),
    }

return {"statusCode": 500, "body": json.dumps("Issue processing file!")}

However, I still get the same UnsupportedDocumentException error.
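
For reference, here is a minimal sketch of the asynchronous route, which might be what a multi-page PDF needs. It assumes the error comes from the synchronous AnalyzeDocument call, which per the Textract documentation accepts only single-page PDFs, while multi-page documents go through StartDocumentAnalysis and GetDocumentAnalysis. The function name `analyze_forms_async` and its `poll_seconds` and `max_polls` parameters are made up for illustration, and `textract` is assumed to be a client created with `boto3.client("textract")`. As far as I can tell, StartDocumentAnalysis does not take a HumanLoopConfig, so the human review step is left out here.

```python
import time
import uuid


def analyze_forms_async(textract, bucket, key, poll_seconds=5, max_polls=120):
    """Run an asynchronous FORMS analysis on an S3 object and return all blocks.

    `textract` is assumed to be a boto3 Textract client; it is passed in
    so the function can also be exercised with a stub.
    """
    # Start the asynchronous job (the multi-page counterpart of analyze_document)
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS"],
        ClientRequestToken=uuid.uuid4().hex,
    )
    job_id = job["JobId"]

    # Poll until the job is no longer IN_PROGRESS, giving up after max_polls tries
    for _ in range(max_polls):
        result = textract.get_document_analysis(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if result["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(
            f"Textract job {job_id} ended with status {result['JobStatus']}: "
            f"{result.get('StatusMessage', 'no status message')}"
        )

    # Results are paginated: follow NextToken until every block is collected
    blocks = list(result.get("Blocks", []))
    next_token = result.get("NextToken")
    while next_token:
        result = textract.get_document_analysis(JobId=job_id, NextToken=next_token)
        blocks.extend(result.get("Blocks", []))
        next_token = result.get("NextToken")
    return blocks
```

With my variables this would be called as `blocks = analyze_forms_async(textract, bucketname, filename)`. The key-value pairs would then be the blocks whose `BlockType` is `KEY_VALUE_SET`.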

Any help or pointers would be appreciated.

Thanks

Topics: Serverless, Compute, Machine Learning & AI

Tags: AWS Lambda, Amazon Textract

Language: English

sgral, asked 2 months ago, 144 views

No answers
