Hi,
I have a multi-page PDF document that I can process without problems in the Amazon Textract web interface, including extracting key-value pairs. However, when I try to extract key-value pairs from my Python code, I get the following error:
UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format
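Since the message complains about the document format, one quick sanity check is to look at the file's leading bytes. This is only a minimal sketch, assuming the object has already been downloaded into memory as bytes. `sniff_format` is a hypothetical helper (not part of boto3), and the signatures cover only the four formats Textract documents as supported (PDF, PNG, JPEG, TIFF):

```python
from typing import Optional

# Leading-byte signatures for the formats Textract lists as supported.
_SIGNATURES = (
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
)


def sniff_format(data: bytes) -> Optional[str]:
    """Return a format name if the header matches a known signature, else None."""
    for magic, name in _SIGNATURES:
        if data.startswith(magic):
            return name
    return None
```

A result of `None` would point at the file itself rather than at the API call. The check says nothing about page count.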
Here is my code:
response = textract.analyze_document(
    Document={
        "S3Object": {
            "Bucket": bucketname,
            "Name": filename,
        }
    },
    FeatureTypes=["FORMS"],
    HumanLoopConfig={
        "HumanLoopName": uuid.uuid4().hex,
        "FlowDefinitionArn": FLOW_ARN,
        "DataAttributes": {
            "ContentClassifiers": [
                "FreeOfPersonallyIdentifiableInformation",
                "FreeOfAdultContent",
            ]
        },
    },
)
print(json.dumps(response))
return {
    "statusCode": 200,
    "body": json.dumps("Document processed successfully!"),
}
return {"statusCode": 500, "body": json.dumps("Issue processing file!")}
I thought the call might be failing because the PDF is multi-page, so I tried to read the PDF page by page and changed my code to the following:
# Start document text detection
response = textract.start_document_text_detection(
    DocumentLocation={
        "S3Object": {
            "Bucket": bucketname,
            "Name": filename,
        }
    },
    ClientRequestToken=str(uuid.uuid4())  # Generate a unique client request token
)
# Retrieve the job ID from the response
job_id = response["JobId"]
# Poll for the completion of the job
while True:
    job_status = textract.get_document_text_detection(JobId=job_id)['JobStatus']
    if job_status in ['SUCCEEDED', 'FAILED']:
        break
    time.sleep(5)  # Wait for 5 seconds before checking again
# Get the results of the detection
response = textract.get_document_text_detection(JobId=job_id)
# Process each page of the document
for page_result in response['Blocks']:
    if page_result['BlockType'] == 'PAGE':
        page_number = page_result['Page']
        response = textract.analyze_document(
            Document={
                "S3Object": {
                    "Bucket": bucketname,
                    "Name": filename,
                }
            },
            FeatureTypes=["FORMS"],
            HumanLoopConfig={
                "HumanLoopName": uuid.uuid4().hex,
                "FlowDefinitionArn": FLOW_ARN,
                "DataAttributes": {
                    "ContentClassifiers": [
                        "FreeOfAdultContent",
                    ]
                },
            },
        )
        print(json.dumps(response))
        return {
            "statusCode": 200,
            "body": json.dumps("Document processed successfully!"),
        }
return {"statusCode": 500, "body": json.dumps("Issue processing file!")}
However, I am still getting the same UnsupportedDocumentException error.
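For reference, the polling and page-listing part of the second attempt can be isolated as below. This is only a rough sketch: `wait_for_job` and `blocks_by_page` are hypothetical helper names, and `client` is assumed to be a boto3 Textract client (or any object exposing the same `get_document_text_detection` method). Unlike my snippet, it also follows `NextToken` and stops polling on any status other than `IN_PROGRESS`. It does not touch the `analyze_document` call that raises the error.

```python
import time
from collections import defaultdict


def wait_for_job(client, job_id, delay=5.0, max_attempts=60):
    """Poll until the job leaves IN_PROGRESS and return the final status string."""
    for _ in range(max_attempts):
        status = client.get_document_text_detection(JobId=job_id)["JobStatus"]
        if status != "IN_PROGRESS":
            return status
        time.sleep(delay)  # Wait before checking again
    raise TimeoutError(f"Job {job_id} still running after {max_attempts} polls")


def blocks_by_page(client, job_id):
    """Collect all blocks of a finished job, following NextToken, grouped by page."""
    pages = defaultdict(list)
    kwargs = {"JobId": job_id}
    while True:
        result = client.get_document_text_detection(**kwargs)
        for block in result.get("Blocks", []):
            # Fall back to page 1 if a block carries no Page field.
            pages[block.get("Page", 1)].append(block)
        token = result.get("NextToken")
        if not token:
            return dict(pages)
        kwargs["NextToken"] = token
```

With these, `sorted(blocks_by_page(textract, job_id))` would give the page numbers found in the job output once `wait_for_job` has returned.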
Any help or pointers would be appreciated.
Thanks