Textract AnalyzeDocument errors with some pdf files (unsupported document format)

0

Hi,

i get : *botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format *

when trying to analyze some pdf files to get tables extracted . My code is executed remotely (from my pc on pycharm) to launch textract on files stored on S3. My program is working with similar pdf with no error:

response = textract.analyze_document(
        Document={
            'S3Object':{
                'Bucket': bucket_name,
                'Name':document_name
            }
        },
        FeatureTypes= ["TABLES"])
    doc = Document(response)

The file is correctly analyzed using the web interface "try textract" (so i guess it's not corrupted)

Thanks in advance for your help.

  • You mention it works with "similar pdf" without an error. Can you validate that the same document works in the AWS Web console? If it works in the console, it should work through API as well, because the console uses the API in the background.

Eus
asked a year ago1276 views
2 Answers
1

ok i think you should add a control between

response = textract.analyze_document(
        Document={
            'S3Object':{
                'Bucket': bucket_name,
                'Name':document_name
            }
        },
        FeatureTypes= ["TABLES"])

and

    doc = Document(response)

in case there is no table extracted from the pdf file

table_blocks = [block for block in response['Blocks'] if block['BlockType'] == 'TABLE']

if not table_blocks:
    print("No tables found in the document.")
else:
    # process table data here

doc = Document(response)

profile picture
EXPERT
answered a year ago
0

Hi there,

For PDFs, you should use start_document_analysis. You can update you code to something similar:

response = textract.start_document_analysis(
        DocumentLocation={
            'S3Object':{
                'Bucket': bucket_name,
                'Name':document_name
            }
        },
        FeatureTypes= ["TABLES"])
    doc = Document(response)
2bz
answered 6 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions