Textract AnalyzeDocument errors with some pdf files (unsupported document format)

0

Hi,

i get : *botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format *

when trying to analyze some pdf files to get tables extracted . My code is executed remotely (from my pc on pycharm) to launch textract on files stored on S3. My program is working with similar pdf with no error:

response = textract.analyze_document(
        Document={
            'S3Object':{
                'Bucket': bucket_name,
                'Name':document_name
            }
        },
        FeatureTypes= ["TABLES"])
    doc = Document(response)

The file is correctly analyzed using the web interface "try textract" (so i guess it's not corrupted)

Thanks in advance for your help.

  • You mention it works with "similar pdf" without an error. Can you validate that the same document works in the AWS Web console? If it works in the console, it should work through API as well, because the console uses the API in the background.

Eus
demandé il y a un an1348 vues
2 réponses
1

ok i think you should add a control between

response = textract.analyze_document(
        Document={
            'S3Object':{
                'Bucket': bucket_name,
                'Name':document_name
            }
        },
        FeatureTypes= ["TABLES"])

and

    doc = Document(response)

in case there is no table extracted from the pdf file

table_blocks = [block for block in response['Blocks'] if block['BlockType'] == 'TABLE']

if not table_blocks:
    print("No tables found in the document.")
else:
    # process table data here

doc = Document(response)

profile picture
EXPERT
répondu il y a un an
0

Hi there,

For PDFs, you should use start_document_analysis. You can update you code to something similar:

response = textract.start_document_analysis(
        DocumentLocation={
            'S3Object':{
                'Bucket': bucket_name,
                'Name':document_name
            }
        },
        FeatureTypes= ["TABLES"])
    doc = Document(response)
2bz
répondu il y a 7 mois

Vous n'êtes pas connecté. Se connecter pour publier une réponse.

Une bonne réponse répond clairement à la question, contient des commentaires constructifs et encourage le développement professionnel de la personne qui pose la question.

Instructions pour répondre aux questions