Textract AnalyzeDocument errors with some pdf files (unsupported document format)

Hi,

I get: *botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format*

when trying to analyze some PDF files to extract tables. My code runs remotely (from my PC, in PyCharm) and launches Textract on files stored in S3. The same program works on similar PDFs without any error:

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': bucket_name,
            'Name': document_name
        }
    },
    FeatureTypes=["TABLES"])
doc = Document(response)

The file is analyzed correctly in the "Try Textract" web interface, so I guess it is not corrupted.

Thanks in advance for your help.

  • You mention it works with "similar PDFs" without an error. Can you validate that the same document works in the AWS web console? If it works in the console, it should work through the API as well, because the console uses the API in the background.

Eus
asked a year ago · 1346 views
2 Answers
OK, I think you should add a check between

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': bucket_name,
            'Name': document_name
        }
    },
    FeatureTypes=["TABLES"])

and

doc = Document(response)

in case no table is extracted from the PDF file:

# Keep only the blocks that Textract classified as tables
table_blocks = [block for block in response['Blocks'] if block['BlockType'] == 'TABLE']

if not table_blocks:
    print("No tables found in the document.")
else:
    # process table data here
    doc = Document(response)
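
The same check can be combined with handling of the exception from the question. This is only a minimal sketch: the helper name `find_table_blocks` is hypothetical, `textract` is assumed to be a boto3 Textract client created by the caller, and returning `None` for a rejected file is an assumption about how the caller wants to treat it.

```python
def find_table_blocks(textract, bucket_name, document_name):
    """Return the TABLE blocks, or None if Textract rejects the document format."""
    try:
        response = textract.analyze_document(
            Document={"S3Object": {"Bucket": bucket_name, "Name": document_name}},
            FeatureTypes=["TABLES"],
        )
    except textract.exceptions.UnsupportedDocumentException:
        # Assumption: a rejected file is reported as None so the caller
        # can log it or route it to another processing path.
        return None
    # An empty list means the call succeeded but no table was detected.
    return [b for b in response["Blocks"] if b["BlockType"] == "TABLE"]
```

A `None` result then means the synchronous call refused the file, while an empty list means the file was accepted but contained no tables.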

EXPERT
answered a year ago
Hi there,

For PDFs, you should use start_document_analysis. You can update your code to something like this:

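response = textract.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': bucket_name,
            'Name': document_name
        }
    },
    FeatureTypes=["TABLES"])
# The call is asynchronous: the response holds only a JobId, not the blocks.
# Fetch the results with get_document_analysis once the job has finished.
job_id = response['JobId']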
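
Because start_document_analysis only starts a job, the blocks have to be fetched separately. Below is a minimal sketch of that flow using simple polling; the helper names `wait_for_analysis` and `analyze_pdf_tables`, the polling interval, and the retry limit are all assumptions, and `textract` is assumed to be a boto3 Textract client created by the caller (for example with `boto3.client('textract')`).

```python
import time


def wait_for_analysis(textract, job_id, poll_seconds=5, max_polls=120):
    """Poll get_document_analysis until the job finishes, then collect every block."""
    for _ in range(max_polls):
        result = textract.get_document_analysis(JobId=job_id)
        status = result["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        # The loop ran out of attempts without the job leaving IN_PROGRESS.
        raise TimeoutError(f"Textract job {job_id} is still in progress")

    if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")

    blocks = list(result.get("Blocks", []))
    # Results are paginated: keep requesting pages while a NextToken is returned.
    while "NextToken" in result:
        result = textract.get_document_analysis(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result.get("Blocks", []))
    return blocks


def analyze_pdf_tables(textract, bucket_name, document_name, poll_seconds=5):
    """Start an asynchronous TABLES analysis on an S3 object and return its blocks."""
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket_name, "Name": document_name}},
        FeatureTypes=["TABLES"],
    )
    return wait_for_analysis(textract, job["JobId"], poll_seconds=poll_seconds)
```

Polling keeps the example short; for long-running or high-volume jobs, the optional NotificationChannel parameter of start_document_analysis (an SNS topic) may be a better fit than a sleep loop.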
2bz
answered 7 months ago
