Textract AnalyzeDocument errors with some PDF files (unsupported document format)

Hi,

I get: *botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format*

when trying to analyze some PDF files to extract tables. My code runs remotely (from my PC, in PyCharm) and launches Textract on files stored in S3. The program works on similar PDFs with no error:

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': bucket_name,
            'Name': document_name
        }
    },
    FeatureTypes=["TABLES"])
doc = Document(response)

The file is analyzed correctly in the "Try Textract" web interface (so I guess it is not corrupted).

Thanks in advance for your help.

  • You mention it works with "similar pdf" without an error. Can you validate that the same document works in the AWS Web console? If it works in the console, it should work through API as well, because the console uses the API in the background.

Eus
asked a year ago · 1353 views
2 Answers

OK, I think you should add a check between

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': bucket_name,
            'Name': document_name
        }
    },
    FeatureTypes=["TABLES"])

and

    doc = Document(response)

in case no table is extracted from the PDF file:

table_blocks = [block for block in response['Blocks'] if block['BlockType'] == 'TABLE']

if not table_blocks:
    print("No tables found in the document.")
else:
    # process table data here
    doc = Document(response)
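
If you want to see what Textract actually returned before handing the response to Document, a quick tally of block types might look like the sketch below. The helper name count_block_types and the hand-written sample_response are made up for illustration; the only assumption is that the response is a dict with a 'Blocks' list whose items carry a 'BlockType' key, as in the snippet above.

```python
from collections import Counter


def count_block_types(response):
    """Return a Counter of the BlockType values in a Textract response dict."""
    # .get() keeps this safe if the response has no 'Blocks' key at all
    return Counter(block["BlockType"] for block in response.get("Blocks", []))


# Hypothetical, hand-written response used only to show the idea
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE"},
        {"BlockType": "LINE"},
    ]
}

counts = count_block_types(sample_response)
if counts["TABLE"] == 0:
    print("No tables found in the document.")
```

A Counter returns 0 for missing keys, so the TABLE check works even when no table block is present.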

EXPERT
answered a year ago

Hi there,

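For PDFs, and multi-page PDFs in particular, you should use start_document_analysis, which runs asynchronously. You can update your code to something like this:

response = textract.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': bucket_name,
            'Name': document_name
        }
    },
    FeatureTypes=["TABLES"])
# The response only contains a job ID, not the blocks themselves;
# fetch the results later with get_document_analysis(JobId=job_id).
job_id = response['JobId']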
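
Since start_document_analysis is asynchronous, its response carries a JobId and the blocks have to be fetched separately with get_document_analysis. Here is a minimal sketch of that polling step. The function name wait_for_analysis, the poll_seconds and max_polls parameters, and the choice to raise on anything other than SUCCEEDED are my own assumptions, not something from the question; textract is expected to be a boto3 Textract client.

```python
import time


def wait_for_analysis(textract, job_id, poll_seconds=5, max_polls=60):
    """Poll get_document_analysis until the job finishes, then return all blocks."""
    for _ in range(max_polls):
        result = textract.get_document_analysis(JobId=job_id)
        status = result["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        # The loop ran out of attempts without the job leaving IN_PROGRESS
        raise TimeoutError(f"Textract job {job_id} is still in progress")

    if status != "SUCCEEDED":
        # Assumption: treat FAILED and PARTIAL_SUCCESS alike as errors
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")

    blocks = list(result["Blocks"])
    # Results are paginated: keep following NextToken until it is absent
    while "NextToken" in result:
        result = textract.get_document_analysis(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result["Blocks"])
    return blocks
```

With a real client this would be called as wait_for_analysis(textract, response['JobId']). For production use, an SNS notification is probably a better fit than polling, but polling is the simplest thing to try from a local script.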
2bz
answered 7 months ago
