Textract AnalyzeDocument errors with some pdf files (unsupported document format)

0

Hi,

I get:

botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

when trying to analyze some PDF files to extract tables. My code runs from my PC (in PyCharm) and calls Textract on files stored in S3. The same program works with similar PDFs without any error:

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': bucket_name,
            'Name': document_name
        }
    },
    FeatureTypes=["TABLES"])
doc = Document(response)

The file is analyzed correctly in the "Try Textract" web interface, so I guess it's not corrupted.

Thanks in advance for your help.

  • You mention it works with "similar PDFs" without an error. Can you confirm that the same document works in the AWS web console? If it works in the console, it should work through the API as well, because the console uses the API in the background.

Eus
Asked a year ago, 1347 views
2 Answers
1

I think you should add a check between the call

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': bucket_name,
            'Name': document_name
        }
    },
    FeatureTypes=["TABLES"])

and

doc = Document(response)

in case no table is extracted from the PDF file:

table_blocks = [block for block in response['Blocks'] if block['BlockType'] == 'TABLE']

if not table_blocks:
    print("No tables found in the document.")
else:
    # process table data here
    doc = Document(response)

Expert
Answered a year ago
0

Hi there,

For PDFs, you should use start_document_analysis. You can update your code to something similar:

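# start_document_analysis is asynchronous: it returns a JobId, not the blocks
response = textract.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': bucket_name,
            'Name': document_name
        }
    },
    FeatureTypes=["TABLES"])
job_id = response['JobId']
# fetch the result with textract.get_document_analysis(JobId=job_id) once the job finishes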
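
Since start_document_analysis only starts a job, the result has to be fetched separately with get_document_analysis. Here is a minimal sketch of that polling loop, assuming `textract` is a boto3 Textract client created elsewhere. The function name `analyze_pdf_tables`, the polling interval, and the timeout are illustrative choices rather than anything from the original post, and this sketch treats any final status other than SUCCEEDED as an error.

```python
import time


def analyze_pdf_tables(textract, bucket_name, document_name,
                       poll_seconds=2.0, timeout_seconds=300.0):
    """Start an asynchronous TABLES analysis and return all result blocks.

    `textract` is assumed to be a boto3 Textract client (hypothetical helper).
    """
    start = textract.start_document_analysis(
        DocumentLocation={
            'S3Object': {'Bucket': bucket_name, 'Name': document_name}
        },
        FeatureTypes=['TABLES'],
    )
    job_id = start['JobId']

    # Poll until the job leaves the IN_PROGRESS state or the timeout expires.
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = textract.get_document_analysis(JobId=job_id)
        status = result['JobStatus']
        if status != 'IN_PROGRESS':
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f'Textract job {job_id} is still running')
        time.sleep(poll_seconds)

    # Assumption of this sketch: anything other than SUCCEEDED is an error.
    if status != 'SUCCEEDED':
        raise RuntimeError(
            f"Textract job {job_id} ended with status {status}: "
            f"{result.get('StatusMessage', 'no message')}"
        )

    # Results are paginated; follow NextToken until it is absent.
    blocks = list(result.get('Blocks', []))
    next_token = result.get('NextToken')
    while next_token:
        result = textract.get_document_analysis(JobId=job_id,
                                                NextToken=next_token)
        blocks.extend(result.get('Blocks', []))
        next_token = result.get('NextToken')
    return blocks
```

The returned list can then be filtered for blocks whose BlockType is 'TABLE'. For production use, an SNS notification channel may be preferable to polling.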
2bz
Answered 7 months ago
