Using Textract with multi-column PDF files

I am using the code below, which I took from the example at https://aws.amazon.com/pt/blogs/machine-learning/automatically-extract-text-and-structured-data-from-documents-with-amazon-textract/. The example only covers a 2-column case. Where the code divides by 2, I can change the divisor by hand if my file has, say, 4 columns, and it works. But how can I detect the number of columns automatically, or otherwise avoid this manual input? In short, I want to use this code on PDF files that have more than 2 columns. How can I do that?

import boto3
# Document
s3BucketName = "amazon-textract-public-content"
documentName = "blogs/two-column-image.jpg"

# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

#print(response)

# Detect columns and print lines
columns = []
lines = []
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        # Horizontal extent of this line (coordinates are ratios of page width)
        bbox = item["Geometry"]["BoundingBox"]
        bbox_left = bbox["Left"]
        bbox_right = bbox["Left"] + bbox["Width"]
        bbox_centre = bbox["Left"] + bbox["Width"] / 2
        column_found = False
        for index, column in enumerate(columns):
            # Midpoint of the column's extent ('left' and 'right' are both absolute positions)
            column_centre = (column['left'] + column['right']) / 2

            if (column['left'] < bbox_centre < column['right']) or (bbox_left < column_centre < bbox_right):
                # Bbox appears inside the column
                lines.append([index, item["Text"]])
                column_found = True
                break
        if not column_found:
            # No existing column matches, so this line starts a new one
            columns.append({'left': bbox_left, 'right': bbox_right})
            lines.append([len(columns) - 1, item["Text"]])

# Stable sort: groups lines by column index and keeps Textract's order within each column
lines.sort(key=lambda x: x[0])
for line in lines:
    print(line[1])
Asked 2 years ago · 1542 views
1 Answer

You might like to try the Amazon Textract Response Parser for this. Note in particular that the JavaScript/TypeScript library's getLineClustersInReadingOrder() implementation is very different from the Python library's getLinesInReadingOrder().

From a very biased (author's) perspective I would argue that the JS library's current heuristic is better. You can see a couple of example images it's tested against in the code repository - and I'd suggest it's well worth trying out if you're able to consume components in JS or TS as well as Python.

But ultimately, all these methods are rule-based heuristics and none is perfect: often what you gain in performance on some use cases, you lose in code maintainability and in weird, counter-intuitive errors on others. At the extreme, many complex layouts (posters or advertisements with very variable text, for example) challenge or break the very idea that there is "one correct reading order" for content on a page.
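To make "rule-based heuristic" concrete, here is a minimal sketch of one that needs no column count as input. It is not the Response Parser's implementation, just an illustration: it merges the horizontal extents of all LINE blocks into bands separated by whitespace gutters, treats each band as a column, and then reads each column top to bottom. The function name `lines_in_reading_order` and the `min_gap` threshold are hypothetical, and the sketch assumes a single page whose columns are separated by clear vertical gutters.

```python
from typing import List


def lines_in_reading_order(blocks: List[dict], min_gap: float = 0.02) -> List[str]:
    """Order Textract LINE blocks column by column, inferring the column count.

    min_gap is a hypothetical tuning knob: the narrowest gutter, as a fraction
    of page width, that still counts as a separator between two columns.
    """
    lines = [b for b in blocks if b.get("BlockType") == "LINE"]

    def extent(block: dict):
        box = block["Geometry"]["BoundingBox"]
        return box["Left"], box["Left"] + box["Width"]

    # Merge horizontal extents, scanning left to right. Extents that overlap
    # or nearly touch belong to the same column; a wider gap starts a new one.
    columns: List[List[float]] = []
    for left, right in sorted(extent(b) for b in lines):
        if columns and left < columns[-1][1] + min_gap:
            columns[-1][1] = max(columns[-1][1], right)
        else:
            columns.append([left, right])

    def column_index(block: dict) -> int:
        left, right = extent(block)
        centre = (left + right) / 2
        for index, (col_left, col_right) in enumerate(columns):
            if col_left <= centre <= col_right:
                return index
        return len(columns) - 1  # defensive fallback, not expected to be reached

    # Read column by column (left to right), top to bottom within each column.
    ordered = sorted(
        lines,
        key=lambda b: (column_index(b), b["Geometry"]["BoundingBox"]["Top"]),
    )
    return [b["Text"] for b in ordered]


def _line(text: str, left: float, top: float, width: float = 0.25) -> dict:
    """Build a synthetic LINE block with only the fields the sketch reads."""
    return {
        "BlockType": "LINE",
        "Text": text,
        "Geometry": {"BoundingBox": {"Left": left, "Top": top,
                                     "Width": width, "Height": 0.02}},
    }


# Three columns, listed row by row as an OCR pass might return them.
sample = [
    _line("A1", 0.05, 0.10), _line("B1", 0.37, 0.10), _line("C1", 0.70, 0.10),
    _line("A2", 0.05, 0.20), _line("B2", 0.37, 0.20), _line("C2", 0.70, 0.20),
]
print(lines_in_reading_order(sample))  # ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']
```

With a real response you would pass `response["Blocks"]` instead of `sample`. Like any heuristic of this kind it has failure modes: a single full-width line, such as a page title, bridges every gutter and collapses the whole page into one column, so such lines would need to be filtered out or handled separately.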

I'd suggest going with the simplest method that works well enough for your actual documents, and also revisiting why you're trying to extract this columnar structure in the first place, in case there are better options:

AWS
Expert
Alex_T
Answered 2 years ago
  • Hi, thanks for your reply! Basically, I need to extract all the text from several PDF files and save it in a structured way. The number of columns varies from page to page, anywhere from 1 to 5, although 2 columns is typical.

  • My big problem with this code is that the number of columns varies, so the division by 2 would have to become /2, /3, /4 or /5 depending on the page.
