AWS Textract issue


Hi AWS, we are working on a project that requires real-time document processing, and we are running into latency issues with AWS Textract on large, multipage PDF files. Even with Textract's asynchronous APIs, processing takes up to 4-5 minutes for a 3 MB file containing complex tables and images.

Our goal is to reduce this processing time to 1-1.5 minutes to improve the user experience. Is there a way to achieve this? Any suggestions are welcome.


asked 23 days ago, 124 views
1 Answer

What about splitting the document into separate pages and processing them in parallel?

Use a tool or library (like PyPDF2 for Python) to split the PDF into individual pages or smaller chunks.

# PyPDF2 3.x API; the older PdfFileReader/PdfFileWriter names no longer work there.
from PyPDF2 import PdfReader, PdfWriter

def split_pdf(input_pdf):
    """Write each page of input_pdf to its own file and yield the file names."""
    reader = PdfReader(input_pdf)
    for page_num, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)

        output_filename = f'page_{page_num}.pdf'
        with open(output_filename, 'wb') as out:
            writer.write(out)
        yield output_filename
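
Splitting into single pages gives the most parallelism but starts one Textract job per page. The "smaller chunks" option mentioned above trades some parallelism for fewer jobs. Here is a minimal sketch of how the page ranges for such chunks could be computed. `chunk_ranges` and the chunk size are hypothetical and are not part of PyPDF2 or Textract.

```python
def chunk_ranges(num_pages, chunk_size):
    """Return (start, stop) page-index pairs, stop exclusive, covering num_pages."""
    if chunk_size < 1:
        raise ValueError('chunk_size must be at least 1')
    return [(start, min(start + chunk_size, num_pages))
            for start in range(0, num_pages, chunk_size)]

print(chunk_ranges(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

Each pair could then drive the page loop in the split function, adding pages `start` through `stop - 1` to one writer before writing the file. The best chunk size depends on the document and on your account's Textract limits, so it is worth measuring rather than assuming.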

Use AWS Lambda or Step Functions to process each split PDF concurrently using Textract's asynchronous API. The same fan-out can be done from a single process with a thread pool:

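import boto3
import concurrent.futures

# Assumption: a bucket you own. The asynchronous Textract operations read
# their input from S3 (DocumentLocation) and do not accept raw bytes.
BUCKET = 'my-textract-input-bucket'  # hypothetical name, replace with your own

s3 = boto3.client('s3')
textract = boto3.client('textract')

def process_pdf(file_name):
    """Upload one split PDF to S3 and start an asynchronous Textract job for it."""
    s3.upload_file(file_name, BUCKET, file_name)
    # For table extraction, start_document_analysis with FeatureTypes=['TABLES']
    # takes the same DocumentLocation argument.
    response = textract.start_document_text_detection(
        DocumentLocation={'S3Object': {'Bucket': BUCKET, 'Name': file_name}}
    )
    return response['JobId']

file_names = list(split_pdf('large_document.pdf'))
with concurrent.futures.ThreadPoolExecutor() as executor:
    future_to_file = {executor.submit(process_pdf, file): file for file in file_names}
    for future in concurrent.futures.as_completed(future_to_file):
        file = future_to_file[future]
        try:
            job_id = future.result()
            print(f'Job ID for {file}: {job_id}')
        except Exception as exc:
            print(f'{file} generated an exception: {exc}')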
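
The snippet above only starts the jobs. The extracted text still has to be fetched once each job finishes. Here is a minimal sketch of that step, assuming simple polling with `get_document_text_detection` (Textract can instead publish a completion notification to an SNS topic, which avoids polling). `wait_for_text` and the polling interval are hypothetical. The client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
import time

def wait_for_text(textract, job_id, poll_seconds=2.0):
    """Poll an asynchronous text-detection job and return its LINE texts in order."""
    # Wait until the job leaves the IN_PROGRESS state.
    while True:
        response = textract.get_document_text_detection(JobId=job_id)
        status = response['JobStatus']
        if status != 'IN_PROGRESS':
            break
        time.sleep(poll_seconds)
    if status not in ('SUCCEEDED', 'PARTIAL_SUCCESS'):
        raise RuntimeError(f'Textract job {job_id} ended with status {status}')

    # Results are paginated; follow NextToken until it is absent.
    lines = []
    while True:
        lines.extend(block['Text'] for block in response.get('Blocks', [])
                     if block['BlockType'] == 'LINE')
        token = response.get('NextToken')
        if not token:
            return lines
        response = textract.get_document_text_detection(JobId=job_id, NextToken=token)
```

Calling this once per job ID and concatenating the results in page order reassembles the document. Because jobs finish in arbitrary order, keep the mapping from file name to job ID so the original page order can be restored.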
EXPERT
answered 23 days ago
