What about splitting the document into separate pages and processing them in parallel?
Use a tool or library (like PyPDF2 for Python) to split the PDF into individual pages or smaller chunks.
# PyPDF2 3.x names; the older PdfFileReader/PdfFileWriter classes were removed.
from PyPDF2 import PdfReader, PdfWriter

def split_pdf(input_pdf):
    """Write each page of input_pdf to its own file and yield the file names."""
    reader = PdfReader(input_pdf)
    for page_num, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)
        output_filename = f'page_{page_num}.pdf'
        with open(output_filename, 'wb') as out:
            writer.write(out)
        yield output_filename
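If one file per page produces too many small jobs, the "smaller chunks" option mentioned above could be sketched as follows. This is only an illustration: `chunk_ranges` and its `chunk_size` parameter are hypothetical names, not part of PyPDF2 or Textract, and the function only computes which page indices belong in each chunk.

```python
def chunk_ranges(total_pages, chunk_size):
    """Return (start, stop) page-index pairs, stop exclusive, covering all pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


# Example: a 10-page document split into chunks of at most 4 pages.
print(chunk_ranges(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

Each pair can then drive the same split loop, adding pages `start` through `stop - 1` to a single writer instead of one page per writer.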
Use AWS Lambda or Step Functions to process each split PDF concurrently using Textract’s asynchronous API, which reads its input from Amazon S3.
import boto3
import concurrent.futures

s3 = boto3.client('s3')
textract = boto3.client('textract')
BUCKET = 'my-textract-input-bucket'  # placeholder: replace with your own bucket

def process_pdf(file_name):
    # The asynchronous API takes an S3 location, not raw bytes, so upload first.
    s3.upload_file(file_name, BUCKET, file_name)
    response = textract.start_document_text_detection(
        DocumentLocation={'S3Object': {'Bucket': BUCKET, 'Name': file_name}}
    )
    return response['JobId']

# Split first, then submit one Textract job per file from a thread pool.
file_names = list(split_pdf('large_document.pdf'))
with concurrent.futures.ThreadPoolExecutor() as executor:
    future_to_file = {executor.submit(process_pdf, name): name for name in file_names}
    for future in concurrent.futures.as_completed(future_to_file):
        name = future_to_file[future]
        try:
            job_id = future.result()
            print(f'Job ID for {name}: {job_id}')
        except Exception as exc:
            print(f'{name} generated an exception: {exc}')
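The code above only starts the jobs and prints their IDs; the text still has to be fetched once each job finishes. A minimal sketch of that step, assuming simple polling is acceptable (an SNS notification is the usual alternative for larger workloads): `wait_for_job` and `collect_lines` are hypothetical helper names, and `client` is expected to be a Textract client such as `boto3.client('textract')`.

```python
import time


def wait_for_job(client, job_id, poll_seconds=5.0, max_attempts=60):
    """Poll until the job leaves IN_PROGRESS, then return its final status."""
    for _ in range(max_attempts):
        response = client.get_document_text_detection(JobId=job_id, MaxResults=1)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            return status  # SUCCEEDED, FAILED or PARTIAL_SUCCESS
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} still in progress after {max_attempts} polls")


def collect_lines(client, job_id):
    """Return the text of every LINE block, following NextToken pagination."""
    lines = []
    kwargs = {"JobId": job_id}
    while True:
        response = client.get_document_text_detection(**kwargs)
        lines.extend(
            block["Text"]
            for block in response.get("Blocks", [])
            if block["BlockType"] == "LINE"
        )
        token = response.get("NextToken")
        if not token:
            return lines
        kwargs["NextToken"] = token
```

For each job ID, call `wait_for_job` first and call `collect_lines` only if the returned status is `SUCCEEDED`. Since every job covers one page, sort the results by page number before joining them back into the full document text.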
Please accept the answer if it was useful.