The amazon-textract-transformer-pipeline sample includes a scalable batch PDF-to-images converter that you could customize for this use case: check out the code in the notebooks/preproc subfolder and its usage in the "Extract clean input images" section of notebook 1.
The current implementation is based on poppler (via pdf2image) and processes batches of documents from Amazon S3 through SageMaker Processing. It's probably not the most efficient approach possible (being Python-based), but it can scale up to bigger instances (via multiprocessing) and out to multiple instances (via data sharding).
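As a rough sketch of how those pieces fit together (the function and argument names here are illustrative, not the sample's actual API): pdf2image renders each PDF page to an image, a process pool parallelizes across documents on one instance, and a simple slicing scheme shards the input list across instances.

```python
import os
from multiprocessing import Pool


def shard(items, num_shards, shard_index):
    """Deterministic data sharding: instance `shard_index` of `num_shards`
    takes every num_shards-th item, so the full list is covered exactly once."""
    return items[shard_index::num_shards]


def pdf_to_page_images(pdf_path, out_dir, dpi=150):
    """Render each page of one PDF to a JPEG file (requires poppler installed)."""
    # Imported lazily so the rest of the module works without pdf2image present:
    from pdf2image import convert_from_path  # thin wrapper around poppler's pdftoppm

    os.makedirs(out_dir, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=dpi, fmt="jpeg")
    out_paths = []
    for i, page in enumerate(pages, start=1):
        path = os.path.join(out_dir, f"page-{i:04d}.jpg")
        page.save(path, "JPEG")
        out_paths.append(path)
    return out_paths


def convert_batch(pdf_paths, out_root, workers=None):
    """Scale *up* within one instance: convert multiple documents in parallel."""
    args = [
        (p, os.path.join(out_root, os.path.splitext(os.path.basename(p))[0]))
        for p in pdf_paths
    ]
    with Pool(workers or os.cpu_count()) as pool:
        return pool.starmap(pdf_to_page_images, args)
```

In a SageMaker Processing job you'd typically let the ShardedByS3Key distribution do the per-instance split for you, but the `shard` helper shows the same idea for a manually managed list of keys.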
If you needed (near-)real-time processing instead of batch, you could probably get a similar solution running on a containerized Lambda function (poppler requires a lower-level install than pip can provide). In our draft upgrade branch, we instead use SageMaker Asynchronous Inference for this, because request/response payload sizes and memory requirements could in theory be very large for documents with many pages.
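To illustrate the "lower-level install" point: because poppler ships native binaries rather than a pip package, a container-image Lambda would install it through the OS package manager in its Dockerfile. This is an untested sketch (the handler file name and base image tag are assumptions), not the sample's actual setup:

```dockerfile
# Assumed sketch of a Lambda container image with poppler available
FROM public.ecr.aws/lambda/python:3.11

# poppler can't be pip-installed; pull the native binaries from the OS repos
RUN yum install -y poppler-utils && yum clean all

# pdf2image is just a thin Python wrapper around poppler's CLI tools
RUN pip install pdf2image

# Hypothetical handler module for the conversion logic
COPY app.py ${LAMBDA_TASK_ROOT}/
CMD ["app.handler"]
```

The trade-off versus SageMaker Asynchronous Inference is mainly Lambda's payload size and memory/runtime limits, which is why large multi-page documents pushed us toward the async endpoint instead.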