PDF to single JPEGs then Bytearray

Question

I've determined through testing that Textract handles images with a slight skew far better than PDFs (with embedded images) that have the same skew. I'm looking for help converting a multi-page PDF (or many, actually) to an image format and having Textract process the concatenated bytearray. Any ideas on how to tackle this? Is there any sample code out there?

Answer

The [amazon-textract-transformer-pipeline sample](https://github.com/aws-samples/amazon-textract-transformer-pipeline) shows a scalable batch PDF-to-images converter you might customize for this use case. Check out the code in the [notebooks/preproc](https://github.com/aws-samples/amazon-textract-transformer-pipeline/tree/main/notebooks/preproc) subfolder and the usage in the *"Extract clean input images"* section of [notebook 1](https://github.com/aws-samples/amazon-textract-transformer-pipeline/blob/main/notebooks/1.%20Data%20Preparation.ipynb).

The current implementation is based on `poppler` and `pdf2image`, and processes batches of documents (from Amazon S3) through SageMaker Processing. It's probably not the most efficient option possible (it's Python, after all), but it can scale up to bigger instances (via multiprocessing) and out to multiple instances (via data sharding).
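For a rough idea of the core conversion step, here is a minimal sketch (not the sample's actual code) that renders each PDF page to its own JPEG with `pdf2image` and sends the pages to Textract one at a time. As far as I know, the synchronous Textract APIs expect a single image per call, so this sketch assumes one `detect_document_text` request per page instead of one concatenated bytearray. The function names, the `dpi=200` default, and the `input.pdf` path in the usage comment are illustrative assumptions.

```python
import io
from typing import Any, List


def pdf_to_jpeg_pages(pdf_bytes: bytes, dpi: int = 200) -> List[bytes]:
    """Render every page of a PDF to its own JPEG byte string."""
    # Imported lazily: pdf2image needs the poppler binaries installed on the system.
    from pdf2image import convert_from_bytes

    pages = []
    for image in convert_from_bytes(pdf_bytes, dpi=dpi):  # one PIL image per page
        buffer = io.BytesIO()
        image.save(buffer, format="JPEG")
        pages.append(buffer.getvalue())
    return pages


def detect_text_per_page(jpeg_pages: List[bytes], textract_client: Any) -> List[str]:
    """Call Textract once per page image and return the detected lines per page."""
    texts = []
    for page_bytes in jpeg_pages:
        response = textract_client.detect_document_text(
            Document={"Bytes": page_bytes}
        )
        # Keep only LINE blocks; PAGE and WORD blocks are skipped in this sketch.
        lines = [
            block["Text"]
            for block in response.get("Blocks", [])
            if block.get("BlockType") == "LINE"
        ]
        texts.append("\n".join(lines))
    return texts


# Example usage (assumes AWS credentials and a local file named input.pdf):
#
#   import boto3
#   with open("input.pdf", "rb") as f:
#       jpeg_pages = pdf_to_jpeg_pages(f.read())
#   page_texts = detect_text_per_page(jpeg_pages, boto3.client("textract"))
```

Passing the client in as an argument keeps the Textract call easy to stub out in tests, and the per-page loop is the part you would parallelize (or shard across instances) for larger batches.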

If you need **(near)-real-time** processing instead of batch, you could probably get a similar solution running on a containerized Lambda function (poppler requires a lower-level install than pip). In our [draft upgrade branch](https://github.com/aws-samples/amazon-textract-transformer-pipeline/pull/16), we instead use SageMaker Asynchronous Inference for this, because request/response payload sizes and memory requirements could theoretically be very large for documents with many pages.
