Which Textract approach is faster/cheaper/better for 1M+ documents?


I have a pile of 1 million+ TIF files (single- and multi-page) that I need to OCR, search for particular terms, and, by the end of my workflow, have images and text for each individual page. I have built a Step Functions state machine with different Lambda functions that handle different parts of the pipeline. After the state machine has run on all of our raw images, I use Python to collect the results and load pointers to the processed images, text, and JSON into a database.

I'm wondering if it makes more sense to split the pages into separate images before I use Textract to OCR, or if there are efficiencies in cost, speed, rate-limiting or something else that would be gained by not splitting the pages until after I OCR.

I get errors when I try to run detect_document_text() on my multi-page TIFs, but for some reason I'm able to run detect_document_text() on at least some single-page TIFs. So I'm also wondering whether there are efficiency differences between detect_document_text() and start_document_text_detection().

We are a small nonprofit, but we are designing a repeatable workflow for other datasets of this size or larger. At the scale of 1 million+ documents, the costs and time do add up.

2 Answers

Price comparisons can get complex with multi-stage analytic pipelines like the one you mention, but I'd usually suggest going with the Asynchronous APIs (e.g. StartDocumentTextDetection) and multi-page processing.

  • Although Amazon Textract pricing is per page either way, for scalable workloads it's important to consider how you'll orchestrate around the API rate and concurrency quotas.
  • The other key consideration (as it sounds like you've found already) is that the synchronous APIs don't support multi-page documents (as mentioned here).

In a distributed, event-driven architecture (for example, if your state machine is triggered automatically whenever files are uploaded to Amazon S3), retries with backoff would be the usual recommended method for handling throttling... But ultimately, more retries per operation will usually translate into extra runtime, and therefore cost, in services like AWS Lambda (and/or more Step Functions state transitions). Dumping a large batch of documents in S3 to be pushed through Textract via Lambda with poorly-configured retry settings could lead to a lot of unnecessary retry attempts. Pushing multi-page files directly through the async APIs relieves your system of some of this orchestration burden.

There are usually trade-offs between the scalability, run-time, and cost-efficiency of different concurrency management approaches: for example, a single-threaded Python script on a local laptop could carefully submit documents one at a time, always staying under the maximum TPS limit... But it'd be very slow to process 1M+ docs, and might still get throttled occasionally if there were some other Textract workload in the account that it wasn't aware of!
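That naive single-threaded approach is about as simple as the sketch below (function and argument names are my own, purely for illustration) — and its weakness is visible in the code: the limiter only knows about its own requests, not the account-wide quota.

```python
import time


def rate_limited(items, max_per_second):
    """Yield items no faster than max_per_second. A naive client-side
    limiter: other workloads sharing the account quota can still
    cause throttling that this loop never sees."""
    interval = 1.0 / max_per_second
    last = 0.0
    for item in items:
        wait = interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        yield item


# Usage sketch (submit_to_textract is a hypothetical helper):
# for key in rate_limited(all_s3_keys, max_per_second=1):
#     submit_to_textract(key)
```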

In the (1) textract-serverless-large-scale sample, SQS is used to limit the rate at which uploaded documents get submitted to Textract. SNS-based completion notification is also used (instead of e.g. polling the GetDocumentTextDetection API) to minimise pressure on quotas on the result side.

In the (2) textract-transformer-pipeline sample, we take an alternative Step Functions-based approach using DynamoDB as a concurrency limiter. It eliminates the SQS polling aspect of serverless-large-scale, but as a result can suffer from large DDB UpdateItem retry spikes when a large number of docs are uploaded all at once. This is why the utility function used for batch processing in notebook 1 applies some additional smoothing to the initiation of requests to the Textract state machine.

I'd probably expect approach (1) to be the better starting point from a cost-optimization perspective. The bigger your simultaneously-arriving batches of data, the more beneficial it will be to pull documents through a concurrency-limited queue rather than trying to handle throttling via retries.

answered 4 months ago

Thank you very much for this very thorough explanation!

answered 4 months ago
