Which Textract approach is faster/cheaper/better for 1M+ documents?


I have a pile of 1 million+ TIF files (single and multi-page) that I need to OCR, search for particular terms, and then, by the end of my workflow, have images and text for each individual page. I have built a Step Functions state machine with different Lambda functions that handle different parts of the pipeline. After the state machine has run on all of our raw images, I use Python to collect the results and load pointers to the processed images, text, and JSON into a database.

I'm wondering if it makes more sense to split the pages into separate images before I use Textract to OCR, or if there are efficiencies in cost, speed, rate-limiting, or something else to be gained by not splitting the pages until after I OCR.

I get errors when I try to run detect_document_text() on my multi-page TIFs, but for some reason I am able to run it on at least some single-page TIFs. So I'm also wondering whether there are efficiency differences between detect_document_text() and start_document_text_detection().
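For reference, here's roughly what I'm running; the bucket and file names are placeholders:

```python
import boto3

textract = boto3.client("textract")

# Synchronous call - works on some single-page TIFs, errors on multi-page ones:
with open("page.tif", "rb") as f:
    sync_resp = textract.detect_document_text(Document={"Bytes": f.read()})

# Asynchronous call - accepts multi-page documents stored in S3:
async_resp = textract.start_document_text_detection(
    DocumentLocation={
        "S3Object": {"Bucket": "my-bucket", "Name": "docs/multipage.tif"}
    }
)
job_id = async_resp["JobId"]  # used later to fetch results via GetDocumentTextDetection
```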

We are a small nonprofit, but we are designing a repeatable workflow for other datasets of this size or larger. At the scale of 1 million+ documents, the costs and time do add up.

Asked 2 years ago · Viewed 1,332 times
2 Answers

Price comparisons can get complex with multi-stage analytics pipelines like the one you mention, but I'd usually suggest going with the asynchronous APIs (e.g. StartDocumentTextDetection) and multi-page processing.

  • Although Amazon Textract pricing is per page either way, for scalable workloads it's important to consider how you'll orchestrate around the rate and concurrency quotas.
  • The other key consideration (as it sounds like you've found already) is that the synchronous APIs don't support multi-page documents (as mentioned here).

In a distributed, event-driven architecture (for example, if your state machine is triggered automatically whenever files are uploaded to Amazon S3), retries with backoff are the usual recommended method for handling throttling... But ultimately, more retries per operation usually translates into more runtime, and therefore cost, in services like AWS Lambda (and/or more Step Functions state transitions). Dumping a large batch of documents in S3 to be pushed through Textract via Lambda with poorly-configured retry settings could lead to unnecessary retry attempts. Pushing multi-page files directly through the async APIs relieves your system of some of this orchestration burden.
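As a minimal sketch, boto3's built-in retry behaviour for the Textract client can be tuned like this (the exact max_attempts value is something you'd tune for your workload):

```python
import boto3
from botocore.config import Config

# "adaptive" retry mode adds client-side rate limiting on top of
# exponential backoff; max_attempts caps wasted retries (and Lambda runtime).
textract = boto3.client(
    "textract",
    config=Config(retries={"max_attempts": 5, "mode": "adaptive"}),
)
```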

There are usually trade-offs between the scalability, run-time, and cost-efficiency of different concurrency-management approaches: for example, a single-threaded Python script on a local laptop could carefully submit documents one at a time, always staying under the maximum TPS limit... But it would be very slow to process 1M+ docs, and might still get throttled occasionally if there were some other Textract workload in the account that it wasn't aware of!
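To illustrate that naive approach (not a recommendation at 1M+ scale, for the reasons above), a rate-limited submission loop might look like the following; the TPS value is a hypothetical placeholder for your account's actual quota:

```python
import time
import boto3

textract = boto3.client("textract")
MAX_TPS = 2  # hypothetical StartDocumentTextDetection quota for the account

def submit_all(bucket, keys):
    """Submit documents one at a time, pausing to stay under MAX_TPS."""
    job_ids = []
    for key in keys:
        resp = textract.start_document_text_detection(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        job_ids.append(resp["JobId"])
        time.sleep(1.0 / MAX_TPS)  # crude client-side rate limiting
    return job_ids
```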

In the (1) textract-serverless-large-scale sample, SQS is used to limit the rate at which uploaded documents are submitted to Textract. SNS-based completion notification is also used (instead of, e.g., polling the GetDocumentTextDetection API) to minimise pressure on quotas on the result side.
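Requesting SNS-based completion notification is just an extra parameter on the job submission; the topic and role ARNs below are hypothetical placeholders for resources the sample provisions:

```python
# Textract publishes a message to the SNS topic when the job finishes,
# so nothing needs to poll GetDocumentTextDetection for status.
resp = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "docs/file.tif"}},
    NotificationChannel={
        "SNSTopicArn": "arn:aws:sns:us-east-1:123456789012:textract-done",
        "RoleArn": "arn:aws:iam::123456789012:role/TextractSNSPublishRole",
    },
)
```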

In the (2) textract-transformer-pipeline sample, we take an alternative Step Functions-based approach using DynamoDB as a concurrency limiter. It eliminates the SQS polling aspect of serverless-large-scale, but as a result it can suffer from large DDB UpdateItem retry spikes when a large number of docs are uploaded all at once. This is why the utility function used for batch processing in notebook 1 applies some additional smoothing to the initiation of requests to the Textract state machine.
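As a sketch of the general idea (the sample's actual implementation differs in detail), a DynamoDB item can act as an atomic in-flight counter via a conditional update; the table name and limit here are hypothetical:

```python
import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb")
TABLE = "TextractConcurrency"  # hypothetical table holding one counter item
LIMIT = 10                     # hypothetical max in-flight Textract jobs

def try_acquire_slot():
    """Atomically increment the in-flight counter, failing if at the limit."""
    try:
        ddb.update_item(
            TableName=TABLE,
            Key={"pk": {"S": "textract"}},
            UpdateExpression="ADD in_flight :one",
            ConditionExpression=(
                "attribute_not_exists(in_flight) OR in_flight < :limit"
            ),
            ExpressionAttributeValues={
                ":one": {"N": "1"},
                ":limit": {"N": str(LIMIT)},
            },
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # at capacity - caller should back off and retry
        raise
```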

I'd probably expect approach (1) to be the better starting point from a cost-optimization perspective. The bigger your simultaneously-arriving batches of data, the more beneficial it will be to pull documents through some concurrency-limited queue rather than trying to handle the spikes via retries.

AWS
EXPERT
Alex_T
answered 2 years ago

Thank you very much for this very thorough explanation!

answered 2 years ago
