Price comparisons can get complex with multi-stage analytic pipelines like the one you mention, but I'd usually suggest going with the asynchronous APIs (e.g. StartDocumentTextDetection) and multi-page processing:
- Although Amazon Textract pricing is per page regardless of which API you use, for scalable workloads it's important to consider how you'll orchestrate around the rate and concurrency quotas.
- The other key consideration (as it sounds like you've found already) is that the synchronous APIs don't support multi-page documents (as mentioned here).
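As a rough illustration of submitting a multi-page file through the async API, here is a minimal sketch using boto3. The helper names (`build_start_request`, `start_job`) and the bucket, key, and ARN values are hypothetical; only `start_document_text_detection` and its `DocumentLocation`, `NotificationChannel`, and `JobTag` parameters come from the Textract API.

```python
def build_start_request(bucket, key, sns_topic_arn=None, role_arn=None, job_tag=None):
    """Build keyword arguments for Textract's StartDocumentTextDetection.

    Hypothetical helper: kept separate from the API call so the request
    shape can be inspected without AWS credentials.
    """
    request = {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
    }
    # Optional: ask Textract to publish a completion message to SNS
    # instead of polling GetDocumentTextDetection.
    if sns_topic_arn and role_arn:
        request["NotificationChannel"] = {
            "SNSTopicArn": sns_topic_arn,
            "RoleArn": role_arn,
        }
    if job_tag:
        request["JobTag"] = job_tag
    return request


def start_job(bucket, key, **kwargs):
    """Submit one (possibly multi-page) document and return its JobId."""
    import boto3  # imported lazily so the builder above works without boto3

    textract = boto3.client("textract")
    response = textract.start_document_text_detection(
        **build_start_request(bucket, key, **kwargs)
    )
    return response["JobId"]
```

One call covers the whole multi-page file, so there is a single request to orchestrate rather than one per page.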
In a distributed and event-driven architecture (for example, if your State Machine is triggered automatically whenever files are uploaded to Amazon S3), retries with backoff would be the usual recommended method for handling throttling. Ultimately, though, more retries per operation will usually translate into longer runtime and therefore higher cost in services like AWS Lambda (and/or more SFn state transitions). Dumping a large batch of documents in S3 to be pushed through Textract via Lambda with poorly configured retry settings could lead to many unnecessary retry attempts. Pushing multi-page files directly through the async APIs relieves your system of some of this orchestration burden.
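For reference, the retry-with-backoff pattern might look something like the sketch below. `ThrottledError` is a stand-in I've made up for whatever throttling exception your client raises, and the default delays are arbitrary assumptions, not recommended values.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error from the service client."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on ThrottledError with capped exponential backoff.

    Each wait is a random fraction ("full jitter") of the current cap, so
    many callers throttled at the same moment do not all retry together.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
```

Every sleep inside a Lambda function is billed runtime, which is the cost effect described above. With boto3 you may not need to hand-roll this at all: `botocore.config.Config(retries={"max_attempts": ..., "mode": "adaptive"})` configures the client's built-in retry behaviour.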
There are usually trade-offs between the scalability, run-time, and cost-efficiency of different concurrency management approaches. For example, a single-threaded Python script on a local laptop could carefully submit documents one at a time and always stay under the maximum TPS limit. But it'd be very slow to process 1M+ docs, and might still get throttled occasionally if there was some other Textract workload in the account that it wasn't aware of!
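To make that single-threaded example concrete, a client-side pacer could be sketched as follows. The `Pacer` class and the rate you pass it are illustrative assumptions; check your account's actual Textract quotas for the real limit.

```python
import time


class Pacer:
    """Space out calls so they never exceed max_per_second (single-threaded)."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the next slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

You would call `pacer.wait()` before each submission. It only knows about its own calls, which is exactly why another workload in the same account can still cause throttling.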
In the (1) textract-serverless-large-scale sample, SQS is used to limit the rate at which uploaded documents get submitted to Textract. SNS-based completion notification is also used (instead of e.g. polling the GetDocumentTextDetection API) to minimise pressure on quotas on the result side.
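On the result side, handling a completion notification rather than polling could look roughly like this. It's a sketch that assumes a Lambda function subscribed directly to the SNS topic (the sample itself may wire this differently, e.g. via a queue), and `parse_textract_notifications` is a hypothetical helper name.

```python
import json


def parse_textract_notifications(event):
    """Extract job details from an SNS-triggered Lambda event.

    Assumes each SNS message body is the JSON document Textract publishes
    on job completion, containing JobId and Status, and JobTag if one was
    set when the job was started.
    """
    jobs = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        jobs.append({
            "job_id": message["JobId"],
            "status": message["Status"],
            "job_tag": message.get("JobTag"),
        })
    return jobs
```

Only jobs whose status is `SUCCEEDED` then need a GetDocumentTextDetection call to fetch results, so the result-side quota is spent once per finished job instead of on repeated polls.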
In the (2) textract-transformer-pipeline sample, we take an alternative Step Functions-based approach using DynamoDB as a concurrency limiter. It eliminates the SQS polling aspect of serverless-large-scale, but as a result can suffer from large DynamoDB UpdateItem retry spikes when a large number of docs are uploaded all at once. This is why the utility function used for batch processing in notebook 1 applies some additional smoothing to the initiation of requests to the Textract state machine.
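The general idea of a DynamoDB-backed concurrency limiter can be sketched with a conditional UpdateItem on a counter item. This is not the sample's actual schema: the table layout, the `pk` and `in_flight` attribute names, and the helper functions are all assumptions for illustration.

```python
def build_acquire_slot_request(table_name, limit, counter_key="textract-concurrency"):
    """Build UpdateItem arguments that increment a counter only below a limit."""
    return {
        "TableName": table_name,
        "Key": {"pk": {"S": counter_key}},
        # Atomically add 1 to the in-flight job counter...
        "UpdateExpression": "ADD in_flight :one",
        # ...but only if the counter is missing or still under the limit.
        "ConditionExpression": "attribute_not_exists(in_flight) OR in_flight < :limit",
        "ExpressionAttributeValues": {
            ":one": {"N": "1"},
            ":limit": {"N": str(limit)},
        },
    }


def try_acquire_slot(dynamodb_client, table_name, limit):
    """Return True if a concurrency slot was reserved, False if at the limit."""
    try:
        dynamodb_client.update_item(**build_acquire_slot_request(table_name, limit))
        return True
    except dynamodb_client.exceptions.ConditionalCheckFailedException:
        return False
```

When the limit is reached, every waiting execution keeps retrying this conditional write, which is where the retry spikes come from. A matching decrement when each job finishes is also needed and is omitted here.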
I'd probably expect approach (1) to be a better starting point from a cost optimization perspective. The bigger your simultaneously arriving batches of data, the more beneficial it will be to pull them through some concurrency-limited queue rather than trying to handle them via retries.