Price comparisons can get complex with multi-stage analytic pipelines like the one you mention, but I'd usually suggest going with the asynchronous APIs (e.g. StartDocumentTextDetection) and multi-page processing.
- Although Amazon Textract pricing is per page either way, for scalable workloads it's important to consider how you'll orchestrate around the rate and concurrency quotas.
- The other key consideration (as it sounds like you've found already) is that the synchronous APIs don't support multi-page documents (as mentioned here).
In a distributed and event-driven architecture (for example, if your State Machine is triggered automatically whenever files are uploaded to Amazon S3), retries with backoff would be the usual recommended method for handling throttling. But ultimately, more retries per operation will usually translate into longer runtime, and therefore higher cost, in services like AWS Lambda (and/or more Step Functions state transitions). Dumping a large batch of documents in S3 to be pushed through Textract via Lambda with poorly configured retry settings could lead to many unnecessary retry attempts. Pushing multi-page files directly through the async APIs relieves your system of some of this orchestration burden.
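As a rough illustration of the retry-with-backoff pattern mentioned above, here's a minimal Python sketch using capped exponential backoff with full jitter. `ThrottledError` and `call_with_backoff` are hypothetical names for illustration, not part of any AWS SDK; in practice you'd catch the SDK's own throttling error, or lean on the SDK's built-in retry configuration.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error raised by the service call."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on ThrottledError.

    Waits a random fraction of a capped, exponentially growing delay between
    attempts. Returns (result, attempts_used) so the retry cost is visible.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except ThrottledError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to do
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)  # full jitter: anywhere between 0 and the cap
```

Every extra attempt here is time the calling function spends waiting, which is where the additional Lambda runtime (or extra state transitions) comes from.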
There are usually trade-offs between the scalability, run-time, and cost-efficiency of different concurrency management approaches. For example, a single-threaded Python script on a local laptop could carefully wait to submit documents one at a time and always stay under the maximum TPS limit... But it'd be very slow to process 1M+ docs, and might still get throttled occasionally if there was some other Textract workload in the account that it wasn't aware of!
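To make the single-threaded example concrete, a sketch of that kind of script might look like the following. `RateLimiter` and `submit_all` are hypothetical helpers, and `submit` stands in for whatever function actually calls Textract; the clock and sleep functions are injectable only so the pacing logic can be exercised without real waiting.

```python
import time


class RateLimiter:
    """Spaces out calls so that at most max_per_second are started per second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


def submit_all(documents, submit, limiter):
    """Submit documents one at a time, pausing between them as needed."""
    results = []
    for doc in documents:
        limiter.wait()
        results.append(submit(doc))
    return results
```

This only knows about its own submissions, so it can't account for other callers sharing the same account quota, which is the limitation described above.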
In the (1) textract-serverless-large-scale sample, SQS is used to limit the rate at which uploaded documents get submitted to Textract. SNS-based completion notification is also used (instead of e.g. polling the GetDocumentTextDetection API) to minimise pressure on quotas on the result side.
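For reference, requesting SNS completion notification when starting an async job looks roughly like this with boto3. This is a sketch, not code from the sample: `start_text_detection` is a hypothetical wrapper, and the bucket, key, topic ARN and role ARN are whatever applies in your account. The `textract` argument is assumed to be a `boto3.client("textract")`.

```python
def start_text_detection(textract, bucket, key, sns_topic_arn, role_arn, job_tag=None):
    """Start an async text detection job that notifies an SNS topic on completion.

    `textract` is assumed to be a boto3 Textract client. Returns the JobId,
    which is later used to fetch results once the notification arrives.
    """
    request = {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Textract publishes the job status to this topic using the given role
        "NotificationChannel": {"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    }
    if job_tag:
        request["JobTag"] = job_tag  # optional label echoed back in the notification
    response = textract.start_document_text_detection(**request)
    return response["JobId"]
```

With the notification in place, results only need to be fetched once per job after it completes, rather than repeatedly polling for status.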
In the (2) textract-transformer-pipeline sample, we take an alternative Step Functions-based approach using DynamoDB as a concurrency limiter. It eliminates the SQS polling aspect of serverless-large-scale, but as a result can suffer from large DDB UpdateItem retry spikes when a large number of docs are uploaded all at once. This is why the utility function used for batch processing in notebook 1 applies some additional smoothing to the initiation of requests to the Textract state machine.
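To give a feel for the general idea of a DynamoDB-based concurrency limiter (this is not the sample's actual code), one common pattern is a conditional UpdateItem that increments a counter only while it's below a limit. In the sketch below the table name, the `LimiterId` key and the `ActiveJobs` attribute are all hypothetical.

```python
def build_acquire_slot_request(table_name, limiter_id, max_concurrent):
    """Build UpdateItem parameters that claim one concurrency slot.

    The update increments the counter only if it is currently below
    max_concurrent; otherwise DynamoDB rejects it with a
    ConditionalCheckFailedException and the caller has to retry later.
    """
    return {
        "TableName": table_name,
        "Key": {"LimiterId": {"S": limiter_id}},
        "UpdateExpression": "SET #n = if_not_exists(#n, :zero) + :one",
        "ConditionExpression": "attribute_not_exists(#n) OR #n < :max",
        "ExpressionAttributeNames": {"#n": "ActiveJobs"},
        "ExpressionAttributeValues": {
            ":zero": {"N": "0"},
            ":one": {"N": "1"},
            ":max": {"N": str(max_concurrent)},
        },
    }


# Usage (assuming `dynamodb` is a boto3 DynamoDB client):
#   dynamodb.update_item(**build_acquire_slot_request("limiter-table", "textract", 10))
```

When many documents arrive at once, most of these conditional updates fail and get retried, which is the kind of retry spike described above.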
I'd probably expect approach (1) to be a better starting point from a cost-optimization perspective. The bigger your simultaneously arriving batches of data, the more beneficial it will be to pull them through some concurrency-limited queue rather than trying to handle the load via retries.
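The difference between "pull through a concurrency-limited queue" and "push everything and retry" can be sketched in plain Python with asyncio. This is only an in-process analogy for what SQS plus a bounded set of consumers does in approach (1); `process_with_limit` and `handler` are hypothetical names, and document IDs are assumed to be unique.

```python
import asyncio


async def process_with_limit(documents, handler, max_concurrent):
    """Process all documents with at most max_concurrent handlers running at once.

    Workers pull the next document only when they are free, so a large batch
    simply waits in the queue instead of causing throttled calls and retries.
    """
    queue = asyncio.Queue()
    for doc in documents:
        queue.put_nowait(doc)

    results = {}

    async def worker():
        while True:
            try:
                doc = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to do
            results[doc] = await handler(doc)

    await asyncio.gather(*(worker() for _ in range(max_concurrent)))
    return results
```

However large the batch, the number of in-flight requests never exceeds `max_concurrent`, so the downstream quota is respected without any failed attempts to pay for.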
Thank you very much for this very thorough explanation!