AWS architecture using AWS Textract
Hello,
We are seeing extremely slow performance when getting parsed results from AWS Textract. Our architecture is similar to this one (Text Extraction section). For instance, parsing a one-page PDF file takes around 40 seconds for the whole pipeline.
We want to improve our pipeline and make the results available as fast as possible. Currently the process consists of the following steps (following that architecture):
- Upload the file to an AWS S3 bucket through API Gateway
- Run an AWS Lambda function and send a message to AWS SQS from there
- In another AWS Lambda function, receive the message from AWS SNS when the job on the AWS Textract side is finished
- Write the parsed results to an S3 bucket
- Call API Gateway to get the results from the S3 bucket
Is there a way we could speed up the pipeline or any of its parts (AWS SQS, AWS SNS, AWS Lambda or even AWS Textract)?
Thank you in advance.
If there are only a few documents, then you are probably affected by Lambda cold starts. Depending on the language used for the Lambda functions, that adds more or less latency to the pipeline.
If that is the case you can use provisioned concurrency (https://docs.aws.amazon.com/lambda/latest/dg/provisioned-concurrency.html) for the functions. That adds additional costs to the pipeline.
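As a minimal sketch of that suggestion, provisioned concurrency can be configured with boto3's `put_provisioned_concurrency_config`. The function name, alias and instance count in the usage comment are hypothetical; the client is passed in so the helper can be tried without AWS credentials.

```python
def enable_provisioned_concurrency(lambda_client, function_name, alias, instances=1):
    """Keep `instances` execution environments initialized for one alias.

    `lambda_client` is expected to be boto3.client("lambda").
    """
    # Provisioned concurrency targets a published version or an alias,
    # not the unpublished $LATEST version.
    if alias == "$LATEST":
        raise ValueError("use a published version or an alias, not $LATEST")
    return lambda_client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=alias,
        ProvisionedConcurrentExecutions=instances,
    )

# Hypothetical usage (names are assumptions, not taken from the question):
#   import boto3
#   enable_provisioned_concurrency(
#       boto3.client("lambda"), "textract-result-handler", "live", instances=2)
```

Each provisioned instance is billed while it is configured, so a small number on the latency-critical function is probably the place to start.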
You can also eliminate the Lambda in Step 2 by using the S3 EventBridge integration and then defining a rule that puts the S3 event into SQS.
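A rough sketch of such a rule with boto3 follows. The rule name, bucket name, target id and queue ARN are hypothetical. It also assumes that EventBridge notifications are already enabled on the bucket and that the queue's access policy allows `events.amazonaws.com` to send messages.

```python
import json

def s3_object_created_pattern(bucket_name):
    """Event pattern matching "Object Created" events from a single bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket_name]}},
    }

def route_uploads_to_sqs(events_client, rule_name, bucket_name, queue_arn):
    """Create an EventBridge rule that forwards upload events to an SQS queue.

    `events_client` is expected to be boto3.client("events").
    """
    pattern = s3_object_created_pattern(bucket_name)
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )
    # "upload-queue" is an arbitrary target id chosen for this sketch.
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "upload-queue", "Arn": queue_arn}],
    )
    return pattern

# Hypothetical usage:
#   import boto3
#   route_uploads_to_sqs(boto3.client("events"), "pdf-uploaded",
#                        "my-upload-bucket",
#                        "arn:aws:sqs:eu-west-1:123456789012:textract-jobs")
```

This removes one Lambda invocation (and its possible cold start) from the path between the upload and the queue.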
It sounds like you are using the async Textract calls. These do take some time to process, and as far as I know processing times can vary and are not guaranteed. If you already have the individual pages extracted in advance, you can call the synchronous methods instead, which should be much faster.
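For reference, in boto3 the job-based (asynchronous) operations are the `start_*` ones, such as `start_document_text_detection` and `start_document_analysis`, which report completion through the `NotificationChannel`. The synchronous counterparts are `detect_document_text` and `analyze_document`, which return the result in the same call. A minimal sketch of the synchronous path, assuming the document is a single page (as far as I know, the synchronous operations only accept single-page PDF or TIFF files, or images); the bucket and key in the usage comment are hypothetical:

```python
def detect_lines_sync(textract_client, bucket, key):
    """Return the text lines of a single-page document stored in S3.

    `textract_client` is expected to be boto3.client("textract").
    The call blocks until Textract answers, so no SNS/SQS round trip is needed.
    """
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    # Blocks come in several types (PAGE, LINE, WORD); keep only whole lines.
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]

# Hypothetical usage:
#   import boto3
#   lines = detect_lines_sync(boto3.client("textract"),
#                             "my-upload-bucket", "invoices/page-1.pdf")
```

For forms and tables, `analyze_document` takes the same `Document` argument plus `FeatureTypes`, for example `["TABLES", "FORMS"]`.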
We don't lose much time on Step 2, since it's done in another microservice. Provisioned concurrency could be a useful solution.
No, we don't use an async Textract call. We are using start_document_text_detection and then start_document_analysis with different NotificationChannel arguments.