AWS architecture using AWS Textract


Hello,

We are seeing very slow performance when retrieving parsed results from AWS Textract. Our architecture is similar to this one (Text Extraction section). For instance, parsing a one-page PDF file takes around 40 seconds for the whole pipeline.

We want to improve our pipeline and make the results available as fast as possible. Currently the process consists of the following steps (according to the architecture):

  1. Upload the file to an S3 bucket through API Gateway.
  2. Run a Lambda function that sends a message to SQS.
  3. In another Lambda function, receive the message from SNS when the job on the Textract side is finished.
  4. Write the parsed results to an S3 bucket.
  5. Call API Gateway to get the results from the S3 bucket.

Is there a way to speed up the pipeline or any of its parts (AWS SQS, AWS SNS, AWS Lambda or even AWS Textract)?

Thank you in advance.

4 Answers

We don't lose much time on Step 2 since it's done in another microservice. Provisioned concurrency could be a useful solution.

No, we don't use an async Textract call. We are using start_document_text_detection and then start_document_analysis, each with different NotificationChannel arguments.

memu
answered 2 years ago

If there are only a few documents, then you are probably affected by Lambda cold starts. Depending on the language used for the Lambda function, that adds more or less latency to the pipeline.

If that is the case you can use provisioned concurrency (https://docs.aws.amazon.com/lambda/latest/dg/provisioned-concurrency.html) for the functions. That adds additional costs to the pipeline.
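
As a rough illustration, provisioned concurrency can be set from Python with boto3's put_provisioned_concurrency_config. This is only a sketch: the function name and alias in the usage comment are hypothetical, and the client is passed in so the helper can be tried without AWS access.

```python
def enable_provisioned_concurrency(lambda_client, function_name, alias, instances=1):
    """Ask Lambda to keep `instances` execution environments initialized."""
    return lambda_client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        # Must be a published version or an alias; $LATEST is not accepted.
        Qualifier=alias,
        ProvisionedConcurrentExecutions=instances,
    )


# Hypothetical usage (names are placeholders, not from the question):
#   import boto3
#   enable_provisioned_concurrency(
#       boto3.client("lambda"), "textract-result-handler", "live", instances=1
#   )
```

Each provisioned instance is billed while it is configured, so it is worth starting with a small number and measuring whether the 40 seconds actually drops.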

You can also eliminate the Lambda in Step 2 by using the S3 EventBridge integration and defining a rule that forwards the S3 event to SQS.
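
A minimal sketch of such a rule, using boto3's EventBridge client (put_rule and put_targets). The rule name, target ID, bucket and queue ARN are placeholders, and the ".pdf" suffix filter is an assumption about what gets uploaded. It also assumes EventBridge notifications are enabled on the bucket and that the queue's access policy allows events.amazonaws.com to send messages.

```python
import json


def s3_upload_pattern(bucket_name, suffix=".pdf"):
    """Event pattern matching 'Object Created' events for one bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket_name]},
            # Assumption: only PDF uploads should reach the queue.
            "object": {"key": [{"suffix": suffix}]},
        },
    }


def route_uploads_to_queue(events_client, rule_name, bucket_name, queue_arn):
    """Create (or update) the rule and attach the SQS queue as its target."""
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(s3_upload_pattern(bucket_name)),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "textract-queue", "Arn": queue_arn}],
    )
```

With a real client this would be called as route_uploads_to_queue(boto3.client("events"), ...), passing your own rule name, bucket and queue ARN.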

AWS
Marco
answered 2 years ago

It sounds like you are using the async Textract calls. These do take some time to process, and as far as I know processing times vary and are not guaranteed. If you already have the individual pages extracted in advance, you can call the synchronous methods, which should be much faster.
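
For reference, the start_* operations in boto3 are the asynchronous ones: they return a job ID and report completion through the notification channel. detect_document_text and analyze_document return the blocks directly in the response. A minimal sketch of the synchronous variant for a single-page document already in S3 (bucket and key are whatever your pipeline uses; the client is passed in so the helper can be tried without AWS access):

```python
def detect_text_sync(textract_client, bucket, key):
    """Run synchronous text detection and return the detected lines."""
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    # Blocks also include PAGE and WORD entries; keep whole lines only.
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]


# Usage with a real client:
#   import boto3
#   lines = detect_text_sync(boto3.client("textract"), bucket, key)
```

The synchronous operations only handle single-page documents, so multi-page PDFs would still need the asynchronous path or splitting beforehand.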

AlbertK
answered 2 years ago

Would sending the base64 encoded image bytes directly through to Textract be of any help?

Visually speaking:

base64 -> API Gateway -> Lambda -> Textract
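
A rough sketch of what the Lambda in that flow might look like, assuming API Gateway proxy integration with the request body being the base64-encoded document. handle_upload is a hypothetical helper name; a real lambda_handler(event, context) would call it with boto3.client("textract"). boto3 takes raw bytes for Document.Bytes, so the body is decoded first.

```python
import base64
import json


def handle_upload(event, textract_client):
    """Decode the uploaded document and run synchronous text detection."""
    # Assumption: the whole request body is the base64-encoded file.
    document_bytes = base64.b64decode(event["body"])
    response = textract_client.detect_document_text(
        Document={"Bytes": document_bytes}
    )
    lines = [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]
    # Proxy integration expects a status code and a string body.
    return {"statusCode": 200, "body": json.dumps({"lines": lines})}
```

This skips S3, SQS and SNS entirely, but the document has to fit within the API Gateway and Lambda request payload limits and be a single page, so it is worth checking the current quotas before relying on it.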

answered 2 years ago
