AWS architecture using AWS Textract


Hello,

We are seeing extremely slow performance when retrieving parsed results from AWS Textract. Our architecture is similar to this one (Text Extraction section). For instance, parsing a one-page PDF file takes around 40 seconds for the whole pipeline.

We want to improve our pipeline and make the results available as fast as possible. Currently, the process consists of the following steps (following that architecture):

  1. Upload the file to an AWS S3 bucket through API Gateway.
  2. Run an AWS Lambda that sends a message to AWS SQS.
  3. In another AWS Lambda, receive the message from AWS SNS when the job on the AWS Textract side has finished.
  4. Write the parsed results to an S3 bucket.
  5. Call API Gateway to get the results from the S3 bucket.

Is there a way to speed up the pipeline or any of its parts (AWS SQS, AWS SNS, AWS Lambda or even AWS Textract)?

Thank you in advance.

4 Answers

We don't lose much time on Step 2, since it's done in another microservice. Provisioned concurrency could be a useful solution.

No, we don't use an async Textract call. We are using start_document_text_detection and then start_document_analysis, with different NotificationChannel arguments.

memu
answered 2 years ago

If there are only a few documents, then you are probably affected by Lambda cold starts. Depending on the language used for the Lambda functions, that adds more or less latency to the pipeline.

If that is the case, you can use provisioned concurrency (https://docs.aws.amazon.com/lambda/latest/dg/provisioned-concurrency.html) for the functions. That adds extra cost to the pipeline.
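As a rough sketch of what enabling provisioned concurrency could look like with boto3: the function name, alias, and instance count below are hypothetical placeholders, and the helper `enable_provisioned_concurrency` is only an illustration, not part of the pipeline described in the question.

```python
def enable_provisioned_concurrency(function_name, alias, instances, lambda_client=None):
    """Ask Lambda to keep `instances` execution environments initialized.

    All argument values are assumptions chosen by the caller.
    """
    if lambda_client is None:
        import boto3  # imported lazily so the sketch can be tried with a stub client
        lambda_client = boto3.client("lambda")
    # Provisioned concurrency is configured on a published version or an alias,
    # not on $LATEST, so `alias` must point at a published version.
    return lambda_client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=alias,
        ProvisionedConcurrentExecutions=instances,
    )


# Hypothetical usage:
# enable_provisioned_concurrency("textract-result-handler", "live", 2)
```

Starting with a small number of instances and watching the function's duration metrics is probably the cheapest way to check whether cold starts are really where the time goes.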

You can also eliminate the Lambda in Step 2 by using the S3 EventBridge integration and then defining a rule that puts the S3 event into SQS.
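A minimal sketch of such a rule with boto3, assuming EventBridge notifications are already enabled on the bucket and that the queue's resource policy allows EventBridge to send messages. The rule name, target id, bucket name, and queue ARN are hypothetical.

```python
import json


def build_s3_object_created_pattern(bucket_name):
    """Event pattern matching 'Object Created' events from one S3 bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket_name]}},
    }


def route_uploads_to_sqs(bucket_name, queue_arn, rule_name="textract-uploads",
                         events_client=None):
    """Create (or update) a rule that forwards upload events to an SQS queue."""
    if events_client is None:
        import boto3  # imported lazily so the sketch can be tried with a stub client
        events_client = boto3.client("events")
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(build_s3_object_created_pattern(bucket_name)),
        State="ENABLED",
    )
    # The target id is an arbitrary label that is unique within the rule.
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "upload-queue", "Arn": queue_arn}],
    )
```

Whatever consumes the queue then receives the EventBridge event, with the bucket name and object key under `detail`, instead of a message written by the Step 2 Lambda.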

AWS
Marco
answered 2 years ago

It sounds like you are using the async Textract calls. These will indeed take some time to process, and as far as I know processing times can vary and are not guaranteed. If you already have the individual pages extracted in advance, you can call the synchronous methods, which should be much faster.
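For reference, in boto3 the `start_document_text_detection` and `start_document_analysis` operations start asynchronous jobs that report completion through the notification channel, while `detect_document_text` and `analyze_document` are the synchronous counterparts that return the blocks directly in the response. A minimal sketch of the synchronous path for a single-page document already in S3 follows; the bucket and key are hypothetical, and the helper name `detect_text_sync` is only for illustration.

```python
def detect_text_sync(bucket, key, textract_client=None):
    """Return the detected text lines of a single-page document stored in S3."""
    if textract_client is None:
        import boto3  # imported lazily so the sketch can be tried with a stub client
        textract_client = boto3.client("textract")
    # Synchronous call: the result comes back in the response, no SNS/SQS hop.
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    # Keep only LINE blocks; PAGE and WORD blocks are also present in the response.
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]


# Hypothetical usage:
# lines = detect_text_sync("my-upload-bucket", "incoming/invoice.pdf")
```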

AlbertK
answered 2 years ago

Would sending the base64 encoded image bytes directly through to Textract be of any help?

Visually speaking:

base64 -> API Gateway -> Lambda -> Textract
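A rough sketch of the Lambda in that flow, assuming API Gateway uses a proxy integration and passes the base64 string through unchanged as the request body. The factory `make_handler` and the response shape are hypothetical; request and payload size limits on API Gateway, Lambda, and Textract would still apply.

```python
import base64
import json


def make_handler(textract_client):
    """Build a Lambda handler bound to a Textract client (real or stub)."""

    def handler(event, context):
        # Assumption: event["body"] holds the base64-encoded document.
        document_bytes = base64.b64decode(event["body"])
        # boto3 expects the raw bytes here, hence the decode above.
        response = textract_client.detect_document_text(
            Document={"Bytes": document_bytes}
        )
        lines = [
            block["Text"]
            for block in response["Blocks"]
            if block["BlockType"] == "LINE"
        ]
        return {"statusCode": 200, "body": json.dumps({"lines": lines})}

    return handler


# In the deployed function one would bind a real client once at import time:
# import boto3
# handler = make_handler(boto3.client("textract"))
```

This skips the S3 upload and the SNS/SQS hops entirely, at the cost of only working for documents small enough to fit in a single request.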

answered 2 years ago
