AWS architecture using AWS Textract

Hello,

We are seeing extremely slow performance when retrieving parsed results from AWS Textract. Our architecture is similar to this one (Text Extraction section). For instance, parsing a one-page PDF file takes around 40 seconds for the whole pipeline.

We want to improve our pipeline and make the results available as fast as possible. Currently the process consists of the following steps (according to the architecture):

  1. Upload the file to AWS S3 bucket from API Gateway
  2. Run AWS Lambda and send message to AWS SQS from there
  3. In another AWS Lambda receive the message from AWS SNS when the job on AWS Textract side is ready.
  4. Write parsed results to S3 bucket
  5. Call API Gateway to get results from S3 bucket.

Is there a way we could enhance the pipeline or any parts of the pipeline (AWS SQS, AWS SNS, AWS Lambda or even AWS Textract)?

Thank you in advance.

4 Answers

We don't lose much time on Step 2, since it's done in another microservice. Provisioned concurrency could be a useful solution.

No, we don't use an async Textract call. We are using start_document_text_detection and then start_document_analysis, with different NotificationChannel arguments.

memu
answered 2 years ago

If there are only a few documents, then you are probably affected by Lambda cold starts. Depending on the language used for the Lambda function, that adds more or less latency to the pipeline.

If that is the case, you can use provisioned concurrency (https://docs.aws.amazon.com/lambda/latest/dg/provisioned-concurrency.html) for the functions. That adds additional cost to the pipeline.
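As a rough illustration, provisioned concurrency can be configured through boto3's Lambda client with `put_provisioned_concurrency_config`. This is only a sketch: the helper name, the function name, and the alias used in the comments are hypothetical, and the client is passed in so the helper can be tried without AWS credentials.

```python
def enable_provisioned_concurrency(lambda_client, function_name, alias, instances=1):
    """Ask Lambda to keep `instances` execution environments initialized.

    `lambda_client` is expected to behave like boto3.client("lambda").
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return lambda_client.put_provisioned_concurrency_config(
        FunctionName=function_name,
        # Provisioned concurrency applies to a published version or an alias,
        # not to $LATEST.
        Qualifier=alias,
        ProvisionedConcurrentExecutions=instances,
    )


# Hypothetical usage (names are assumptions, not taken from the question):
#   import boto3
#   enable_provisioned_concurrency(
#       boto3.client("lambda"), "textract-result-handler", "live", instances=2)
```

Whether this helps depends on how much of the 40 seconds is actually cold-start time, so it may be worth checking the init duration in the function logs first.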

You can also eliminate the Lambda in Step 2 by using the S3 EventBridge integration and then defining a rule that puts the S3 event into SQS.
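A minimal sketch of such a rule, assuming EventBridge notifications are enabled on the bucket and the queue's resource policy allows EventBridge to send messages. The helper names, rule name, and target ID are hypothetical; `put_rule` and `put_targets` are the boto3 EventBridge client calls.

```python
import json


def s3_upload_pattern(bucket, key_prefix=""):
    """Build an EventBridge event pattern matching new objects in `bucket`."""
    pattern = {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket]}},
    }
    if key_prefix:
        # Optional: only match keys under a given prefix.
        pattern["detail"]["object"] = {"key": [{"prefix": key_prefix}]}
    return pattern


def route_uploads_to_queue(events_client, rule_name, bucket, queue_arn, key_prefix=""):
    """Create the rule and point it at the SQS queue.

    `events_client` is expected to behave like boto3.client("events").
    """
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(s3_upload_pattern(bucket, key_prefix)),
        State="ENABLED",
    )
    return events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "textract-queue", "Arn": queue_arn}],
    )
```

Note that the message arriving in SQS is then the EventBridge event, so the consumer would read the bucket and key from `detail.bucket.name` and `detail.object.key` rather than from the classic S3 notification format.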

AWS
Marco
answered 2 years ago

It sounds like you are using an async Textract call. This will indeed take some time to process, and as far as I know, processing times can vary and are not guaranteed. If you already have the individual pages extracted in advance, you can call the synchronous methods, which should be much faster.
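For comparison with the start_document_* calls, a synchronous request could look roughly like the sketch below. It uses boto3's `detect_document_text`, which returns the result in the same call, so no SNS notification is involved. The helper names are hypothetical, and it assumes the document is a single page in a format the synchronous API accepts.

```python
def extract_lines(response):
    """Return the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def detect_text_sync(textract_client, bucket, key):
    """Run synchronous text detection on one S3 object.

    `textract_client` is expected to behave like boto3.client("textract").
    """
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return extract_lines(response)
```

If forms or tables are needed, the synchronous counterpart of start_document_analysis is `analyze_document`, which additionally takes a `FeatureTypes` list.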

AlbertK
answered 2 years ago

Would sending the base64 encoded image bytes directly through to Textract be of any help?

Visually speaking:

base64 -> API Gateway -> Lambda -> Textract
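A rough sketch of what the Lambda in that flow might look like, assuming an API Gateway proxy integration that delivers the base64 string in `event["body"]`. The `make_handler` factory is hypothetical and exists only so the client can be injected. boto3 expects raw bytes in `Document["Bytes"]`, so the body is decoded first.

```python
import base64
import json


def make_handler(textract_client):
    """Build a Lambda handler around a boto3-like Textract client."""

    def handler(event, context):
        body = event.get("body") or ""
        try:
            image_bytes = base64.b64decode(body, validate=True)
        except (ValueError, TypeError):
            image_bytes = b""
        if not image_bytes:
            return {"statusCode": 400,
                    "body": json.dumps({"error": "body must be base64 data"})}

        # Synchronous call: the parsed result comes back in this response.
        response = textract_client.detect_document_text(
            Document={"Bytes": image_bytes}
        )
        lines = [b["Text"] for b in response.get("Blocks", [])
                 if b.get("BlockType") == "LINE"]
        return {"statusCode": 200, "body": json.dumps({"lines": lines})}

    return handler


# Hypothetical wiring in the deployed module:
#   import boto3
#   handler = make_handler(boto3.client("textract"))
```

This skips S3, SQS, and SNS entirely, but the document then has to fit within the request size limits of API Gateway, Lambda, and the synchronous Textract API.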

answered 2 years ago
