Scaling AWS Step Functions and Comprehend jobs under the concurrent active asynchronous jobs quota

  1. I am trying to implement a solution that integrates AWS Comprehend targeted sentiment with Step Functions, and then expose it publicly as an API.

  2. I need to wait until the job is complete before the workflow can move forward. Since the Comprehend job is asynchronous, I created a wait-time poller that periodically checks the job's status using describe_targeted_sentiment_detection_job, following an integration pattern similar to https://docs.aws.amazon.com/step-functions/latest/dg/sample-project-job-poller.html.

  3. However, there seems to be a concurrent active asynchronous jobs quota of 10 jobs, according to https://docs.aws.amazon.com/comprehend/latest/dg/guidelines-and-limits.html#limits-active-jobs. If this is the case, I was thinking of creating another poller that checks whether Comprehend has a free slot for targeted sentiment before starting another Comprehend job.

  4. Step Functions charges for each polling cycle, and there is a concurrent job limit of 10, so I am worried about the backlog and the resulting costs if many Step Functions executions are started at once. For example, if 1000 workflows are started, workflow number 1000 will have to poll for an available Comprehend job slot for a long time.

Does anyone know if a solution is available to get around the concurrent active asynchronous jobs quota or to reduce the cost of step functions continually polling for a long time?

1 Answer

First, I would check if there is a way to increase the limit. A lot of service limits are soft and can be increased.

In either case, you will probably not be able to increase the limit to very high numbers, so you will need to handle the concurrency yourself. One way of doing this would be to use DynamoDB to count how many active jobs you have. Before you start the Comprehend job, you try to increment a counter in DDB with a condition that it is < Limit. If the update succeeds, you go to the next step and start the Comprehend job. If it fails, you go into a Wait state and then try again. When the Comprehend job finishes, you decrement the counter, without a condition. To reduce the number of state transitions, the Wait state used while waiting for a free slot should be longer than the Wait state used while polling the running job.
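The counter approach above could be sketched roughly as follows. This is only an illustration, not a tested production implementation: the table name, key, and attribute names (`ComprehendJobSemaphore`, `LockName`, `ActiveJobs`) are hypothetical, and `client` is assumed to be a low-level DynamoDB client such as the one returned by `boto3.client("dynamodb")`.

```python
from typing import Any

# Hypothetical names, chosen only for this sketch.
TABLE_NAME = "ComprehendJobSemaphore"
COUNTER_KEY = {"LockName": {"S": "targeted-sentiment"}}
JOB_LIMIT = 10  # concurrent active asynchronous jobs quota from the question


def try_acquire_slot(client: Any, limit: int = JOB_LIMIT) -> bool:
    """Atomically increment the counter, but only while it is below the limit.

    Returns True if a slot was taken, False if the limit is already reached.
    """
    try:
        client.update_item(
            TableName=TABLE_NAME,
            Key=COUNTER_KEY,
            UpdateExpression="ADD ActiveJobs :one",
            # A missing attribute is treated as zero active jobs.
            ConditionExpression=(
                "attribute_not_exists(ActiveJobs) OR ActiveJobs < :limit"
            ),
            ExpressionAttributeValues={
                ":one": {"N": "1"},
                ":limit": {"N": str(limit)},
            },
        )
        return True
    except client.exceptions.ConditionalCheckFailedException:
        # The condition failed: all slots are in use, so the caller should wait.
        return False


def release_slot(client: Any) -> None:
    """Unconditionally decrement the counter once the Comprehend job is done."""
    client.update_item(
        TableName=TABLE_NAME,
        Key=COUNTER_KEY,
        UpdateExpression="ADD ActiveJobs :minus_one",
        ExpressionAttributeValues={":minus_one": {"N": "-1"}},
    )
```

In a state machine, the same two updates could also be issued from Task states that call DynamoDB UpdateItem directly, with the conditional check failure routed to the longer Wait state. Whichever way it is wired, the release step should also run on the failure path of the job, otherwise a failed execution would leak a slot.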

A different, more complex solution, but with fewer state transitions, might be to use the Wait For Callback pattern. Every time the state machine fails to increment the DDB counter, it adds a callback token to the DDB item and pauses. You create a DDB stream with a Lambda function that, every time the value of the counter goes below the limit, takes a token from DDB and makes a callback with that token, which resumes the waiting execution. You can create a filter for the Lambda that consumes the stream to reduce the number of invocations.
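The core of the stream-triggered Lambda described above might look something like the sketch below. It rests on several assumptions that are not in the answer: waiting task tokens are kept in a string-set attribute called `WaitingTokens` on the same (hypothetical) counter item, `ddb` is a low-level DynamoDB client, and `sfn` is a Step Functions client (for example `boto3.client("stepfunctions")`). The woken execution is expected to retry the conditional increment itself.

```python
import json
from typing import Any, Optional

# Hypothetical names, chosen only for this sketch.
TABLE_NAME = "ComprehendJobSemaphore"
COUNTER_KEY = {"LockName": {"S": "targeted-sentiment"}}


def wake_one_waiter(ddb: Any, sfn: Any, limit: int = 10) -> Optional[str]:
    """If a slot is free and an execution is waiting, resume one execution.

    Returns the task token that was resumed, or None if nothing was done.
    """
    item = ddb.get_item(
        TableName=TABLE_NAME, Key=COUNTER_KEY, ConsistentRead=True
    ).get("Item", {})
    active = int(item.get("ActiveJobs", {"N": "0"})["N"])
    tokens = item.get("WaitingTokens", {}).get("SS", [])
    if active >= limit or not tokens:
        return None

    token = tokens[0]
    try:
        # Remove the token only if it is still present, so that two
        # concurrent invocations cannot both resume the same execution.
        ddb.update_item(
            TableName=TABLE_NAME,
            Key=COUNTER_KEY,
            UpdateExpression="DELETE WaitingTokens :tok",
            ConditionExpression="contains(WaitingTokens, :t)",
            ExpressionAttributeValues={
                ":tok": {"SS": [token]},
                ":t": {"S": token},
            },
        )
    except ddb.exceptions.ConditionalCheckFailedException:
        return None  # another invocation already took this token

    # Resume the paused state machine; it should then retry the increment.
    sfn.send_task_success(taskToken=token, output=json.dumps({"retry": True}))
    return token
```

A Lambda handler would call `wake_one_waiter` once per stream batch. Because the resumed execution still has to win the conditional increment, an execution that loses the race would simply register a new token and wait again.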

Uri (AWS, Expert)
answered 2 years ago
  • On the SFn + DDB concurrency side, I'm aware of this SAM-based sample which shows a nice pattern. I previously ported it to CDK in this (Python) sample. However, for really big bursts you still have lots of DDB UpdateItem retry requests with that approach.
