Maximum concurrency and retries for a SageMaker pipeline triggered by EventBridge

Hi everyone,

I have a SageMaker pipeline that trains a model and runs inference when a new object is uploaded to an S3 bucket. Orchestration is done by EventBridge. I would like to set a concurrency limit that defines the maximum number of executions of my SageMaker pipeline running simultaneously. When the limit is reached, unprocessed events should be queued somewhere and processed once the number of running executions drops below the limit. Is it possible to achieve that with EventBridge and a SageMaker pipeline? Also, how can I configure a retry policy for when the pipeline fails to process an event?

Many thanks in advance for your help!

1 Answer

To enforce a maximum concurrency limit for your SageMaker pipeline and queue unprocessed events, you need to combine several AWS services, because EventBridge alone does not support concurrency limits or queuing. Here is one approach:

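  1. Manage Concurrency: Use AWS Lambda as an intermediary between EventBridge and your SageMaker pipeline. A concurrency limit on the Lambda function caps how many invocations run at once, but starting a pipeline execution is asynchronous, so the function returns long before the pipeline finishes. To cap the number of running pipelines, have the function check how many executions are still in progress (for example with ListPipelineExecutions) and start a new one only when that count is below your limit.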

  2. Queue Unprocessed Events: Use Amazon Simple Queue Service (SQS) to hold events that arrive while the concurrency limit is reached. Then trigger your SageMaker pipeline from the SQS queue as capacity becomes available.

  3. Retry Policy for Failures: Configure step-level retry policies in your SageMaker pipeline definition so that transient failures during pipeline execution are retried automatically.
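The three steps above could be wired together roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes EventBridge delivers the S3 "Object Created" event to an SQS queue, that the Lambda function below consumes that queue with `ReportBatchItemFailures` enabled on the event source mapping, and that the pipeline defines two parameters named `InputBucket` and `InputKey` (hypothetical names, as are the environment variables and their defaults). Because `start_pipeline_execution` returns as soon as the execution is created, the function counts executions still in progress before starting another one.

```python
import json
import os

# Hypothetical configuration; adjust to your own pipeline and limit.
PIPELINE_NAME = os.environ.get("PIPELINE_NAME", "my-pipeline")
MAX_RUNNING = int(os.environ.get("MAX_RUNNING_PIPELINES", "2"))

# Statuses that mean an execution still occupies a concurrency slot.
ACTIVE_STATUSES = ("Executing", "Stopping")


def count_running(sm, pipeline_name):
    """Count executions of the pipeline that are still in progress."""
    running = 0
    kwargs = {"PipelineName": pipeline_name}
    while True:
        page = sm.list_pipeline_executions(**kwargs)
        for summary in page["PipelineExecutionSummaries"]:
            if summary["PipelineExecutionStatus"] in ACTIVE_STATUSES:
                running += 1
        token = page.get("NextToken")
        if not token:
            return running
        kwargs["NextToken"] = token


def process_records(records, sm, pipeline_name, max_running):
    """Start one execution per SQS record while slots are free.

    Records that do not fit are reported as batch item failures, so SQS
    makes them visible again after the visibility timeout.
    """
    running = count_running(sm, pipeline_name)
    failures = []
    for record in records:
        if running >= max_running:
            failures.append({"itemIdentifier": record["messageId"]})
            continue
        # The SQS message body is the EventBridge event as JSON.
        detail = json.loads(record["body"])["detail"]
        sm.start_pipeline_execution(
            PipelineName=pipeline_name,
            PipelineParameters=[
                {"Name": "InputBucket", "Value": detail["bucket"]["name"]},
                {"Name": "InputKey", "Value": detail["object"]["key"]},
            ],
            # The SQS message ID doubles as an idempotency token.
            ClientRequestToken=record["messageId"],
        )
        running += 1
    return {"batchItemFailures": failures}


def handler(event, context):
    import boto3  # imported here so the module loads without boto3 installed

    sm = boto3.client("sagemaker")
    return process_records(event["Records"], sm, PIPELINE_NAME, MAX_RUNNING)
```

Setting the function's reserved concurrency to 1 keeps the check-then-start sequence from racing with another invocation. Messages that keep failing can be moved to a dead-letter queue through the SQS queue's redrive policy.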

Also, how can I configure a retry policy for when the pipeline fails to process an event?

SageMaker Pipelines lets you attach a retry policy to individual pipeline steps, so a step that fails with a retryable error is re-run automatically. For setup details, see the Retry Policy Configuration Guidelines.
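As an illustration of what such a policy looks like, here is a minimal sketch that adds a `RetryPolicies` list to a step in the pipeline definition JSON, built as a Python dict. The step name `TrainModel`, the empty `Arguments`, and the specific intervals and attempt counts are assumptions for the example; the helper `with_retry_policies` is hypothetical. The SageMaker Python SDK exposes the same settings through the `retry_policies` argument on its step classes.

```python
import copy
import json

# Two policies: one for service-side faults and throttling, one for
# SageMaker job errors. Values are illustrative, not recommendations.
RETRY_POLICIES = [
    {
        "ExceptionType": ["Step.SERVICE_FAULT", "Step.THROTTLING"],
        "IntervalSeconds": 1,
        "BackoffRate": 2.0,
        "MaxAttempts": 5,
    },
    {
        "ExceptionType": [
            "SageMaker.JOB_INTERNAL_ERROR",
            "SageMaker.CAPACITY_ERROR",
            "SageMaker.RESOURCE_LIMIT",
        ],
        "IntervalSeconds": 60,
        "BackoffRate": 2.0,
        "ExpireAfterMin": 30,  # stop retrying after 30 minutes
    },
]


def with_retry_policies(step, policies):
    """Return a copy of a step definition with retry policies attached."""
    updated = copy.deepcopy(step)
    updated["RetryPolicies"] = copy.deepcopy(policies)
    return updated


if __name__ == "__main__":
    # Hypothetical training step; real steps carry full job arguments.
    step = {"Name": "TrainModel", "Type": "Training", "Arguments": {}}
    print(json.dumps(with_retry_policies(step, RETRY_POLICIES), indent=2))
```

These policies retry a single step inside a running execution. They do not restart an execution that has already failed as a whole.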

A dead-letter queue (DLQ) on an EventBridge target captures events that EventBridge could not deliver to that target; it does not manage concurrency. With a DLQ in place, failed events are kept for later analysis or reprocessing instead of being lost, which helps with troubleshooting and recovering from delivery failures. For detailed setup and usage, refer to the EventBridge Dead-Letter Queue Configuration in the AWS documentation.
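A DLQ and a delivery retry policy are both set on the rule's target. The sketch below shows one way to do that with the EventBridge `put_targets` call, assuming the target is an SQS queue; the rule name, target ID, and ARNs are placeholders, and the helper names `build_queue_target` and `attach_target` are hypothetical.

```python
def build_queue_target(target_id, queue_arn, dlq_arn,
                       max_attempts=3, max_age_seconds=3600):
    """Build an EventBridge target entry with a retry policy and a DLQ."""
    return {
        "Id": target_id,
        "Arn": queue_arn,
        # How long and how often EventBridge retries delivery.
        "RetryPolicy": {
            "MaximumRetryAttempts": max_attempts,
            "MaximumEventAgeInSeconds": max_age_seconds,
        },
        # Where events go once the retries are exhausted.
        "DeadLetterConfig": {"Arn": dlq_arn},
    }


def attach_target(events_client, rule_name, target):
    """Attach the target to a rule and fail loudly on rejected entries."""
    response = events_client.put_targets(Rule=rule_name, Targets=[target])
    if response.get("FailedEntryCount", 0) > 0:
        raise RuntimeError(f"put_targets failed: {response.get('FailedEntries')}")
    return response


if __name__ == "__main__":
    import boto3  # imported here so the module loads without boto3 installed

    target = build_queue_target(
        "pipeline-trigger-queue",  # placeholder target ID
        "arn:aws:sqs:us-east-1:123456789012:pipeline-events",      # placeholder
        "arn:aws:sqs:us-east-1:123456789012:pipeline-events-dlq",  # placeholder
    )
    attach_target(boto3.client("events"), "s3-object-created", target)
```

The DLQ needs a resource policy that allows `events.amazonaws.com` to send messages to it, otherwise undeliverable events are dropped.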

EXPERT
answered 2 months ago
