Amazon SageMaker - Training Job / Data Wrangler

Question

I have a customer who is interested in testing Amazon SageMaker and has the following questions:

Q1. When submitting a training job in Amazon SageMaker, if the job fails because of insufficient capacity, is there an automatic retry mechanism? If so, how is it set up?
Q2. Is the underlying SQL / MySQL infrastructure in Data Wrangler an AWS serverless database backend?
Q3. What is the backend database that supports SageMaker / SageMaker Data Wrangler?

Use case: Vision ML - Object detection (self-built algorithm)
Framework: TensorFlow 2.4.4

Answer

Hi,

Q1/ This is not a built-in feature of the training job API. You'd need to implement it on your side with a try/except-style retry mechanism. If you are instead using SageMaker Pipelines to start the jobs, that does have such functionality (see: Retry mechanism).
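To make the client-side option concrete, here is a minimal sketch of a retry loop in Python. It assumes `client` behaves like `boto3.client("sagemaker")` (only `create_training_job` and `describe_training_job` are used), and it assumes that a capacity shortfall shows up as a job that ends in `Failed` with "CapacityError" in its `FailureReason`, rather than as an exception from the create call. The function name, the `-tryN` name suffix, and the backoff values are made up for illustration, so please adapt them to your setup.

```python
import time


def run_training_with_retry(client, job_request, max_attempts=3,
                            backoff_seconds=60, poll_seconds=30):
    """Submit a training job and resubmit it if it fails for capacity reasons.

    `client` is expected to behave like boto3.client("sagemaker").
    `job_request` is the dict of keyword arguments for create_training_job.
    """
    base_name = job_request["TrainingJobName"]
    for attempt in range(1, max_attempts + 1):
        # Training job names must be unique, so each attempt gets its own suffix
        # (hypothetical naming scheme).
        name = f"{base_name}-try{attempt}"
        client.create_training_job(**{**job_request, "TrainingJobName": name})

        # Poll until the job reaches a terminal state.
        while True:
            desc = client.describe_training_job(TrainingJobName=name)
            status = desc["TrainingJobStatus"]
            if status in ("Completed", "Failed", "Stopped"):
                break
            time.sleep(poll_seconds)

        if status == "Completed":
            return name

        reason = desc.get("FailureReason", "")
        # Assumption: capacity failures carry "CapacityError" in FailureReason.
        if status == "Failed" and "CapacityError" in reason and attempt < max_attempts:
            time.sleep(backoff_seconds * attempt)  # simple linear backoff
            continue

        raise RuntimeError(f"Training job {name} ended with {status}: {reason}")
```

You would call it with the same request dict you already pass to `create_training_job`, for example `run_training_with_retry(boto3.client("sagemaker"), job_request)`. Only capacity failures are retried here; any other failure is raised immediately so that real training errors are not hidden behind retries.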

Q2/Q3/ SageMaker Data Wrangler does NOT implement a database. It does offer the option to connect to a number of data sources, including databases. Is this what you meant? Can you elaborate a bit more on what you are looking for with these two points?

Thank you,
G

Answer

Q1: To retry a training job, you can use EventBridge integrated with SageMaker Pipelines. For details, please refer to the following link:
https://docs.aws.amazon.com/sagemaker/latest/dg/pipeline-eventbridge.html
