How to run additional steps in SageMaker Pipelines?


I am running SageMaker Experiments based on the example here: https://aws.amazon.com/blogs/machine-learning/track-your-ml-experiments-end-to-end-with-data-version-control-and-amazon-sagemaker-experiments/. Once the experiments are done, I want to create a training/processing pipeline via SageMaker Pipelines steps. Similar to the sample in the link above, I want to run an additional step/script to track the model output via DVC (https://dvc.org/). Does SageMaker Pipelines have any custom steps we can define, so that we can trigger additional steps that invoke DVC commands? I am aware there is a Lambda step we can add, but I am not sure if this is the right way to do it. In short, the Lambda would need to keep track of the output location of the model (which I assume can be passed from the previous step in the pipeline), then run DVC commands to track it, and then run Git commands to push it back to the repo.

1 Answer

I'm not 100% sure that I understand the question, but in the blog post you are referencing, data is being pushed to DVC from within the SageMaker processing job and training job. You could do the same thing in the processing and training jobs you want to include in the SageMaker Pipeline.
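If you take that approach, the DVC and Git calls live inside the script that the processing or training job runs. A minimal sketch, assuming `dvc` and `git` are installed in the job's container and the Git repo is already cloned and configured (remote, credentials) in the working directory:

```python
import subprocess

def track_with_dvc(local_path: str, message: str) -> None:
    """Track a job output path with DVC and push the pointer file to Git."""
    subprocess.run(["dvc", "add", local_path], check=True)           # stage the data with DVC
    subprocess.run(["git", "add", f"{local_path}.dvc", ".gitignore"], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)     # commit the .dvc pointer file
    subprocess.run(["git", "push"], check=True)                      # push the pointer to the Git remote
    subprocess.run(["dvc", "push"], check=True)                      # push the data to the DVC remote

# e.g. at the end of a processing script:
# track_with_dvc("/opt/ml/processing/output", "Track processed dataset")
```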

If there is another reason why you need to run DVC commands in a custom step after the training and processing steps have completed, then I would look at using a Lambda step first. If you expect your code to run for more than 15 minutes, or if there is another reason why Lambda is not a suitable choice for compute, you could use a Callback step. This step sends a message to an Amazon SQS queue, and you can trigger any process you want when you receive this message. When your process has finished running, you use an API call to inform SageMaker that the step has finished running.
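For the Lambda route, the pipeline SDK lets you pass the model artifact location from the training step into the Lambda as an input, which covers the requirement in the question. A minimal sketch, assuming a training step object named `train_step` already exists and a pre-existing Lambda function runs your DVC/Git logic (the function ARN below is hypothetical):

```python
from sagemaker.lambda_helper import Lambda
from sagemaker.workflow.lambda_step import LambdaStep

# Hypothetical Lambda function that runs the dvc/git commands.
dvc_lambda = Lambda(
    function_arn="arn:aws:lambda:us-east-1:111122223333:function:track-model-with-dvc"
)

track_step = LambdaStep(
    name="TrackModelWithDVC",
    lambda_func=dvc_lambda,
    # The trained model's S3 location, resolved at runtime from the training step.
    inputs={"model_s3_uri": train_step.properties.ModelArtifacts.S3ModelArtifacts},
)
```

If the work could exceed Lambda's 15-minute limit, a Callback step posts a message to an SQS queue you own and waits for your worker to report back via the SendPipelineExecutionStepSuccess (or SendPipelineExecutionStepFailure) API. A sketch with a hypothetical queue URL:

```python
from sagemaker.workflow.callback_step import (
    CallbackStep, CallbackOutput, CallbackOutputTypeEnum,
)

track_callback = CallbackStep(
    name="TrackModelWithDVCCallback",
    sqs_queue_url="https://sqs.us-east-1.amazonaws.com/111122223333/dvc-tracking",  # hypothetical
    inputs={"model_s3_uri": train_step.properties.ModelArtifacts.S3ModelArtifacts},
    outputs=[CallbackOutput(output_name="status", output_type=CallbackOutputTypeEnum.String)],
)
# Your queue consumer runs the DVC/Git commands, then calls:
# sagemaker_client.send_pipeline_execution_step_success(CallbackToken=..., OutputParameters=[...])
```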

AWS
S_Moose
answered a year ago
  • @S_Moose - thanks. The link I posted is just an example. For my implementation, I want to run training/preprocessing via SageMaker Pipelines steps, whereas in the example everything is done in the notebook (all the DVC and Git commands). My understanding is that once you create the pipeline and run it, you can come back to it and run it from the SageMaker Studio UI; that is why I need to do this via code. So when the processing step runs and finishes generating its data, I want to check that data in via DVC. Also, once training is finished and the output is dumped to an S3 bucket, I want to track that via DVC, so I will need to run those dvc add and git commit commands. Also, this is off topic, but can one configure the output bucket beforehand, so that once training is done the model output is dumped to the S3 URI that I want?
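On the side question about configuring the output bucket up front: the training job's artifact destination is controlled by the estimator's `output_path`, so you can point it at the S3 URI you want before the pipeline runs. A minimal sketch with hypothetical bucket and role values:

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=training_image_uri,                          # your training container image
    role="arn:aws:iam::111122223333:role/SageMakerRole",   # hypothetical execution role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-model-bucket/models/",            # model.tar.gz is written under this URI
)
```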
