Is it possible to create parallel pipelines in SageMaker?


I want to connect one processing pipeline to multiple training pipelines so that I can compare algorithm accuracy. The same dataset would be used to train multiple algorithms, and each trained model would then generate predictions. My future goal is to consolidate the prediction results of the different algorithms into a combined result. Is this possible in SageMaker?

Example Schema:

            - Train_Algo1      
 Process    - Train_Algo2    - Predict Result
            - Train_AlgoN
asked 2 years ago · 1887 views
1 Answer

I'd recommend checking out SageMaker Pipelines for this - especially if you're able to use SageMaker Studio for the graphical pipeline management UIs.

You can build your pipeline definition through the SageMaker Python SDK, just like you might normally define Training and Processing jobs. In fact pipeline steps (like TrainingStep) typically just wrap around the standalone constructs (like Estimator) that you might be using already.

Pipeline steps are executed in parallel by default, unless there is an implicit dependency (one step consuming another step's properties data) or an explicit dependency (depends_on) between them.
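To make the scheduling rule concrete, here is a minimal sketch in plain Python. It is not SageMaker SDK code: it only models the idea that steps with no unmet dependency can run together, using the standard library's graphlib. The step names are taken from the example schema in the question, and the function name parallel_waves is made up for illustration.

```python
from graphlib import TopologicalSorter


def parallel_waves(depends_on):
    """Group steps into waves; all steps within a wave could run concurrently.

    `depends_on` maps each step name to the steps it must wait for, which is
    a simplified stand-in for implicit/explicit pipeline dependencies.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Every step returned here has all of its dependencies satisfied.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# The schema from the question: one processing step fans out to N training
# steps, which then fan back in to a single prediction/consolidation step.
schema = {
    "Process": [],
    "Train_Algo1": ["Process"],
    "Train_Algo2": ["Process"],
    "Train_AlgoN": ["Process"],
    "Predict": ["Train_Algo1", "Train_Algo2", "Train_AlgoN"],
}

for wave in parallel_waves(schema):
    print(wave)
```

In a real pipeline definition you wouldn't compute this yourself: as long as the training steps depend only on the processing step and not on each other, the service should be free to run them side by side.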

SM Pipelines can take parameters, so you could expose necessary training hyperparameters or pre-processing parameters up to the pipeline level, and use the pipeline to kick off multiple end-to-end runs with different configurations.

By turning on step caching, you can prevent your pre-processing from being re-run when its input parameters are unchanged. Note, however, that caching doesn't look at ongoing executions. It's better to trigger one pipeline execution first and wait for its processing step to complete, rather than triggering ~20 all at once, in which case none of them would see a cached processing result and all of them would re-run the job.
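The "start one, wait for processing, then start the rest" pattern could be sketched roughly as below. This is a hedged outline rather than a drop-in script: the helper names (param_grid, launch_with_warmup), the step name "Process", and the example parameter names are all hypothetical, and the two callables are injected so the logic stays independent of any particular SDK version.

```python
import itertools
import time

# Step statuses treated as "finished" (assumed to match what the
# pipeline execution steps listing reports).
TERMINAL = {"Succeeded", "Failed", "Stopped"}


def param_grid(options):
    """Expand {"name": [values, ...]} into one parameter dict per combination."""
    names = sorted(options)
    combos = itertools.product(*(options[n] for n in names))
    return [dict(zip(names, combo)) for combo in combos]


def launch_with_warmup(start, step_status, configs, warm_step="Process",
                       poll_seconds=30, sleep=time.sleep):
    """Start one execution, wait for its cacheable step, then start the rest.

    start(params) -> execution handle          (caller-supplied)
    step_status(execution, step_name) -> str   (caller-supplied)
    """
    if not configs:
        return []
    first = start(configs[0])
    # Poll until the pre-processing step has finished, so that later
    # executions have a completed result available to reuse.
    while step_status(first, warm_step) not in TERMINAL:
        sleep(poll_seconds)
    return [first] + [start(params) for params in configs[1:]]
```

With the SageMaker Python SDK, I'd expect start to be something like a thin wrapper around pipeline.start(parameters=params), and step_status to look up the named step in the output of execution.list_steps(), but please check both against the SDK version you're using.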

Pipelines also automatically tag SageMaker Experiments config (Pipeline = Experiment; Execution = Trial; Step = Trial Component), which you can then use to plot and compare multiple training jobs in the SM Studio UI. So, for example, your pipeline might just be Pre-process > Train > Evaluate > RegisterModel. If you right-click your pipeline's "Experiment" in the SM Studio Experiments and Trials view, you can open a list of the executed training jobs and select several of them to scatter-plot the final loss/accuracy against the hyperparameters.

If you run your evaluation as a SageMaker Processing Job which outputs a JSON file in the model quality metrics format, you can even have your pipeline load the model into SM Model Registry tagged with this data. This way, you'd be able to see and compare the metrics between model versions (and even charts, e.g. ROC curves in the classification case) through the SM Studio Model Registry UI.
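As a rough illustration of what the evaluation script might write, here is a small sketch. The nested layout (a metrics group such as binary_classification_metrics, then a metric name, then value and standard_deviation) reflects my understanding of the model quality metrics format, so please verify it against the current documentation. The function names, the file name evaluation.json, and the output directory are assumptions: the directory has to match whatever output location your processing job is configured with.

```python
import json
from pathlib import Path


def build_quality_report(accuracy):
    """Build a metrics dict in the (assumed) model quality metrics layout."""
    return {
        "binary_classification_metrics": {
            "accuracy": {
                "value": float(accuracy),
                # No spread estimate is available from a single evaluation run.
                "standard_deviation": "NaN",
            }
        }
    }


def write_quality_report(accuracy, output_dir, filename="evaluation.json"):
    """Write the report as JSON into output_dir and return the file path."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / filename
    path.write_text(json.dumps(build_quality_report(accuracy), indent=2))
    return path


def best_algorithm(reports):
    """Return the name of the algorithm whose report has the highest accuracy."""
    def accuracy_of(name):
        return reports[name]["binary_classification_metrics"]["accuracy"]["value"]
    return max(reports, key=accuracy_of)
```

Writing one such file per training branch would also give you a simple way to compare the algorithms yourself, e.g. by loading each report and passing them to best_algorithm.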


answered 2 years ago
