Spot instance integration with automl interactive runner

I'm working on an ML project and currently use the following workflow:

  1. Use SageMaker Studio to propose test configurations for my project.
  2. Export to SageMaker notebooks.
  3. Adjust instance types.
  4. Run the project from the SageMaker notebook.

Step 1 produces a notebook with 4-6 candidates, which are trained initially before hyperparameter optimization runs later on. The whole process uses the sagemaker_autoML.AutoMLInteractiveRunner pipeline. An example candidate before modification is shown below:

automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp0",
        "training_resource_config": {
            "instance_type": "ml.m5.12xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": True,
        "transformed_data_format": "text/csv",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "xgboost",
        "training_resource_config": {
            "instance_type": "ml.m5.12xlarge",
            "instance_count": 1,
        },
    }
})

My datasets usually contain only ~700 training examples and ~100 test examples, so I've found that switching to ml.g4dn.xlarge for training and ml.m5.large for inference cuts the cost by a factor of 3 and is even faster due to GPU acceleration. I've been able to change the instance types this way without issue, but I'd also like to enable spot instances. My attempted changes to the candidates look like this:

automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp3",
        "training_resource_config": {
            "instance_type": "ml.g4dn.xlarge",  # works
            "instance_count": 1,
            "use_spot_instances": True,  # does not work
            "max_run": 1800,  # does not work
            "max_wait": 3600,  # does not work
            "volume_size_in_gb": 50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.large",
            "instance_count": 1,
        },
        "transforms_label": True,
        "transformed_data_format": "text/csv",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "xgboost",
        "training_resource_config": {
            "instance_type": "ml.g4dn.xlarge",  # works
            "instance_count": 1,
            "use_spot_instances": True,  # does not work
            "max_run": 1800,  # does not work
            "max_wait": 3600  # does not work
        },
    }
})
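
For comparison, in a regular SageMaker training job outside the runner, the three keys tried above correspond to fields of the CreateTrainingJob request: EnableManagedSpotTraining and, under StoppingCondition, MaxRuntimeInSeconds and MaxWaitTimeInSeconds. The sketch below only builds that part of the request as a dictionary. The helper name spot_training_fields is hypothetical, and nothing here is confirmed to be accepted or forwarded by AutoMLInteractiveRunner.

```python
def spot_training_fields(use_spot_instances, max_run, max_wait=None):
    """Build the spot-related part of a CreateTrainingJob request.

    Hypothetical helper for illustration only. It mirrors the
    use_spot_instances / max_run / max_wait names used in the candidate
    definition and maps them to the low-level request field names.
    """
    stopping_condition = {"MaxRuntimeInSeconds": int(max_run)}

    if use_spot_instances:
        # With managed spot training, the wait time covers the run time
        # plus any time spent waiting for spot capacity, so it must be
        # at least as large as the run time.
        if max_wait is None:
            raise ValueError("max_wait is required when spot is enabled")
        if max_wait < max_run:
            raise ValueError("max_wait must be >= max_run")
        stopping_condition["MaxWaitTimeInSeconds"] = int(max_wait)

    return {
        "EnableManagedSpotTraining": bool(use_spot_instances),
        "StoppingCondition": stopping_condition,
    }


# Same values as in the attempted candidate definition.
fields = spot_training_fields(True, max_run=1800, max_wait=3600)
print(fields)
```

These fields would be merged into a full training job request alongside the algorithm, role, and data settings, which are omitted here.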

However, the training jobs this notebook creates after modification show managed spot training as disabled, which is confirmed by the billable time and training time being equal. My questions are as follows:

  1. How can I turn on spot instances for training these candidates?
  2. How can I turn on spot instances for hyperparameter optimization?
  3. Where is the documentation for sagemaker_autoML.AutoMLInteractiveRunner?

Any other recommendations on how to streamline this workflow are also welcome!
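
The billable-time check mentioned in the question can be made explicit. This is a minimal sketch, assuming a dictionary shaped like a DescribeTrainingJob response with the fields EnableManagedSpotTraining, TrainingTimeInSeconds, and BillableTimeInSeconds. The function name and the sample values are hypothetical and are not taken from real jobs.

```python
def spot_savings_percent(job_description):
    """Return the managed spot savings for one training job, in percent.

    Uses (1 - BillableTimeInSeconds / TrainingTimeInSeconds) * 100.
    A result of 0 means billable time equals training time, which is
    what an on-demand (non-spot) job looks like.
    """
    training = job_description["TrainingTimeInSeconds"]
    billable = job_description["BillableTimeInSeconds"]
    if training <= 0:
        raise ValueError("TrainingTimeInSeconds must be positive")
    return (1 - billable / training) * 100


# Hypothetical job descriptions, for illustration only.
on_demand_job = {
    "EnableManagedSpotTraining": False,
    "TrainingTimeInSeconds": 500,
    "BillableTimeInSeconds": 500,
}
spot_job = {
    "EnableManagedSpotTraining": True,
    "TrainingTimeInSeconds": 500,
    "BillableTimeInSeconds": 250,
}

for name, job in [("on-demand", on_demand_job), ("spot", spot_job)]:
    print(name, job["EnableManagedSpotTraining"], spot_savings_percent(job))
```

In practice the dictionary would come from describing each training job the notebook launched, so both the spot flag and the savings can be checked per job.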
mluser
asked 6 months ago, 339 views
1 Answer

Hello,

Thank you for using SageMaker Service.

Autopilot doesn't support GPU training for the tabular use case. To investigate any issues with the jobs created by Autopilot, we would require the job ARN and other details, which we do not recommend sharing via this portal. If the issue persists, I would highly encourage you to open a case with Support Engineering so it can be investigated further.

To open a support case with AWS, use this link:

https://console.aws.amazon.com/support/home?#/case/create

AWS
Support Engineer
answered 5 months ago