How to use dynamic arguments in SageMaker job using Step Functions?

I am using AWS Step Functions to orchestrate some SageMaker training jobs, but I can't get the Var value passed dynamically to each job. Each launched container should run with the specific argument listed under Jobs in the following YAML definition. Is there a workaround? I have tried using environment variables, but those don't seem to support dynamic values either.

Comment: "Run  SageMaker jobs in parallel with a max concurrency of 4"
StartAt: PrepareJobList
States:
  PrepareJobList:
    Type: Pass
    Result:
      Jobs: 
        - { Var: "X" }
        - { Var: "XX" }
    ResultPath: "$.JobList"
    Next: RunJobs

  RunJobs:
    Type: Map
    Iterator:
      StartAt: GenerateJobName
      States:
        GenerateJobName:
          Type: Pass
          Parameters:
            "TrainingJobName.$": "States.Format('Job-{}-{}', $.Var, States.UUID())"
            "Var.$": "$.Var"
          ResultPath: "$.JobDetails"
          Next: RunSageMakerJob

        RunSageMakerJob:
          Type: Task
          Resource: arn:aws:states:::sagemaker:createTrainingJob.sync
          Parameters:
            TrainingJobName.$: "$.JobDetails.TrainingJobName"
            AlgorithmSpecification:
              TrainingImage: "X"
              TrainingInputMode: "File"
              ContainerEntrypoint:
                - python
                - src/script.py
              ContainerArguments:
                - "--var"
                - "$.JobDetails.Var"
            StoppingCondition:
              MaxRuntimeInSeconds: 36000  
            RoleArn: "X"  
            ResourceConfig:
              InstanceType: "ml.m5.large"  
              InstanceCount: 1  
              VolumeSizeInGB: 50  
            OutputDataConfig:
              S3OutputPath: "x"
          Retry:
            - ErrorEquals: ["ResourceLimitExceeded"]
              IntervalSeconds: 300
              MaxAttempts: 10
              BackoffRate: 2.0
          End: true
    ItemsPath: "$.JobList.Jobs"
    MaxConcurrency: 4
    End: true

1 Answer
Accepted Answer

It's not exactly clear from the question, but I'm guessing the error you see is that the unresolved "$.JobDetails.Var" string is what gets passed through to your job?

Unlike TrainingJobName.$, your ContainerArguments key doesn't end with .$, so Step Functions treats its contents as literal strings rather than as paths to resolve.

However, since this field is an array, I believe you'll need to build it with the States.Array intrinsic function, something like:

ContainerArguments.$: "States.Array('--var', $.JobDetails.Var)"

Unfortunately I don't have a similar complete example to hand to easily test with.
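On the container side, nothing special should be needed once the argument resolves: the script receives `--var` like any other CLI flag. Here is a minimal sketch of what a hypothetical `src/script.py` might do, assuming it uses `argparse` (the question doesn't show the script, so the function name and help text are illustrative only):

```python
import argparse


def parse_args(argv=None):
    """Parse the CLI arguments that follow the container entrypoint."""
    parser = argparse.ArgumentParser(description="Hypothetical src/script.py")
    # "--var" matches the flag passed via ContainerArguments in the question.
    parser.add_argument("--var", required=True, help="Per-job value from the Map item")
    return parser.parse_args(argv)


# Simulate the argument list the container would see for the first Map item.
args = parse_args(["--var", "X"])
print(f"Running job for var={args.var}")
```

In the real script you would call `parse_args()` with no arguments so that it reads `sys.argv`. If the literal string `$.JobDetails.Var` shows up in `args.var`, the path was not resolved by Step Functions.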

For what it's worth, you may prefer to use the HyperParameters dictionary for this. Hyperparameters are more visible in the SageMaker console and UIs than container arguments, and would be eligible for things like automatic hyperparameter tuning later if you wanted to use that. Any hyperparameters you provide are automatically written to a /opt/ml/input/config/hyperparameters.json file in your container.
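If you go the HyperParameters route, the script can read that file directly. The sketch below assumes the state machine sets a hyperparameter under the hypothetical key `var`; the helper name and the fallback to an empty dict for local runs are illustrative choices, not part of any SageMaker API:

```python
import json
from pathlib import Path

# Path named in the answer above; SageMaker writes it before the container starts.
HYPERPARAMS_PATH = Path("/opt/ml/input/config/hyperparameters.json")


def load_hyperparameters(path=HYPERPARAMS_PATH):
    """Return hyperparameters as a dict, or {} when the file is absent (e.g. local runs)."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open() as f:
        return json.load(f)


hyperparams = load_hyperparameters()
# "var" is a hypothetical key; values arrive as strings, so cast as needed.
var = hyperparams.get("var")
print(f"var={var}")
```

Because hyperparameter values are passed as strings, numeric settings need an explicit `int(...)` or `float(...)` conversion in the script.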

If you wanted, you could use the sagemaker-training-toolkit in your custom container to load environment variables & run your script with hyperparam CLI arguments like SageMaker Script Mode does. If you're already using one of the AWS-provided framework images, they'll already have something like this (might be slightly different e.g. the sagemaker-pytorch-training-toolkit or similar).

AWS Expert, answered 1 year ago
AWS Expert, reviewed 9 months ago
  • The States.Array trick solved it! Thank you for the detailed explanation.
