Why does my Amazon SageMaker pipeline execution fail?


I want to troubleshoot why my Amazon SageMaker pipeline execution failed.

Resolution

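To troubleshoot a failed pipeline execution in SageMaker, complete the following steps:

**Note:** If you receive errors when you run AWS CLI commands, make sure that you're using the most recent version of the AWS CLI.

1.    Run the AWS Command Line Interface (AWS CLI) command list-pipeline-executions:

**Note:** If the AWS CLI isn't configured on your local machine, use AWS CloudShell instead.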

$ aws sagemaker list-pipeline-executions --pipeline-name test-pipeline-p-wzx9cplzrvdk

The command returns a list of executions for the pipeline. The output looks similar to the following:

{
    "PipelineExecutionSummaries": [
        {
            "PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/lvejn1jl827b",
            "StartTime": "2022-09-27T12:56:44.646000+00:00",
            "PipelineExecutionStatus": "Failed",
            "PipelineExecutionDisplayName": "execution-1664283404791",
            "PipelineExecutionFailureReason": "Step failure: One or multiple steps failed."
        },
        {
            "PipelineExecutionArn": "arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/acvref9y1f47",
            "StartTime": "2022-09-27T12:13:28.762000+00:00",
            "PipelineExecutionStatus": "Succeeded",
            "PipelineExecutionDisplayName": "execution-1664280808943"
        }
    ]
}

2.    Run the list-pipeline-execution-steps command to see which steps failed:

$ aws sagemaker list-pipeline-execution-steps --pipeline-execution-arn arn:aws:sagemaker:eu-west-1:1111222233334444:pipeline/test-pipeline-p-wzx9cplzrvdk/execution/lvejn1jl827b

The output looks similar to the following:

{
    "PipelineExecutionSteps": [
        {
            "StepName": "TrainAbaloneModel",
            "StartTime": "2022-09-27T13:00:49.235000+00:00",
            "EndTime": "2022-09-27T13:01:50.056000+00:00",
            "StepStatus": "Failed",
            "AttemptCount": 0,
            "FailureReason": "ClientError: ClientError: Please ensure the security group provided is valid",
            "Metadata": {
                "TrainingJob": {
                    "Arn": "arn:aws:sagemaker:eu-west-1:1111222233334444:training-job/pipelines-lvejn1jl827b-trainabalonemodel-u9l9wjassg"
                }
            }
        },
        {
            "StepName": "PreprocessAbaloneData",
            "StartTime": "2022-09-27T12:56:45.595000+00:00",
            "EndTime": "2022-09-27T13:00:48.638000+00:00",
            "StepStatus": "Succeeded",
            "AttemptCount": 0,
            "Metadata": {
                "ProcessingJob": {
                    "Arn": "arn:aws:sagemaker:eu-west-1:1111222233334444:processing-job/pipelines-lvejn1jl827b-preprocessabalonedat-6axq0kthyg"
                }
            }
        }
    ]
}

In this example, the training step failed because the VpcConfig object for the training job specified a security group that doesn't exist.
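The two steps above can also be scripted. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) is installed and credentials are configured. The helper names `failed_executions`, `failed_steps`, and `report_failures` are hypothetical and are not part of the original article. The two filtering helpers work on plain response dictionaries, so they can be checked without calling AWS.

```python
def failed_executions(response):
    """Return the execution summaries whose PipelineExecutionStatus is "Failed"."""
    return [
        summary
        for summary in response.get("PipelineExecutionSummaries", [])
        if summary.get("PipelineExecutionStatus") == "Failed"
    ]


def failed_steps(response):
    """Return (StepName, FailureReason) pairs for steps whose StepStatus is "Failed"."""
    return [
        (step["StepName"], step.get("FailureReason", ""))
        for step in response.get("PipelineExecutionSteps", [])
        if step.get("StepStatus") == "Failed"
    ]


def report_failures(pipeline_name):
    """Print the failed steps of each failed execution (first page of results only)."""
    # Imported here so the helpers above can be used without boto3 installed.
    import boto3

    sagemaker = boto3.client("sagemaker")
    executions = sagemaker.list_pipeline_executions(PipelineName=pipeline_name)
    for execution in failed_executions(executions):
        arn = execution["PipelineExecutionArn"]
        steps = sagemaker.list_pipeline_execution_steps(PipelineExecutionArn=arn)
        for name, reason in failed_steps(steps):
            print(f"{arn}: step {name} failed: {reason}")
```

For example, `report_failures("test-pipeline-p-wzx9cplzrvdk")` would print one line per failed step for the pipeline used in this article. Both API calls are paginated, so a pipeline with many executions or steps would also need to follow `NextToken`.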

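If the failure reason for the failed step is unclear, check Amazon CloudWatch Logs for the failed SageMaker job or endpoint to troubleshoot further. You can view training job logs in the CloudWatch log group /aws/sagemaker/TrainingJobs. The log stream names look similar to the following: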

example-training-job-name/algo-example-instance-number-in-cluster-example-epoch-timestamp
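To illustrate how the log stream name relates to the step output, the sketch below derives the training job name from the `Arn` in the step metadata and reads matching events from the /aws/sagemaker/TrainingJobs log group. It assumes boto3 is installed and credentials are configured. `training_job_name_from_arn` and `print_training_job_logs` are hypothetical helper names, and only the first page of events is read.

```python
LOG_GROUP = "/aws/sagemaker/TrainingJobs"


def training_job_name_from_arn(arn):
    """Extract the job name from an ARN of the form ...:training-job/<job-name>."""
    marker = ":training-job/"
    if marker not in arn:
        raise ValueError(f"not a training job ARN: {arn}")
    return arn.split(marker, 1)[1]


def print_training_job_logs(arn):
    """Print log messages from streams whose names start with "<job-name>/"."""
    # Imported here so the ARN helper can be used without boto3 installed.
    import boto3

    logs = boto3.client("logs")
    job_name = training_job_name_from_arn(arn)
    response = logs.filter_log_events(
        logGroupName=LOG_GROUP,
        logStreamNamePrefix=f"{job_name}/",
    )
    for event in response["events"]:
        print(event["message"])
```

A job that fails before its container starts, such as the security group failure shown earlier, may have produced no log streams, in which case the call returns no events.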


Related information

Log Amazon SageMaker events with Amazon CloudWatch

AWS OFFICIAL Updated 2 years ago