How do I troubleshoot scheduled notebook jobs in SageMaker Studio?


I get an error when I run a scheduled notebook job in Amazon SageMaker Studio.

Short description

Two common errors can prevent scheduled notebook jobs from running in SageMaker Studio:

  • AccessDenied errors
  • UI errors when you try to update a job

Resolution

AccessDenied errors

AccessDenied errors usually involve one of the following:

  • AWS Identity and Access Management (IAM) policies
  • Virtual private cloud (VPC) endpoint policies
  • Resource tag exceptions

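IAM policy issues

AccessDenied errors are usually caused by missing permissions, so make sure that the IAM role for your notebook jobs follows best practices. To establish the basic trust relationship, the role needs the following trust policy: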

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "events.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
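To confirm that a role's trust policy matches the one above, you can parse the document and check which required services it trusts. The following Python sketch reports which of sagemaker.amazonaws.com and events.amazonaws.com are missing. It only considers Allow statements that list sts:AssumeRole exactly and it ignores conditions, so treat it as a quick sanity check, not as IAM policy evaluation. The helper names trusted_services and missing_trusted_services are hypothetical.

```python
import json

# Trust policy from above, for example copied from the role's "Trust relationships" tab.
TRUST_POLICY_JSON = """
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Principal": {"Service": "sagemaker.amazonaws.com"}, "Action": "sts:AssumeRole"},
    {"Effect": "Allow", "Principal": {"Service": "events.amazonaws.com"}, "Action": "sts:AssumeRole"}
  ]
}
"""

# Services that must be able to assume the notebook job role.
REQUIRED_SERVICES = {"sagemaker.amazonaws.com", "events.amazonaws.com"}


def _as_list(value):
    """IAM accepts a single value or a list in most elements; normalize to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, (str, dict)) else list(value)


def trusted_services(policy: dict) -> set:
    """Return the service principals that an Allow statement lets call sts:AssumeRole."""
    services = set()
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in _as_list(statement.get("Action")):
            continue
        principal = statement.get("Principal")
        # A Principal of "*" (a string) names no specific service, so skip it.
        if isinstance(principal, dict):
            services.update(_as_list(principal.get("Service")))
    return services


def missing_trusted_services(policy: dict) -> set:
    """Return the required services that the trust policy does not trust."""
    return REQUIRED_SERVICES - trusted_services(policy)


if __name__ == "__main__":
    policy = json.loads(TRUST_POLICY_JSON)
    print(sorted(missing_trusted_services(policy)))  # prints []
```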

Also, verify that your IAM role has the following permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::*:role/*",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": [
            "sagemaker.amazonaws.com",
            "events.amazonaws.com"
          ]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "events:TagResource",
        "events:DeleteRule",
        "events:PutTargets",
        "events:DescribeRule",
        "events:PutRule",
        "events:RemoveTargets",
        "events:DisableRule",
        "events:EnableRule"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket",
        "s3:PutBucketVersioning",
        "s3:PutEncryptionConfiguration"
      ],
      "Resource": "arn:aws:s3:::sagemaker-automated-execution-*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:ListTags"
      ],
      "Resource": [
        "arn:aws:sagemaker:*:*:user-profile/*",
        "arn:aws:sagemaker:*:*:space/*",
        "arn:aws:sagemaker:*:*:training-job/*",
        "arn:aws:sagemaker:*:*:pipeline/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:AddTags"
      ],
      "Resource": [
        "arn:aws:sagemaker:*:*:training-job/*",
        "arn:aws:sagemaker:*:*:pipeline/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateNetworkInterface",
        "ec2:CreateNetworkInterfacePermission",
        "ec2:CreateVpcEndpoint",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteNetworkInterfacePermission",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeVpcEndpoints",
        "ec2:DescribeVpcs",
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetAuthorizationToken",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetEncryptionConfiguration",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:GetObject",
        "sagemaker:DescribeDomain",
        "sagemaker:DescribeUserProfile",
        "sagemaker:DescribeSpace",
        "sagemaker:DescribeStudioLifecycleConfig",
        "sagemaker:DescribeImageVersion",
        "sagemaker:DescribeAppImageConfig",
        "sagemaker:CreateTrainingJob",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:StopTrainingJob",
        "sagemaker:Search",
        "sagemaker:CreatePipeline",
        "sagemaker:DescribePipeline",
        "sagemaker:DeletePipeline",
        "sagemaker:StartPipelineExecution"
      ],
      "Resource": "*"
    }
  ]
}
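When a role has several attached policies, it can be hard to spot which action is missing. The following Python sketch checks a policy document against a short list of required actions taken from the policy above. It compares action names case-insensitively and treats * and ? as wildcards, but it deliberately ignores Resource, Condition, NotAction, and Deny statements, so a clean result does not guarantee that a request succeeds. EXAMPLE_POLICY is a hypothetical input, and missing_actions is a hypothetical helper name.

```python
import fnmatch

# A subset of the actions from the policy above that scheduled notebook jobs use.
REQUIRED_ACTIONS = [
    "iam:PassRole",
    "events:PutRule",
    "s3:CreateBucket",
    "sagemaker:CreateTrainingJob",
    "sagemaker:StartPipelineExecution",
]

# Hypothetical policy to check; replace it with your role's policy document.
EXAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "iam:PassRole", "Resource": "*"},
        {"Effect": "Allow", "Action": ["events:*", "s3:CreateBucket"], "Resource": "*"},
        {"Effect": "Allow", "Action": "sagemaker:CreateTrainingJob", "Resource": "*"},
    ],
}


def _as_list(value):
    """IAM accepts a single value or a list in most elements; normalize to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, (str, dict)) else list(value)


def allowed_action_patterns(policy: dict) -> list:
    """Collect lowercased Action patterns from Allow statements."""
    patterns = []
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") == "Allow":
            patterns.extend(action.lower() for action in _as_list(statement.get("Action")))
    return patterns


def missing_actions(policy: dict, required: list) -> list:
    """Return the required actions that no Allow statement's Action pattern matches."""
    patterns = allowed_action_patterns(policy)
    return [
        action
        for action in required
        if not any(fnmatch.fnmatchcase(action.lower(), pattern) for pattern in patterns)
    ]


if __name__ == "__main__":
    print(missing_actions(EXAMPLE_POLICY, REQUIRED_ACTIONS))
    # prints ['sagemaker:StartPipelineExecution']
```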

For more information, see AWS managed policies for SageMaker notebooks.

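VPC endpoint issues

If you launch notebook jobs through VPC endpoints, then check the configuration and policies of those endpoints. Make sure that you follow the steps and best practices for the relevant service endpoints.

For Amazon S3 VPC endpoints, the most common error is an endpoint policy that's restricted to a single account. For example, the following policy allows S3 requests only for resources that are owned by the account 111122223333: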

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowSpecificAccountsPermission",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "s3:*",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "s3:ResourceAccount": "111122223333"
        }
      }
    }
  ]
}
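To find this kind of restriction in an existing endpoint policy, you can list every Allow statement that pins s3:ResourceAccount to specific accounts. The following Python sketch does that for the policy above. It only inspects the StringEquals and StringLike operators, and pinned_resource_accounts is a hypothetical helper name.

```python
# Endpoint policy from above.
ENDPOINT_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSpecificAccountsPermission",
            "Effect": "Allow",
            "Principal": {"AWS": "*"},
            "Action": "s3:*",
            "Resource": "*",
            "Condition": {"StringEquals": {"s3:ResourceAccount": "111122223333"}},
        }
    ],
}


def _as_list(value):
    """IAM accepts a single value or a list in most elements; normalize to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, (str, dict)) else list(value)


def pinned_resource_accounts(policy: dict) -> dict:
    """Map each Allow statement (by Sid, else index) to the accounts it pins via s3:ResourceAccount."""
    pinned = {}
    for index, statement in enumerate(_as_list(policy.get("Statement"))):
        if statement.get("Effect") != "Allow":
            continue
        for operator in ("StringEquals", "StringLike"):
            accounts = statement.get("Condition", {}).get(operator, {}).get("s3:ResourceAccount")
            if accounts is not None:
                label = statement.get("Sid", f"statement-{index}")
                pinned.setdefault(label, []).extend(_as_list(accounts))
    return pinned


if __name__ == "__main__":
    for label, accounts in pinned_resource_accounts(ENDPOINT_POLICY).items():
        print(f"{label}: S3 requests allowed only for resources owned by {', '.join(accounts)}")
    # prints: AllowSpecificAccountsPermission: S3 requests allowed only for resources owned by 111122223333
```

If a statement shows up here, requests to buckets in any other account are allowed only when a different statement in the same endpoint policy covers them.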

In this case, you must also add a statement to the endpoint policy that allows access to the following buckets:

{
  "Action": [
    "s3:*"
  ],
  "Resource": [
    "arn:aws:s3:::sagemakerheadlessexecution-prod-*",
    "arn:aws:s3:::sagemakerheadlessexecution-prod-*/*"
  ],
  "Effect": "Allow",
  "Principal": "*",
  "Sid": "SCTASK14554266"
}
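To see why the extra statement is needed, the following Python sketch appends it to the single-account endpoint policy and runs a deliberately simplified evaluation: an S3 request counts as allowed if some Allow statement matches its action and resource ARN (with * wildcards) and, when present, its StringEquals s3:ResourceAccount condition. Principals, Deny statements, and other condition operators are ignored. The account ID 444455556666, the bucket name suffix, and the helper names add_statement and is_allowed are placeholders for illustration, not the real owner or names of these buckets.

```python
import copy
import fnmatch

# Single-account endpoint policy from the earlier example.
BASE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSpecificAccountsPermission",
            "Effect": "Allow",
            "Principal": {"AWS": "*"},
            "Action": "s3:*",
            "Resource": "*",
            "Condition": {"StringEquals": {"s3:ResourceAccount": "111122223333"}},
        }
    ],
}

# The statement above; "Principal": "*" is set because endpoint policy statements name a principal.
HEADLESS_STATEMENT = {
    "Sid": "SCTASK14554266",
    "Effect": "Allow",
    "Principal": "*",
    "Action": ["s3:*"],
    "Resource": [
        "arn:aws:s3:::sagemakerheadlessexecution-prod-*",
        "arn:aws:s3:::sagemakerheadlessexecution-prod-*/*",
    ],
}


def _as_list(value):
    """IAM accepts a single value or a list in most elements; normalize to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, (str, dict)) else list(value)


def add_statement(policy: dict, statement: dict) -> dict:
    """Return a copy of the policy with the statement appended; the input is not modified."""
    merged = copy.deepcopy(policy)
    merged["Statement"] = _as_list(merged.get("Statement")) + [copy.deepcopy(statement)]
    return merged


def _statement_allows(statement, action, resource_arn, resource_account):
    if statement.get("Effect") != "Allow":
        return False
    if not any(fnmatch.fnmatchcase(action.lower(), a.lower()) for a in _as_list(statement.get("Action"))):
        return False
    if not any(fnmatch.fnmatchcase(resource_arn, r) for r in _as_list(statement.get("Resource"))):
        return False
    expected = statement.get("Condition", {}).get("StringEquals", {}).get("s3:ResourceAccount")
    return expected is None or resource_account in _as_list(expected)


def is_allowed(policy, action, resource_arn, resource_account):
    """Rough check: does any Allow statement cover this S3 request? Principal is ignored."""
    return any(
        _statement_allows(s, action, resource_arn, resource_account)
        for s in _as_list(policy.get("Statement"))
    )


if __name__ == "__main__":
    merged = add_statement(BASE_POLICY, HEADLESS_STATEMENT)
    # Hypothetical object in one of the buckets above; 444455556666 is a placeholder owner.
    arn = "arn:aws:s3:::sagemakerheadlessexecution-prod-example/input.ipynb"
    print(is_allowed(BASE_POLICY, "s3:GetObject", arn, "444455556666"))  # False
    print(is_allowed(merged, "s3:GetObject", arn, "444455556666"))  # True
```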

Resource tag exceptions

Make sure that your IAM policy includes the following permissions:

{
  "Effect": "Allow",
  "Action": [
    "events:TagResource",
    "events:DeleteRule",
    "events:PutTargets",
    "events:DescribeRule",
    "events:PutRule",
    "events:RemoveTargets",
    "events:DisableRule",
    "events:EnableRule"
  ],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
    }
  }
}
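The condition in this statement means that the listed EventBridge actions are allowed only on rules that carry the tag sagemaker:is-scheduling-notebook-job with the exact value true. If a rule lacks that tag, or the value differs (StringEquals is case-sensitive), this statement does not allow the request, and without another matching Allow statement the call fails with AccessDenied. The following Python sketch models only that tag check for a single statement; it does not model how EventBridge supplies tags when a rule is first created, and allows is a hypothetical helper name.

```python
TAG_KEY = "sagemaker:is-scheduling-notebook-job"

# The statement above.
TAG_STATEMENT = {
    "Effect": "Allow",
    "Action": [
        "events:TagResource",
        "events:DeleteRule",
        "events:PutTargets",
        "events:DescribeRule",
        "events:PutRule",
        "events:RemoveTargets",
        "events:DisableRule",
        "events:EnableRule",
    ],
    "Resource": "*",
    "Condition": {"StringEquals": {f"aws:ResourceTag/{TAG_KEY}": "true"}},
}


def _as_list(value):
    """IAM accepts a single value or a list in most elements; normalize to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, (str, dict)) else list(value)


def allows(statement: dict, action: str, resource_tags: dict) -> bool:
    """Rough check of whether the statement allows an action on a rule with the given tags."""
    if statement.get("Effect") != "Allow":
        return False
    if action.lower() not in (a.lower() for a in _as_list(statement.get("Action"))):
        return False
    conditions = statement.get("Condition", {}).get("StringEquals", {})
    for key, expected in conditions.items():
        if key.startswith("aws:ResourceTag/"):
            tag_key = key.split("/", 1)[1]
            # A missing tag (None) or a differently cased value does not match.
            if resource_tags.get(tag_key) not in _as_list(expected):
                return False
    return True


if __name__ == "__main__":
    print(allows(TAG_STATEMENT, "events:PutRule", {TAG_KEY: "true"}))  # True
    print(allows(TAG_STATEMENT, "events:PutRule", {}))  # False
```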

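UI errors when you try to update a job

You might get a UI error when you try to create, describe, update, stop, or delete a notebook job. The same issue can also occur with job definitions (scheduled jobs). To troubleshoot this issue, first note the error message that appears in the UI. The message often includes instructions or a recommended action to resolve the issue.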

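If you can't resolve the error, then complete the following steps:

  1. Take a screenshot of the error, and then save it as an image file.
  2. Create an HTTP Archive (HAR) file to capture the network traffic when the UI error occurs.
  3. Open the Jupyter server terminal in SageMaker Studio. Choose File, New, Terminal.
  4. Check the logs in /var/log/apps/app_container.log for exceptions, errors, or warnings that occurred at the time of the UI error.
  5. Contact AWS Support through the AWS Support Center. In your request, attach the screenshot of the error, the app_container.log file, and the HAR file.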
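Before you open a case, a quick scan of the log from step 4 and the HAR file from step 2 can surface the failing call. The following Python sketch, assuming Python 3 is available where you run it, prints log lines that mention an exception, error, warning, or traceback, and lists HAR entries whose response status is 0 (often an aborted or blocked request) or 400 and above. The keyword filter is rough and does not parse timestamps, so you still need to match lines to the time of the UI error yourself. The script name check.py and the helper names are hypothetical.

```python
import json
import re
import sys
from pathlib import Path

# Log file from step 4.
LOG_PATH = Path("/var/log/apps/app_container.log")

# Rough keyword filter; also matches names such as "ValueError" and "WARNING".
PROBLEM_PATTERN = re.compile(r"exception|error|warn|traceback", re.IGNORECASE)


def find_problem_lines(lines):
    """Return (line_number, text) pairs for lines that mention an exception, error, or warning."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PROBLEM_PATTERN.search(line)
    ]


def failed_requests(har: dict) -> list:
    """Return (status, method, url) for HAR entries with status 0 or >= 400."""
    failures = []
    for entry in har.get("log", {}).get("entries", []):
        status = entry.get("response", {}).get("status", 0)
        if status == 0 or status >= 400:
            request = entry.get("request", {})
            failures.append((status, request.get("method", "?"), request.get("url", "?")))
    return failures


if __name__ == "__main__":
    if LOG_PATH.exists():
        with LOG_PATH.open(encoding="utf-8", errors="replace") as log:
            for number, text in find_problem_lines(log):
                print(f"{LOG_PATH}:{number}: {text}")
    # Optional: pass HAR files as arguments, for example: python check.py studio.har
    for har_path in sys.argv[1:]:
        har = json.loads(Path(har_path).read_text(encoding="utf-8"))
        for status, method, url in failed_requests(har):
            print(f"{har_path}: {status} {method} {url}")
```

The HAR file is usually saved from the browser on your own computer, so you can run the HAR part there and the log part in the Studio terminal.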
AWS OFFICIAL
Updated 6 months ago