How can I troubleshoot scheduled notebook jobs in SageMaker Studio?


When I run a scheduled notebook job in Amazon SageMaker Studio, I get an error.

Short description

Two common errors can prevent a scheduled notebook job from running in SageMaker Studio:

  • AccessDenied errors
  • UI errors when you try to update a job

Resolution

AccessDenied errors

AccessDenied errors most commonly involve the following issues:

  • AWS Identity and Access Management (IAM) policies
  • Virtual private cloud (VPC) endpoint policies
  • Resource tag exceptions

IAM policy issues

AccessDenied errors most commonly result from missing permissions. Therefore, follow the best practices for the IAM role that you use for the notebook job. The role needs the following base trust relationship:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "sagemaker.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "events.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
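
As a quick local check before you run the job, the following Python sketch inspects a trust policy document and reports which of the two service principals shown above is missing. It assumes that you already have the trust policy as a Python dictionary (for example, parsed from exported JSON). The function name and the sample document are hypothetical, and the check ignores conditions and Deny statements.

```python
# Hypothetical helper: not part of any AWS SDK.
REQUIRED_SERVICES = {"sagemaker.amazonaws.com", "events.amazonaws.com"}


def missing_trusted_services(trust_policy: dict) -> set:
    """Return the required service principals that no Allow statement
    lets assume the role. Conditions and Deny statements are ignored."""
    trusted = set()
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRole" not in actions:
            continue
        principal = stmt.get("Principal", {})
        services = principal.get("Service", []) if isinstance(principal, dict) else []
        if isinstance(services, str):
            services = [services]
        trusted.update(services)
    return REQUIRED_SERVICES - trusted


# Sample trust policy (an assumption for illustration) that trusts only SageMaker.
sample_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(sorted(missing_trusted_services(sample_trust_policy)))
```

An empty result means that both principals are trusted. Any name in the result points to a statement that you might need to add to the trust relationship.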

Also, verify that your IAM role has the following permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::*:role/*",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": [
            "sagemaker.amazonaws.com",
            "events.amazonaws.com"
          ]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "events:TagResource",
        "events:DeleteRule",
        "events:PutTargets",
        "events:DescribeRule",
        "events:PutRule",
        "events:RemoveTargets",
        "events:DisableRule",
        "events:EnableRule"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket",
        "s3:PutBucketVersioning",
        "s3:PutEncryptionConfiguration"
      ],
      "Resource": "arn:aws:s3:::sagemaker-automated-execution-*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:ListTags"
      ],
      "Resource": [
        "arn:aws:sagemaker:*:*:user-profile/*",
        "arn:aws:sagemaker:*:*:space/*",
        "arn:aws:sagemaker:*:*:training-job/*",
        "arn:aws:sagemaker:*:*:pipeline/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:AddTags"
      ],
      "Resource": [
        "arn:aws:sagemaker:*:*:training-job/*",
        "arn:aws:sagemaker:*:*:pipeline/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateNetworkInterface",
        "ec2:CreateNetworkInterfacePermission",
        "ec2:CreateVpcEndpoint",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteNetworkInterfacePermission",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeVpcEndpoints",
        "ec2:DescribeVpcs",
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetAuthorizationToken",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetEncryptionConfiguration",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:GetObject",
        "sagemaker:DescribeDomain",
        "sagemaker:DescribeUserProfile",
        "sagemaker:DescribeSpace",
        "sagemaker:DescribeStudioLifecycleConfig",
        "sagemaker:DescribeImageVersion",
        "sagemaker:DescribeAppImageConfig",
        "sagemaker:CreateTrainingJob",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:StopTrainingJob",
        "sagemaker:Search",
        "sagemaker:CreatePipeline",
        "sagemaker:DescribePipeline",
        "sagemaker:DeletePipeline",
        "sagemaker:StartPipelineExecution"
      ],
      "Resource": "*"
    }
  ]
}

For more information, see AWS managed policies for SageMaker notebooks.
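
To compare an existing policy against the action list above, a rough sketch like the following can help. It assumes that the policy is available as a Python dictionary, and it matches only on action names (including `*` wildcards). It doesn't evaluate Resource, Condition, or Deny statements, so treat the result as a hint and not as a full IAM evaluation. The function names are hypothetical.

```python
from fnmatch import fnmatchcase


def allowed_action_patterns(policy: dict) -> list:
    """Collect the Action entries of every Allow statement, lowercased."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    patterns = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        patterns.extend(a.lower() for a in actions)
    return patterns


def missing_actions(policy: dict, required: list) -> list:
    """Return the required actions that no Allow statement names.
    IAM action names are case insensitive, so compare in lowercase."""
    patterns = allowed_action_patterns(policy)
    return [
        action
        for action in required
        if not any(fnmatchcase(action.lower(), p) for p in patterns)
    ]


# Sample policy (an assumption for illustration) that lacks a pipeline action.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "events:*", "Resource": "*"},
        {"Effect": "Allow", "Action": ["sagemaker:CreateTrainingJob"], "Resource": "*"},
    ],
}

required = [
    "events:PutRule",
    "sagemaker:CreateTrainingJob",
    "sagemaker:StartPipelineExecution",
]
print(missing_actions(sample_policy, required))
```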

VPC endpoint issues

If you initiate the notebook job through a VPC endpoint, then check the endpoint's configuration and policy. Make sure that you follow the steps and best practices for the relevant service endpoint.

For Amazon S3 VPC endpoints, the most common error relates to an endpoint that's restricted to a single account. For example, the following policy restricts access to an account with the ID 111122223333:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowSpecificAccountsPermission",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "s3:*",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "s3:ResourceAccount": "111122223333"
        }
      }
    }
  ]
}

In this case, you must also allow access to the following buckets for the user's actions:

{
  "Action": [
    "s3:*"
  ],
  "Resource": [
    "arn:aws:s3:::sagemakerheadlessexecution-prod-*",
    "arn:aws:s3:::sagemakerheadlessexecution-prod-*/*"
  ],
  "Effect": "Allow",
  "Sid": "SCTASK14554266"
}
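
One way to picture the combined result is to append the bucket statement to the endpoint policy's Statement list. The following Python sketch does that on in-memory dictionaries. The helper name is hypothetical, and the sketch only builds the merged document; it doesn't apply the policy to the endpoint.

```python
import copy
import json


def with_statement(policy: dict, statement: dict) -> dict:
    """Return a copy of the policy with the statement appended,
    unless a statement with the same Sid is already present."""
    merged = copy.deepcopy(policy)
    statements = merged.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    sid = statement.get("Sid")
    already_present = sid is not None and any(s.get("Sid") == sid for s in statements)
    if not already_present:
        statements = statements + [copy.deepcopy(statement)]
    merged["Statement"] = statements
    return merged


# The account-restricted endpoint policy from the text.
endpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSpecificAccountsPermission",
            "Effect": "Allow",
            "Principal": {"AWS": "*"},
            "Action": "s3:*",
            "Resource": "*",
            "Condition": {"StringEquals": {"s3:ResourceAccount": "111122223333"}},
        }
    ],
}

# The additional bucket statement, copied from the text as-is.
# Add a Principal element if your endpoint policy requires one.
bucket_statement = {
    "Action": ["s3:*"],
    "Resource": [
        "arn:aws:s3:::sagemakerheadlessexecution-prod-*",
        "arn:aws:s3:::sagemakerheadlessexecution-prod-*/*",
    ],
    "Effect": "Allow",
    "Sid": "SCTASK14554266",
}

merged_policy = with_statement(endpoint_policy, bucket_statement)
print(json.dumps(merged_policy, indent=2))
```

The helper copies its input, so the original policy stays unchanged and you can compare the two documents before you update the endpoint.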

Resource tag exceptions

Make sure that your IAM policy has the following permissions:

{
  "Effect": "Allow",
  "Action": [
    "events:TagResource",
    "events:DeleteRule",
    "events:PutTargets",
    "events:DescribeRule",
    "events:PutRule",
    "events:RemoveTargets",
    "events:DisableRule",
    "events:EnableRule"
  ],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/sagemaker:is-scheduling-notebook-job": "true"
    }
  }
}
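
The StringEquals condition in this statement matches only when the resource carries the tag sagemaker:is-scheduling-notebook-job with the exact value "true". The following Python sketch checks a tag list for that pair. It assumes that the tags are a list of Key/Value dictionaries, which is the shape that the Amazon EventBridge ListTagsForResource response uses. The function name and sample tags are hypothetical.

```python
SCHEDULING_TAG_KEY = "sagemaker:is-scheduling-notebook-job"


def has_scheduling_tag(tags: list) -> bool:
    """Return True if the tag list contains the scheduling tag with the
    exact value "true". StringEquals is case sensitive, so "True" fails."""
    return any(
        tag.get("Key") == SCHEDULING_TAG_KEY and tag.get("Value") == "true"
        for tag in tags
    )


# Sample tag lists (assumptions for illustration).
tagged_rule = [
    {"Key": "team", "Value": "ml-platform"},
    {"Key": "sagemaker:is-scheduling-notebook-job", "Value": "true"},
]
mistagged_rule = [
    {"Key": "sagemaker:is-scheduling-notebook-job", "Value": "True"},
]

print(has_scheduling_tag(tagged_rule))
print(has_scheduling_tag(mistagged_rule))
```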

UI errors when you try to update a job

You might encounter a UI error when you try to create, describe, update, stop, or delete a notebook job. You might also encounter this issue with job definitions (scheduled jobs). To troubleshoot this, first note the error message that appears in the UI. This message often contains directions or suggested actions to resolve the issue.

If you can't resolve the error, then complete the following steps:

  1. Take a screenshot of the error, and then save it as an image file.
  2. Create an HTTP Archive (HAR) file that captures the network traffic when the UI error occurs.
  3. Open SageMaker Studio's Jupyter server terminal. To do this, choose File, New, Terminal.
  4. Check the logs in /var/log/apps/app_container.log for exceptions, errors, or warnings at the time of the UI error.
  5. Contact AWS Support through the AWS Support Center. In your request, attach the error screenshot, the app_container.log, and the HAR file.
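
For step 4, a small Python sketch like the following can narrow down the log before you read it in full. The format of app_container.log isn't described here, so the sketch makes no assumption about timestamps or fields and only filters lines by keyword. The function name and keyword list are hypothetical.

```python
import re
from pathlib import Path

# Case-insensitive substring match, so "KeyError" and "WARN" also count.
KEYWORDS = re.compile(r"exception|error|warn|traceback", re.IGNORECASE)


def suspicious_lines(lines):
    """Return (line_number, text) pairs for lines that mention an
    exception, error, warning, or traceback."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if KEYWORDS.search(line)
    ]


if __name__ == "__main__":
    log_path = Path("/var/log/apps/app_container.log")
    if log_path.exists():
        with log_path.open(errors="replace") as log_file:
            for number, text in suspicious_lines(log_file):
                print(f"{number}: {text}")
    else:
        print(f"{log_path} not found; run this in the Jupyter server terminal.")
```

Compare the line numbers in the output with the time that the UI error occurred, and then include the surrounding lines when you contact AWS Support.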

AWS OFFICIAL
Updated 6 months ago
3 Comments

Also having issues with custom images in notebook jobs. I get an error running the update-domain call because in the example there is no sample ImageName or AppImageConfigName, can you please clarify what these values should be? Can these be adjusted via console? Do we have to create a new image version for an existing image after applying? Also I'm unable to find the ARN in /opt/.sagemakerinternal/internal-metadata.json

AR
replied 9 months ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
replied 9 months ago

Hello @AR for specific use case scenarios like a custom image I would recommend reaching out to AWS support for further details and elaborations on this

AWS
SUPPORT ENGINEER
replied 8 months ago