SageMaker Studio in VPC-Only mode not able to start notebook instances

Hello,

I am trying to set up a SageMaker Studio domain with AWS IAM Identity Center (SSO) and VPC-Only mode. I have created all the security groups and VPC endpoints per the documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-and-internet-access.html I have opened TCP traffic on ports 2049, 443, and 8192-65535 between the VPC endpoint security group and the SageMaker domain security group. I also tried changing the security group ingress and egress rules to allow "All Traffic" across "All Ports".

I see 2 visible problems.

  1. When I launch SageMaker Studio from the domain/user profile, it takes 20-30 minutes or more to launch. The CloudWatch logs repeatedly show the errors below:
     Error: UnknownEndpoint: Inaccessible host: api.sagemaker.us-west-2.amazonaws.com'. This service may not be available in the us-west-2' region
     Error: UnknownEndpoint: Inaccessible host: jumpstart-cache-prod-us-west-2.s3.us-west-2.amazonaws.com'. This service may not be available in the us-west-2' region.

  2. Eventually Studio launches, but when I create a new notebook, the notebook kernel fails to start with the error below:
     Had error starting kernel 2 times. Responded with error: Connection timeout between notebook and kernel.

I raised an AWS support ticket too, but the first support agent was not able to help after looking at the whole setup. The ticket has been escalated to the next level, but meanwhile I was wondering whether anyone has faced the same issue in us-west-2?

2 Answers

Hi, I agree with your diagnosis: your 2 messages (SageMaker and S3) about "Inaccessible host" suggest that the VPC endpoint security group(s) are too restrictive and do not allow a TCP session to be established toward those service endpoints.

So it's odd that "Allow All Traffic" does not improve the situation. Did you check that your service endpoints are active in your VPC?
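One way to check the endpoint status without clicking through the console is sketched below. It is a minimal example, not a definitive procedure: it assumes boto3 is installed and credentials are configured, the function names (`summarize_endpoints`, `not_ready`, `fetch_endpoints`) are hypothetical, and the state comparison is done case-insensitively because the exact casing of the returned `State` value is not assumed here.

```python
from typing import Any


def summarize_endpoints(response: dict[str, Any]) -> list[tuple[str, str, bool]]:
    """Return (service name, state, private DNS enabled) for each VPC endpoint
    found in an EC2 DescribeVpcEndpoints response."""
    return [
        (ep["ServiceName"], ep["State"], ep.get("PrivateDnsEnabled", False))
        for ep in response.get("VpcEndpoints", [])
    ]


def not_ready(response: dict[str, Any]) -> list[str]:
    """Return the service names of endpoints whose state is not 'available'."""
    return [
        name
        for name, state, _ in summarize_endpoints(response)
        if state.lower() != "available"
    ]


def fetch_endpoints(vpc_id: str, region: str) -> dict[str, Any]:
    """Call DescribeVpcEndpoints for one VPC (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.describe_vpc_endpoints(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )
```

For example, `not_ready(fetch_endpoints("vpc-0123456789abcdef0", "us-west-2"))` (the VPC ID is a placeholder) would list any endpoints that are not yet available. The private DNS flag is included because, for interface endpoints, it normally determines whether the default service hostnames resolve to the endpoint inside the VPC.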

To better understand the issue at this stage, you can launch a (small) EC2 instance in your VPC, with a proper execution role to access the service endpoints, and try to telnet to the endpoints to see whether you get more details about the issue.

I personally use this telnet method when I get similar issues: https://netbeez.net/blog/telnet-to-test-connectivity-to-tcp/
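If telnet is not installed on the test instance, roughly the same TCP check can be done with the Python standard library. This is a minimal sketch; the helper name `can_connect` is hypothetical, and it only tells you whether a TCP session can be opened, not whether the service behind it responds correctly.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections, and timeouts.
        return False
```

Run from inside the VPC, `can_connect("api.sagemaker.us-west-2.amazonaws.com", 443)` tests the SageMaker API host named in the error message; a `False` result points at DNS resolution, security groups, or routing rather than at Studio itself.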

Best, Didier

AWS EXPERT
answered 10 months ago

I encountered this same scenario, and this is what was causing it for me.

TL;DR: Make sure you have the correct security group specified in your SageMaker domain's default user settings.

When you create a SageMaker domain, you can specify defaultUserSettings. One of these settings is a list of security group IDs. The security groups you include in this list are the ones SageMaker Studio will use to communicate with resources inside your VPC.

If you do not specify the security groups here, SageMaker will default to a security group it creates, which has a single outbound rule allowing NFS communication on port 2049 so that the SageMaker service can access the data in your EFS. Since it has no rules for communication anywhere else, it will fail to access the interface endpoints you may have set up, com.amazonaws.us-east-1.sagemaker.api being one of them.
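To see which security groups an existing domain is actually using, something like the following sketch may help. It assumes boto3 is installed and credentials are configured; the function names (`default_security_groups`, `fetch_domain`) are hypothetical, and the domain ID in the usage note is a placeholder.

```python
from typing import Any


def default_security_groups(domain: dict[str, Any]) -> list[str]:
    """Extract DefaultUserSettings.SecurityGroups from a SageMaker
    DescribeDomain response; returns an empty list if none are set."""
    return list(domain.get("DefaultUserSettings", {}).get("SecurityGroups", []))


def fetch_domain(domain_id: str, region: str) -> dict[str, Any]:
    """Call DescribeDomain for one domain (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    sagemaker = boto3.client("sagemaker", region_name=region)
    return sagemaker.describe_domain(DomainId=domain_id)
```

For example, `default_security_groups(fetch_domain("d-xxxxxxxxxxxx", "us-west-2"))` returning an empty list would suggest that no security groups were specified in the default user settings, which is the situation described above.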

Here is some CloudFormation showing which setting I am talking about:

"sagemakerDomain": {
   "Type": "AWS::SageMaker::Domain",
   "Properties": {
    "AppNetworkAccessType": "VpcOnly",
    "AuthMode": "IAM",
    "DefaultUserSettings": {
     "ExecutionRole": {
      "Fn::ImportValue": "sagemakerRoleArn"
     },
     "SecurityGroups": [
      {
       "Fn::GetAtt": [
        "defaultSagemakerSGF82111E2",
        "GroupId"
       ]
      }
     ]
    },
answered 6 months ago
