Why doesn't my Amazon SageMaker Studio notebook in VPC only mode connect with my KernelGateway app?

I'm having connectivity issues between my Amazon SageMaker Studio notebook in VPC only mode and my KernelGateway app.

Short description

You might get one of the following errors when you use SageMaker Studio in VPC only mode and can't launch the KernelGateway app.

In some cases, you can launch SageMaker Studio, but your kernel fails with the following error:

SageMaker Studio is unable to connect KernelGateway App. In VPCOnly mode, please ensure that security groups allow TCP traffic within the security group

Usually, you get this error because the security group doesn't have a self-referencing rule that allows connectivity between the instances within your SageMaker domain.

In other cases, you can launch SageMaker Studio, but it takes a long time to load, and the kernel fails to launch with the following error:

Failed to start kernel
Failed to launch app [None]. SageMaker Studio is unable to reach SageMaker endpoint. Please ensure your VPC has connectivity to SageMaker via Internet or VPC Endpoint. If you are using VPC Endpoints, please ensure Security Groups allows traffic between Studio and VPC endpoints.

This error occurs when your VPC-only domain can't connect to the internet or Amazon Virtual Private Cloud (Amazon VPC) endpoints. This might be due to several reasons, such as the following:

  • Security groups aren't configured correctly.
  • Your subnet doesn't have the correct VPC endpoints.
  • Your domain is connected to a private subnet, and no active NAT gateway is added to your route table.
  • You set up your SageMaker Studio to connect to public subnets.

Resolution

Be sure that the security groups for SageMaker Studio include the required rules

Be sure that Network File System (NFS) traffic between the domain and the Amazon Elastic File System (Amazon EFS) volume is allowed over TCP on port 2049. Your SageMaker Studio data is stored on Amazon EFS. Therefore, you must have rules that allow inbound and outbound connections for storage purposes.

To allow inbound traffic from Amazon EFS to your resources, do the following:

  1. Open the Amazon VPC console.
  2. In the navigation pane, choose Security Groups.
  3. Select the security group that you want to update.
  4. Choose Actions, and then choose Edit inbound rules.
  5. Choose Add rule, and do the following:
    For Type, choose NFS.
    For Source, choose Custom, and then enter the ID of the security group that's associated with the Amazon EFS volume's mount targets.
  6. Choose Save rules.
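
As a rough sketch of the same rule outside the console, the following Python helper builds the inbound permission in the shape that the Amazon EC2 API uses for security group rules. It assumes that the rule source is a security group ID (for example, the group that's attached to the Amazon EFS mount targets). The function name and the ID shown are hypothetical placeholders.

```python
def nfs_ingress_permission(source_group_id):
    """Return one inbound rule: TCP port 2049 (NFS) from a source security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 2049,
        "ToPort": 2049,
        # Assumption: the source is a security group, not a CIDR range.
        "UserIdGroupPairs": [{"GroupId": source_group_id}],
    }


# Hypothetical security group ID, for illustration only.
rule = nfs_ingress_permission("sg-0123456789abcdef0")
print(rule)
```

If you use boto3, a dictionary like this can be passed in the IpPermissions list of the EC2 client's authorize_security_group_ingress call, together with the GroupId of the security group that you're updating.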

You must allow TCP traffic within the security group to allow connectivity between the JupyterServer and KernelGateway apps. Because you created the Studio domain in VPC only mode, you must specify at least one security group for your SageMaker Studio domain resources. This security group must allow inbound TCP traffic from itself on at least ports 8192-65535, and all outbound traffic to 0.0.0.0/0.

To allow connectivity between the JupyterServer and KernelGateway apps, do the following:

  1. Open the Amazon VPC console.
  2. In the navigation pane, choose Security Groups.
  3. Select the security group that you want to update.
  4. Choose Actions, and then choose Edit inbound rules.
  5. Choose Add rule, and do the following:
    For Type, choose Custom TCP.
    For Port range, enter 8192-65535.
    For Source, choose Custom, and then enter the security group ID of the security group that you're editing.
  6. Choose Save rules.
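
To confirm that the rule exists, one option is to inspect the output of aws ec2 describe-security-groups. The following Python sketch checks a single security group entry from that output for a self-referencing rule that covers ports 8192-65535. The function name and sample data are hypothetical, and the check is simplified: it looks for one rule that covers the whole range and doesn't combine several narrower rules.

```python
def allows_self_tcp(group, low=8192, high=65535):
    """Return True if one inbound rule lets the group reach itself on TCP low-high."""
    group_id = group["GroupId"]
    for perm in group.get("IpPermissions", []):
        sources = perm.get("UserIdGroupPairs", [])
        if not any(pair.get("GroupId") == group_id for pair in sources):
            continue  # this rule's source isn't the group itself
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            return True  # "-1" means all protocols and all ports
        if (
            protocol == "tcp"
            and perm.get("FromPort", 65536) <= low
            and perm.get("ToPort", -1) >= high
        ):
            return True
    return False


# Hypothetical entry shaped like one item of "SecurityGroups" in the CLI output.
example_group = {
    "GroupId": "sg-0aaa1111bbbb2222c",
    "IpPermissions": [
        {
            "IpProtocol": "tcp",
            "FromPort": 8192,
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": "sg-0aaa1111bbbb2222c"}],
        }
    ],
}
print(allows_self_tcp(example_group))
```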

When you access a resource in your Amazon VPC from your SageMaker Studio notebook, traffic from the SageMaker service account is routed through your elastic network interface. Note that both JupyterServer and KernelGateway apps are in your SageMaker service account VPC. They communicate with each other through the elastic network interfaces that are attached to your VPC. Although these apps are part of the SageMaker Studio domain service account, they run on different Amazon Elastic Compute Cloud (Amazon EC2) instances. These apps use the ephemeral ports to establish a connection between each other. There is no specific port over which these apps connect. Therefore, it's a best practice to allow all the TCP ports to be open in self-referencing security groups. For more information, see Dive deep into Amazon SageMaker Studio notebooks architecture.

Be sure that you created the required VPC endpoints

If your SageMaker Studio resources don't require access to the internet, then you don't need to add a NAT gateway. However, the following endpoints are required to run Studio notebooks:

  • SageMaker API: com.amazonaws.<aws-region>.sagemaker.api
  • SageMaker runtime: com.amazonaws.<aws-region>.sagemaker.runtime

Be sure to create the following endpoints to access Amazon Simple Storage Service (Amazon S3) and Project templates:

  • For Amazon S3: com.amazonaws.<aws-region>.s3
  • For Amazon SageMaker Project templates: com.amazonaws.<aws-region>.servicecatalog
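
To see which of these endpoints are missing, you can compare the service names listed above with the output of aws ec2 describe-vpc-endpoints. The following Python sketch illustrates the comparison. The function names and the sample response are hypothetical, and the response is trimmed to the fields that the check reads.

```python
# Service name suffixes taken from the lists above.
REQUIRED_SUFFIXES = ("sagemaker.api", "sagemaker.runtime")
OPTIONAL_SUFFIXES = ("s3", "servicecatalog")


def endpoint_service_names(region, include_optional=True):
    """Build the full VPC endpoint service names for a Region."""
    suffixes = REQUIRED_SUFFIXES + (OPTIONAL_SUFFIXES if include_optional else ())
    return [f"com.amazonaws.{region}.{suffix}" for suffix in suffixes]


def missing_endpoint_services(region, describe_vpc_endpoints_response):
    """Return the expected service names that have no endpoint in the response."""
    present = {
        endpoint.get("ServiceName")
        for endpoint in describe_vpc_endpoints_response.get("VpcEndpoints", [])
    }
    return [name for name in endpoint_service_names(region) if name not in present]


# Hypothetical, trimmed describe-vpc-endpoints output.
sample_response = {
    "VpcEndpoints": [
        {"ServiceName": "com.amazonaws.us-east-1.sagemaker.api"},
        {"ServiceName": "com.amazonaws.us-east-1.s3"},
    ]
}
print(missing_endpoint_services("us-east-1", sample_response))
```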

Be sure to associate the security groups for your VPC with these VPC endpoints by doing the following:

  1. Open the Amazon VPC console.
  2. In the navigation pane, choose Endpoints.
  3. Choose the endpoint that you want to update.
  4. Choose Actions, and then choose Manage security groups.
  5. Select the security group that must be associated with this endpoint.
  6. Choose Save.
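
The association can also be checked from the output of aws ec2 describe-vpc-endpoints, where each endpoint lists its security groups under Groups. The following Python sketch returns the endpoints that don't have a given security group attached. The function name, IDs, and sample data are hypothetical. Gateway endpoints are skipped on the assumption that they don't use security groups.

```python
def endpoints_missing_group(describe_vpc_endpoints_response, security_group_id):
    """Return IDs of non-gateway endpoints that lack the given security group."""
    missing = []
    for endpoint in describe_vpc_endpoints_response.get("VpcEndpoints", []):
        if endpoint.get("VpcEndpointType") == "Gateway":
            continue  # gateway endpoints aren't associated with security groups
        attached = {group.get("GroupId") for group in endpoint.get("Groups", [])}
        if security_group_id not in attached:
            missing.append(endpoint.get("VpcEndpointId"))
    return missing


# Hypothetical, trimmed describe-vpc-endpoints output.
sample_endpoints = {
    "VpcEndpoints": [
        {
            "VpcEndpointId": "vpce-0aaa",
            "VpcEndpointType": "Interface",
            "Groups": [{"GroupId": "sg-0studio"}],
        },
        {
            "VpcEndpointId": "vpce-0bbb",
            "VpcEndpointType": "Interface",
            "Groups": [{"GroupId": "sg-0other"}],
        },
        {"VpcEndpointId": "vpce-0ccc", "VpcEndpointType": "Gateway", "Groups": []},
    ]
}
print(endpoints_missing_group(sample_endpoints, "sg-0studio"))
```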

Be sure to use a NAT gateway if you need internet connectivity

If your SageMaker Studio resources require access to the internet, first be sure that your SageMaker Studio domain is set up to connect to private subnets. Then, create a NAT gateway, and add a route to the NAT gateway in your private subnet's route table. For more information, see How do I set up a NAT gateway for a private subnet in Amazon VPC? Note that a SageMaker Studio domain in VPC only mode that's connected to a public subnet can't connect to the internet.
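
To check where a subnet's default route points, you can look at the Routes list in the output of aws ec2 describe-route-tables. The following Python sketch classifies the 0.0.0.0/0 route of one route table entry. The function name, labels, and sample data are hypothetical, and only IPv4 default routes are considered.

```python
def default_route_target(route_table):
    """Classify the IPv4 default route (0.0.0.0/0) of one route table entry."""
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State") != "active":
            return "inactive"  # for example, a blackhole route to a deleted NAT gateway
        if route.get("NatGatewayId"):
            return "nat-gateway"  # expected for a private subnet with internet access
        if str(route.get("GatewayId", "")).startswith("igw-"):
            return "internet-gateway"  # indicates a public subnet
        return "other"
    return "none"


# Hypothetical entry shaped like one item of "RouteTables" in the CLI output.
private_table = {
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
        {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc", "State": "active"},
    ]
}
print(default_route_target(private_table))
```

In this sketch, a result of "nat-gateway" matches the setup described above, while "internet-gateway", "inactive", or "none" suggests that the route table needs attention.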

Be sure that the network requirements for your VPC are met

If you launched your SageMaker Studio in VPC only mode, then be sure that your VPC meets the following requirements:

  • Subnets must have enough available IP addresses for the instance.
  • To allow internet access, be sure to associate your SageMaker domain with a private subnet during domain creation. Also, use a NAT gateway for internet access.
  • If you're using a VPC endpoint for running SageMaker APIs, then make sure that the attributes Enable DNS hostnames and Enable DNS Support are set to true for your VPC. This is required for your VPC to connect to the SageMaker API endpoint when starting up the kernel.
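
The two DNS attributes can be read with aws ec2 describe-vpc-attribute, which returns one attribute per call. The following Python sketch combines two such responses into a single check. The function name and the sample responses are hypothetical.

```python
def dns_attributes_enabled(dns_support_response, dns_hostnames_response):
    """Return True only if both DNS attributes are set to true for the VPC."""
    support = dns_support_response.get("EnableDnsSupport", {}).get("Value", False)
    hostnames = dns_hostnames_response.get("EnableDnsHostnames", {}).get("Value", False)
    return bool(support) and bool(hostnames)


# Hypothetical responses, one per describe-vpc-attribute call
# (--attribute enableDnsSupport and --attribute enableDnsHostnames).
support_response = {"VpcId": "vpc-0abc", "EnableDnsSupport": {"Value": True}}
hostnames_response = {"VpcId": "vpc-0abc", "EnableDnsHostnames": {"Value": False}}
print(dns_attributes_enabled(support_response, hostnames_response))
```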

You can use AWS Command Line Interface (AWS CLI) commands to make sure that the correct security groups are attached to the domain. To update your Studio domain's DefaultUserSettings to use the new security group, use the update-domain command:

aws sagemaker update-domain --domain-id <value> --default-user-settings SecurityGroups=<list>

You can also reconfigure the domain by recreating it with the necessary security groups attached. In the output of the describe-domain command, the SecurityGroups parameter lists all the security groups for the VPC that Studio uses for communication.

Note: Before you run the update-domain command, you must delete all the apps that have the InService status from your user profiles.
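
To find the apps that must be deleted first, you can filter the output of aws sagemaker list-apps by status. The following Python sketch shows one way to do that. The function name and the sample response are hypothetical, and the response is trimmed to the fields that the filter reads.

```python
def in_service_apps(list_apps_response):
    """Return (user profile, app type, app name) for every app that is InService."""
    return [
        (app.get("UserProfileName"), app.get("AppType"), app.get("AppName"))
        for app in list_apps_response.get("Apps", [])
        if app.get("Status") == "InService"
    ]


# Hypothetical, trimmed list-apps output.
sample_apps = {
    "Apps": [
        {"UserProfileName": "user-a", "AppType": "JupyterServer",
         "AppName": "default", "Status": "InService"},
        {"UserProfileName": "user-a", "AppType": "KernelGateway",
         "AppName": "example-kernel-app", "Status": "Deleted"},
    ]
}
for profile, app_type, app_name in in_service_apps(sample_apps):
    print(profile, app_type, app_name)
```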

After the update-domain command is successful, you can check your domain using the describe-domain command:

Example:

$ aws sagemaker describe-domain --domain-id d-xyzxyz
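
When you read the describe-domain output, the fields of interest here are AppNetworkAccessType and the SecurityGroups list under DefaultUserSettings. The following Python sketch checks both against the values that you expect. The function name, IDs, and sample output are hypothetical.

```python
def domain_problems(describe_domain_response, expected_group_ids):
    """Return a list of human-readable problems found in describe-domain output."""
    problems = []
    access_type = describe_domain_response.get("AppNetworkAccessType")
    if access_type != "VpcOnly":
        problems.append(f"AppNetworkAccessType is {access_type!r}, not 'VpcOnly'")
    settings = describe_domain_response.get("DefaultUserSettings", {})
    attached = set(settings.get("SecurityGroups", []))
    for group_id in expected_group_ids:
        if group_id not in attached:
            problems.append(f"security group {group_id} is not attached")
    return problems


# Hypothetical, trimmed describe-domain output.
sample_domain = {
    "DomainId": "d-xyzxyz",
    "AppNetworkAccessType": "VpcOnly",
    "DefaultUserSettings": {"SecurityGroups": ["sg-0studio"]},
}
print(domain_problems(sample_domain, ["sg-0studio", "sg-0efs"]))
```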

Then, launch SageMaker Studio again and confirm that the notebook is starting up correctly. You can also test the internet connectivity by running !curl amazon.com from within a notebook cell.

Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.

Delete the JupyterServer app and start a new one for the updated settings to take effect. You can use your SageMaker Studio user profile after updating the Amazon VPC settings. For more information, see the Requirements to use VPC only mode section in Connect SageMaker Studio notebooks in a VPC to external resources.

Other considerations

If only one user experiences this issue, check whether the default app was launched before the VPC updates were completed. In this case, the default JupyterServer app isn't automatically updated to use the new VPC configuration, which results in connectivity issues. Also, check whether the default JupyterServer app was launched weeks or months ago. An app that has been running that long might have accumulated large log files and temp files. Try recreating the default app to free up space and to make sure that the app uses the updated VPC configuration.

The issue might also happen if SageMaker Studio users are configured with a different execution role. Be sure that the users' execution role permissions include the required policies. These policies must allow the execution role to run the DescribeApp action that's required to create Studio notebooks. After you update these permissions for the execution role, try to provision Studio notebooks in VPC only mode.
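
As a quick first check of an execution role's policy document, the following Python sketch looks for an Allow statement whose Action list matches sagemaker:DescribeApp, including wildcard entries such as sagemaker:Describe*. The function name and sample policy are hypothetical. This is a simplified assumption-laden check and not a full policy evaluation: it ignores Deny statements, NotAction, Resource, and Condition elements.

```python
from fnmatch import fnmatchcase


def allows_action(policy_document, action="sagemaker:DescribeApp"):
    """Return True if any Allow statement lists an action pattern matching `action`."""
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]  # a single statement may appear without a list
    wanted = action.lower()  # compare case-insensitively
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if any(fnmatchcase(wanted, pattern.lower()) for pattern in actions):
            return True
    return False


# Hypothetical policy document, for illustration only.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["sagemaker:CreateApp", "sagemaker:Describe*"],
         "Resource": "*"}
    ],
}
print(allows_action(sample_policy))
```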


AWS OFFICIAL
Updated a year ago