I want to troubleshoot the connection timeout error that I receive when I try to connect to an Amazon EMR cluster from my Amazon SageMaker Studio notebook.
Short description
You might receive ConnectTimeoutError because of network configuration issues with your Amazon Virtual Private Cloud (Amazon VPC), subnets, or security groups.
Prerequisites:
- You launched SageMaker Studio in VPC only mode.
- You launched the EMR cluster and SageMaker Studio notebook in the same VPC.
Note: If you used VPC peering to connect them from different VPCs, then you must manually configure the /etc/sparkmagic/config.json config file. The cluster's discovery functionality doesn't support connections across AWS Regions. For more information, see Build Amazon SageMaker notebooks backed by Spark in Amazon EMR.
- You launched the EMR cluster with Apache Spark and Apache Livy applications installed.
Resolution
Check SageMaker Studio and EMR cluster configurations
Take the following actions:
- Confirm that you correctly configured the security groups or network access control lists (network ACLs) to allow traffic on port 8998. Perform this check for both the SageMaker Studio notebook and EMR cluster.
- For notebooks that are run on SageMaker Studio Classic, you must allow NFS traffic over port 2049 between the domain and Amazon Elastic File System (Amazon EFS) volume.
- Be sure that the security group of the EMR cluster's primary node has an inbound custom TCP rule for port 8998. The rule's source can be either the SageMaker Studio security group or a CIDR range that includes the SageMaker Studio subnet.
Note: If you use VPC peering, then see Update your security groups to reference peer security groups.
If you have a VPC peering connection between Amazon EMR and SageMaker Studio subnets, then their route tables must route traffic to each other. Otherwise, you get the ConnectTimeoutError.
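The security group check above can be sketched in Python. This is a minimal illustration, not an official tool: the function name allows_livy and the sample rule are hypothetical, and the rule dictionaries are assumed to follow the IpPermissions shape that the EC2 DescribeSecurityGroups API returns.

```python
import ipaddress

LIVY_PORT = 8998  # default Livy port referenced in this article


def allows_livy(ip_permissions, source_cidr=None, source_group_id=None, port=LIVY_PORT):
    """Return True if any inbound rule permits TCP `port` from the given source.

    `ip_permissions` is assumed to look like the IpPermissions list of a
    security group as returned by EC2 DescribeSecurityGroups.
    """
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            port_ok = True  # "all traffic" rules carry no port range
        elif protocol == "tcp":
            port_ok = rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535)
        else:
            port_ok = False
        if not port_ok:
            continue

        # Source given as a security group reference
        if source_group_id and any(
            pair.get("GroupId") == source_group_id
            for pair in rule.get("UserIdGroupPairs", [])
        ):
            return True

        # Source given as an IPv4 CIDR that must sit inside an allowed range
        if source_cidr:
            source = ipaddress.ip_network(source_cidr)
            for ip_range in rule.get("IpRanges", []):
                if source.subnet_of(ipaddress.ip_network(ip_range["CidrIp"])):
                    return True
    return False


# Hypothetical inbound rules for the EMR primary node security group
example_rules = [
    {
        "IpProtocol": "tcp",
        "FromPort": 8998,
        "ToPort": 8998,
        "IpRanges": [{"CidrIp": "10.0.0.0/16"}],
        "UserIdGroupPairs": [],
    }
]
print(allows_livy(example_rules, source_cidr="10.0.1.0/24"))  # prints True
```

The sketch inspects only security group rules. Network ACLs and route tables still need a separate check.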
Create AWS PrivateLink endpoints
If you set up your private subnet in VPC only mode without a NAT gateway, then create AWS PrivateLink interface endpoints. For Amazon EMR, use com.amazonaws.example-region.elasticmapreduce. For AWS Security Token Service (AWS STS), use com.amazonaws.example-region.sts. Replace example-region with your AWS Region. Create the endpoints in the VPC that you use with the EMR cluster and SageMaker Studio.
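As a rough aid, the following Python sketch reports which of the two interface endpoints are still missing. It assumes that the service names follow the usual com.amazonaws.<region>.<service> pattern and that you already collected the ServiceName values of the VPC's existing endpoints, for example from the EC2 DescribeVpcEndpoints API. The function names are hypothetical.

```python
def required_endpoint_services(region):
    """Interface endpoint service names that this article calls for (assumed naming pattern)."""
    return [
        f"com.amazonaws.{region}.elasticmapreduce",  # Amazon EMR
        f"com.amazonaws.{region}.sts",               # AWS STS
    ]


def missing_endpoint_services(region, existing_service_names):
    """Return the required service names that are absent from the VPC's endpoints."""
    existing = set(existing_service_names)
    return [name for name in required_endpoint_services(region) if name not in existing]


# Hypothetical VPC in us-west-2 that has only an STS endpoint so far
print(missing_endpoint_services("us-west-2", ["com.amazonaws.us-west-2.sts"]))
```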
Resolve timeout errors
AWS STS is a global service. If you connect to an EMR cluster across AWS accounts from SageMaker Studio in a Region other than us-east-1, then the following error might occur:
"ConnectTimeoutError: Connect timeout on endpoint URL: "https://sts.amazonaws.com/""
To resolve this error, set the AWS_STS_REGIONAL_ENDPOINTS environment variable to regional within the Jupyter notebook:
%env AWS_STS_REGIONAL_ENDPOINTS=regional
%load_ext sagemaker_studio_analytics_extension.magics
Then, run the connect command:
%sm_analytics emr connect --cluster-id example-cluster-id --auth-type None --assumable-role-arn arn:aws:iam::example-cross-account:role/example-role-name
For more information on Regional endpoints, see Managing AWS STS in an AWS Region and AWS STS Regionalized endpoints.
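To illustrate why the setting matters, the following Python sketch models which STS endpoint a client would call. It is a simplified model, not the SDK's actual resolution logic. It assumes the commercial AWS partition, and it assumes that an SDK falls back to the global endpoint when the variable is unset, which might not hold for every Region or SDK version. The function name sts_endpoint_url is hypothetical.

```python
import os


def sts_endpoint_url(region, env=None):
    """Simplified model of the STS endpoint selected for `region`.

    Assumption: "regional" selects the Regional endpoint, and any other
    value (or no value) selects the global endpoint.
    """
    env = os.environ if env is None else env
    mode = env.get("AWS_STS_REGIONAL_ENDPOINTS", "legacy").lower()
    if mode == "regional":
        return f"https://sts.{region}.amazonaws.com/"
    return "https://sts.amazonaws.com/"


# Global endpoint, the URL that appears in the ConnectTimeoutError message
print(sts_endpoint_url("eu-west-1", env={}))
# Regional endpoint, which a Regional STS interface endpoint in the VPC can serve
print(sts_endpoint_url("eu-west-1", env={"AWS_STS_REGIONAL_ENDPOINTS": "regional"}))
```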
Verify the connection
Open your SageMaker Studio notebook, select the Sparkmagic kernel, and then run one of the following commands in a cell to check whether the connection works:
For connections within the same account:
%%local
!sm-sparkmagic connect --cluster-id example-cluster-id
For cross-account connections:
%%local
# If needed, use the STS Regional endpoint
%env AWS_STS_REGIONAL_ENDPOINTS=regional
!sm-sparkmagic connect --cluster-id example-cluster-id --role-arn arn:aws:iam::example-cross-account:role/example-role-name
Or, from the notebook terminal, run the following command against the private IP address of the Amazon EMR primary node:
curl example-EMR-Master-Private-IP:8998/sessions -v
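If curl isn't available, then a plain TCP check from Python can tell you whether port 8998 is reachable at all. This is a minimal sketch that uses only the standard library. The function name can_reach and the 5-second timeout are illustrative choices, and the check confirms only that the port accepts connections, not that Livy responds correctly.

```python
import socket


def can_reach(host, port=8998, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures
        return False


# Usage (replace the placeholder with the primary node's private IP address):
# print(can_reach("example-EMR-Master-Private-IP"))
```

A result of False after the full timeout usually points to security groups, network ACLs, or routing. An immediate False more often means that nothing is listening on the port.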
Perform the following checks on the EMR cluster to verify that Livy is configured correctly:
Use SSH to connect to the EMR cluster's primary node, and then run the following command to get the PID for Livy:
ps -ef | grep livy
Check the port that Livy runs on:
sudo netstat -anp | grep example-PID
Confirm that Livy is running on the default port 8998.
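If you save the netstat output, then the port check can be scripted. The following Python sketch assumes the usual Linux netstat -anp column layout, where the local address is the fourth column and the last column is PID/program name. The function name listening_ports, the sample output, and the PID 12345 are hypothetical.

```python
def listening_ports(netstat_output, pid):
    """Return the set of TCP ports on which process `pid` is listening."""
    ports = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local Foreign State PID/Program
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        local_address, state, owner = fields[3], fields[5], fields[6]
        if state != "LISTEN" or owner.split("/")[0] != str(pid):
            continue
        # The port is whatever follows the last colon (works for IPv4 and IPv6)
        ports.add(int(local_address.rsplit(":", 1)[1]))
    return ports


# Hypothetical output of: sudo netstat -anp
sample = """\
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1001/sshd
tcp6       0      0 :::8998                 :::*                    LISTEN      12345/java
tcp6       0      0 ::ffff:10.0.1.15:8998   ::ffff:10.0.1.20:51234  ESTABLISHED 12345/java
"""
print(listening_ports(sample, 12345))  # prints {8998}
```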
Related information
Create and manage Amazon EMR clusters from SageMaker Studio to run interactive Spark and ML workloads Part 1
Create and manage Amazon EMR clusters from SageMaker Studio to run interactive Spark and ML workloads Part 2