Metadata service is unstable: connection timeouts, "Failed to connect to service endpoint", etc.
Recently, our long-running job started hitting metadata issues frequently. The exceptions vary, but they all point to the EC2 metadata service: it either fails to connect to the endpoint, times out connecting to the service, or complains that I need to specify the region while building the client. The job runs on EMR 6.0.0 in Tokyo with the correct role set. It had been running fine for months and only recently became unstable.
So my question is: how can we monitor the health of the metadata service (request QPS, success rate, etc.)?
A few call stacks:
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain: [com.amazon.ws.emr.hadoop.fs.guice.UserGroupMappingAWSSessionCredentialsProvider@4a27ee0d: null, com.amazon.ws.emr.hadoop.fs.HadoopConfigurationAWSCredentialsProvider@76659c17: null, com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.auth.InstanceProfileCredentialsProvider@5c05c23d: Failed to connect to service endpoint: ]
    at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:136)

com.amazonaws.SdkClientException: Unable to find a region via the region provider chain. Must provide an explicit region in the builder or setup environment to supply a region.
    at com.amazonaws.client.builder.AwsClientBuilder.setRegion(AwsClientBuilder.java:462)
    at com.amazonaws.client.builder.AwsClientBuilder.configureMutableProperties(AwsClientBuilder.java:424)
    at com.amazonaws.client.builder.AwsSyncClientBuilder.build(AwsSyncClientBuilder.java:46)

com.amazonaws.SdkClientException: Unable to execute HTTP request: mybucket.s3.ap-northeast-1.amazonaws.com
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1189) ~[aws-java-sdk-bundle-1.11.711.jar:?]
Caused by: java.net.UnknownHostException: mybucket.s3.ap-northeast-1.amazonaws.com
    at java.net.InetAddress.getAllByName0(InetAddress.java:1281) ~[?:1.8.0_242]
    at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[?:1.8.0_242]
    at java.net.InetAddress.getAllByName(InetAddress.java:1127) ~[?:1.8.0_242]

com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Failed to connect to service endpoint:
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
As of today, there are no metrics available for monitoring EC2 IMDS, such as request QPS or success rate.
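Since no built-in metric exists, one possible workaround is to measure IMDS reachability from the instance itself. The following is only a minimal sketch, assuming IMDSv2 is reachable at the standard link-local address; the function names `probe_once`, `summarize`, and `run`, the sample count, and the interval are hypothetical choices, not anything provided by AWS. It reports the success rate and worst latency as seen by one instance, not the service-wide QPS.

```python
import http.client
import time
import urllib.request

# Standard link-local address of the EC2 instance metadata service.
IMDS_BASE = "http://169.254.169.254"


def probe_once(base_url=IMDS_BASE, timeout=1.0):
    """Do one IMDSv2 round trip; return (ok, latency_seconds, error)."""
    start = time.monotonic()
    try:
        # Step 1: request a short-lived session token (IMDSv2).
        token_req = urllib.request.Request(
            base_url + "/latest/api/token",
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        # Step 2: read a cheap metadata key using that token.
        data_req = urllib.request.Request(
            base_url + "/latest/meta-data/instance-id",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(data_req, timeout=timeout) as resp:
            resp.read()
        return True, time.monotonic() - start, None
    except (OSError, http.client.HTTPException) as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False, time.monotonic() - start, repr(exc)


def summarize(results):
    """Aggregate a list of probe_once() results into simple health numbers."""
    attempts = len(results)
    successes = sum(1 for ok, _, _ in results if ok)
    latencies = [latency for _, latency, _ in results]
    return {
        "attempts": attempts,
        "success_rate": successes / attempts if attempts else 0.0,
        "max_latency_s": max(latencies, default=0.0),
    }


def run(samples=10, interval_s=1.0, base_url=IMDS_BASE):
    """Probe `samples` times, sleeping `interval_s` between probes."""
    results = []
    for _ in range(samples):
        results.append(probe_once(base_url=base_url))
        time.sleep(interval_s)
    return summarize(results)
```

Calling `run()` periodically on a cluster node (for example from cron) and logging the returned dictionary would give a rough timeline to compare against the times the job failed. Keep the probe rate low so that the probe itself does not add meaningful load on the metadata service.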
Please review the application logs for each exception, because the exceptions listed can also be caused by other issues in the environment, for example network problems ("Failed to connect to service endpoint") or DNS problems ("Unable to execute HTTP request").
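To make that log review less manual, the exceptions can be counted by likely cause. This is a rough sketch only: the substrings are taken from the stack traces in the question, while the category labels and the helper names `classify` and `count_failures` are hypothetical. A match suggests where to look first and does not prove the root cause.

```python
from collections import Counter

# (label, substring) pairs; substrings come from the stack traces above,
# labels are illustrative names for the layer to investigate first.
PATTERNS = [
    ("dns", "UnknownHostException"),
    ("imds_connect", "Failed to connect to service endpoint"),
    ("credentials", "Unable to load AWS credentials"),
    ("region", "Unable to find a region"),
    ("connect_timeout", "connect timed out"),
]


def classify(line):
    """Return every label whose substring appears in the log line."""
    return [label for label, needle in PATTERNS if needle in line]


def count_failures(lines):
    """Count how often each category appears across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(classify(line))
    return counts
```

For example, `count_failures(open("application.log"))` (the file name is a placeholder) returns a `Counter` keyed by category. If `dns` dominates, the problem is more likely name resolution than IMDS; if `imds_connect` and `connect_timeout` dominate, the network path to the metadata endpoint is the first thing to check.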