Temporary connectivity issues from Win2019 EC2 Instance to metadata and other endpoints


We experience irregular, temporary connectivity issues on our Windows Server 2019 EC2 instances. These issues only occur on some instances (2 out of 12 machines) and are not specific to a subnet or security group: the two affected machines are in different subnets and security groups, and other machines in the same subnets/security groups do not experience the issues. The issues show up in various logs:

# CloudWatch Agent Log (Excerpt)

2022-11-03T11:34:17Z E! WriteToCloudWatch failure, err:  RequestError: send request failed
caused by: Post "https://monitoring.eu-central-1.amazonaws.com/": dial tcp 52.94.138.113:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2022-11-03T11:35:15Z E! cloudwatch: code: RequestError, message: send request failed, original error: Post "https://monitoring.eu-central-1.amazonaws.com/": dial tcp 52.94.138.199:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2022-11-03T11:35:15Z W! 4 retries, going to sleep 3.2s before retrying.
2022-11-03T11:35:18Z E! WriteToCloudWatch failure, err:  RequestError: send request failed
caused by: Post "https://monitoring.eu-central-1.amazonaws.com/": dial tcp 52.94.138.199:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2022-11-03T11:36:11Z W! [processors.ec2tagger] ec2tagger: Error refreshing EC2 tags, keeping old values : +RequestError: send request failed
caused by: Post "https://ec2.eu-central-1.amazonaws.com/": dial tcp 54.239.55.167:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2022-11-03T11:36:11Z W! [processors.ec2tagger] ec2tagger: Error refreshing EC2 tags, keeping old values : +RequestError: send request failed
caused by: Post "https://ec2.eu-central-1.amazonaws.com/": dial tcp 54.239.55.167:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2022-11-03T11:36:14Z E! cloudwatch: code: RequestError, message: send request failed, original error: Post "https://monitoring.eu-central-1.amazonaws.com/": dial tcp 52.94.136.226:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2022-11-03T11:36:14Z W! 5 retries, going to sleep 6.4s before retrying.
2022-11-03T11:36:20Z E! WriteToCloudWatch failure, err:  RequestError: send request failed
caused by: Post "https://monitoring.eu-central-1.amazonaws.com/": dial tcp 52.94.136.226:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2022-11-03T11:37:15Z E! cloudwatch: code: RequestError, message: send request failed, original error: Post "https://monitoring.eu-central-1.amazonaws.com/": dial tcp 52.94.136.211:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2022-11-03T11:37:15Z W! 6 retries, going to sleep 1m0s before retrying.
2022-11-03T11:38:14Z E! cloudwatch: code: RequestError, message: send request failed, original error: Post "https://monitoring.eu-central-1.amazonaws.com/": dial tcp 52.94.136.211:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2022-11-03T11:38:14Z W! 7 retries, going to sleep 1m0s before retrying.
2022-11-03T11:38:15Z E! WriteToCloudWatch failure, err:  RequestError: send request failed
caused by: Post "https://monitoring.eu-central-1.amazonaws.com/": dial tcp 52.94.136.211:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2022-11-03T11:39:14Z E! WriteToCloudWatch failure, err:  RequestError: send request failed
caused by: Post "https://monitoring.eu-central-1.amazonaws.com/": dial tcp 52.94.136.211:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

# SSM Agent Log (Excerpt)

2022-11-03 12:22:59 ERROR [HandleAwsError @ awserr.go.49] [ssm-agent-worker] [MessageService] [MDSInteractor] error when calling AWS APIs. error details - GetMessages Error: EC2RoleRequestError: no EC2 instance role found
caused by: RequestError: send request failed
caused by: Get "http://169.254.169.254/latest/meta-data/iam/security-credentials/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-11-03 12:23:02 ERROR [HandleAwsError @ awserr.go.49] [ssm-agent-worker] [MessageService] [MDSInteractor] error when calling AWS APIs. error details - GetMessages Error: EC2RoleRequestError: no EC2 instance role found
caused by: RequestError: send request failed
caused by: Get "http://169.254.169.254/latest/meta-data/iam/security-credentials/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-11-03 12:23:05 ERROR [HandleAwsError @ awserr.go.49] [ssm-agent-worker] [MessageService] [MDSInteractor] error when calling AWS APIs. error details - GetMessages Error: EC2RoleRequestError: no EC2 instance role found
caused by: RequestError: send request failed
caused by: Get "http://169.254.169.254/latest/meta-data/iam/security-credentials/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-11-03 12:23:08 ERROR [HandleAwsError @ awserr.go.49] [ssm-agent-worker] [MessageService] [MDSInteractor] error when calling AWS APIs. error details - GetMessages Error: EC2RoleRequestError: no EC2 instance role found
caused by: RequestError: send request failed
caused by: Get "http://169.254.169.254/latest/meta-data/iam/security-credentials/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-11-03 12:23:11 ERROR [HandleAwsError @ awserr.go.49] [ssm-agent-worker] [MessageService] [MDSInteractor] error when calling AWS APIs. error details - GetMessages Error: EC2RoleRequestError: no EC2 instance role found
caused by: RequestError: send request failed
caused by: Get "http://169.254.169.254/latest/meta-data/iam/security-credentials/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-11-03 12:23:11 ERROR [checkStopPolicy @ mdsinteractor.go.391] [ssm-agent-worker] [MessageService] [MDSInteractor] MDSInteractor stopped temporarily due to internal failure. We will retry automatically after 15 minutes
2022-11-03 12:33:05 ERROR [HandleAwsError @ awserr.go.49] [ssm-agent-worker] [HealthCheck] error when calling AWS APIs. error details - RequestError: send request failed
caused by: Post "https://ssm.eu-central-1.amazonaws.com/": dial tcp 52.119.188.195:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2022-11-03 12:33:05 ERROR [HandleAwsError @ awserr.go.49] [ssm-agent-worker] [HealthCheck] error when calling AWS APIs. error details - RequestError: send request failed
caused by: Post "https://ssm.eu-central-1.amazonaws.com/": dial tcp 52.119.188.195:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2022-11-03 12:38:07 ERROR [HandleAwsError @ awserr.go.49] [ssm-agent-worker] [HealthCheck] error when calling AWS APIs. error details - RequestError: send request failed
caused by: Post "https://ssm.eu-central-1.amazonaws.com/": dial tcp 52.119.188.195:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2022-11-03 12:38:07 ERROR [HandleAwsError @ awserr.go.49] [ssm-agent-worker] [HealthCheck] error when calling AWS APIs. error details - RequestError: send request failed
caused by: Post "https://ssm.eu-central-1.amazonaws.com/": dial tcp 52.119.188.195:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2022-11-03 12:38:43 ERROR [HandleAwsError @ awserr.go.49] [ssm-agent-worker] [MessageService] [Association] error when calling AWS APIs. error details - RequestError: send request failed
caused by: Post "https://ssm.eu-central-1.amazonaws.com/": dial tcp 52.119.190.128:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

We became aware of these connectivity issues because our CloudWatch monitoring had gaps in some of the metrics published by the CloudWatch agent. These gaps seem to occur randomly and last for about 90 minutes.

[Screenshot: CloudWatch metric showing one of the gaps]

I could not find any related problems in the Windows event log. Surprisingly, the Windows Docker workloads on the affected machines appear to run normally throughout.
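
In case it helps, below is a rough probe script we are considering running on an affected instance to timestamp the outage windows ourselves. This is only a minimal sketch, assuming Python is available on the instance; the endpoint list, interval, and timeouts are our own choices and not taken from any agent configuration.

```python
# probe_connectivity.py - periodically test IMDS and the regional endpoints
# and append failures to a log file.
# Sketch only: endpoint list, 30 s interval and timeouts are assumptions.
import socket
import time
import urllib.request
from datetime import datetime, timezone

IMDS_TOKEN_URL = "http://169.254.169.254/latest/api/token"
HTTPS_ENDPOINTS = [
    "monitoring.eu-central-1.amazonaws.com",
    "ec2.eu-central-1.amazonaws.com",
    "ssm.eu-central-1.amazonaws.com",
]

def check_imds(timeout=2):
    """Request an IMDSv2 token; returns True if the metadata service answers."""
    req = urllib.request.Request(
        IMDS_TOKEN_URL,
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except Exception:
        return False

def check_tcp(host, port=443, timeout=5):
    """Plain TCP connect to the endpoint, mirroring the agents' dial step."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    while True:
        ts = datetime.now(timezone.utc).isoformat()
        results = {"imds": check_imds()}
        results.update({h: check_tcp(h) for h in HTTPS_ENDPOINTS})
        failed = [name for name, ok in results.items() if not ok]
        line = f"{ts} FAIL {failed}" if failed else f"{ts} OK"
        with open("connectivity_probe.log", "a") as f:
            f.write(line + "\n")
        time.sleep(30)
```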

Has anyone experienced similar network connectivity problems? Any suggestions on how to further investigate the root cause of these issues?

asked a year ago · 291 views
1 Answer

In general, intermittent connectivity issues are often noticeable when an EC2 instance is running out of hardware resources due to a long-running task. What do the CPU and memory utilization look like when the network drop occurs? Open Windows Task Manager for a quick look the next time you notice a drop.
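
If catching a drop live in Task Manager is difficult, a small sampler can record utilization continuously so the values around a metric gap can be reviewed afterwards. A minimal sketch, assuming Python and the third-party psutil package are installed on the instance; the log path and one-minute interval are arbitrary choices:

```python
# resource_sampler.py - log CPU and memory utilization about once a minute so
# the values around a metric gap can be reviewed afterwards.
# Sketch only: requires the third-party "psutil" package (pip install psutil).
import time
from datetime import datetime, timezone

import psutil

LOG_FILE = "resource_sampler.log"   # assumed path

while True:
    cpu = psutil.cpu_percent(interval=1)      # % CPU averaged over a 1 s sample
    mem = psutil.virtual_memory().percent     # % physical memory in use
    ts = datetime.now(timezone.utc).isoformat()
    with open(LOG_FILE, "a") as f:
        f.write(f"{ts} cpu={cpu:.1f}% mem={mem:.1f}%\n")
    time.sleep(59)
```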

Also, please check whether you are able to ping the EC2 instance during this time.
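
For example, a simple watcher on another host in the same VPC can timestamp lost pings so the outage window can be compared with the agent logs. A rough sketch, assuming the watcher runs on Windows (ping flags "-n 1 -w 1000") and that INSTANCE_IP is replaced with the private IP of the affected instance:

```python
# ping_watch.py - run on another host in the VPC; timestamps lost pings so the
# outage window can be compared with the agent logs.
# Sketch only: INSTANCE_IP is a placeholder; ping flags assume Windows.
import subprocess
import time
from datetime import datetime, timezone

INSTANCE_IP = "10.0.0.10"   # placeholder: private IP of the affected instance

while True:
    result = subprocess.run(
        ["ping", "-n", "1", "-w", "1000", INSTANCE_IP],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode != 0:
        ts = datetime.now(timezone.utc).isoformat()
        with open("ping_watch.log", "a") as f:
            f.write(f"{ts} ping to {INSTANCE_IP} failed\n")
    time.sleep(5)
```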

For more information about diagnosing high resource utilization on EC2, refer to this article: https://aws.amazon.com/premiumsupport/knowledge-center/ec2-cpu-utilization-not-throttled/

AWS
SUPPORT ENGINEER
answered a year ago
  • Thank you for your suggestions. I had a look at the hardware resources and they did not show any bottlenecks; utilization is essentially the same before, during, and after a drop. Since our workloads keep running despite the drops, and so far we only lose two metrics in our monitoring during them, we have decided to accept the drops for now. We plan to switch to Windows Server 2022 with a different Docker setup in the next few months; hopefully that will also fix the gaps in the metrics.
