EC2 Instance Reachability Check Failure

I have an EC2 instance running AL2023, set up as a web server, with two attached EBS volumes and an Elastic IP.

The volumes are set to mount automatically on boot via fstab entries.
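For reference, each volume has a line in /etc/fstab following the usual pattern for EBS volumes; everything in the example below (UUID, filesystem type, options, and the /vol/data mount point) is illustrative rather than copied from my actual file:

UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /vol/data  xfs  defaults,nofail  0  2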

The instance runs fine for about a month; then there is an apparent CPU and network spike, and the instance becomes unreachable both via a web browser and over an SSH connection.

Rebooting the instance temporarily fixes the issue and allows access again.

I have looked at the system log, and SSM Agent log, but I can't see any errors.

Is there another log I should look into or something else to check?

My searching in the docs and here mostly brings back results for instances that remain unreachable after rebooting, whereas mine recovers after a reboot.


Update, 21st May 2024: another reachability failure occurred on 16/05/2024 at 16:54.

Looking in /var/log/messages I have found some failures/errors leading up to that time. I don't know if they are relevant. Examples below.


May 16 15:03:35 ip-172-31-25-82 systemd-networkd[1994]: enX0: Could not set DHCPv4 address: Connection timed out
May 16 15:20:00 ip-172-31-25-82 systemd-networkd[1994]: enX0: Failed

May 16 15:59:26 ip-172-31-25-82 systemd-networkd-wait-online[256722]: Timeout occurred while waiting for network connectivity.
May 16 16:16:47 ip-172-31-25-82 audit[15006]: AVC avc: denied { read write } for pid=15006 comm="mariadbd" name="wp_options.MYD" dev="xvdf" ino=33603458 scontext=system_u:system_r:mysqld_t:s0 tcontext=unconfined_u:object_r:unlabeled_t:s0 tclass=file permissive=1
May 16 16:22:17 ip-172-31-25-82 audit[15006]: AVC avc: denied { open } for pid=15006 comm="mariadbd" path="/vol/data/mysql/studioof_wp/wp_options.MYD" dev="xvdf" ino=33603458 scontext=system_u:system_r:mysqld_t:s0 tcontext=unconfined_u:object_r:unlabeled_t:s0 tclass=file permissive=1
May 16 16:27:32 ip-172-31-25-82 chronyd[2500]: Can't synchronise: no selectable sources
May 16 16:36:44 ip-172-31-25-82 systemd[1]: refresh-policy-routes@enX0.service: Main process exited, code=exited, status=1/FAILURE
May 16 16:36:45 ip-172-31-25-82 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=refresh-policy-routes@enX0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 16 16:36:45 ip-172-31-25-82 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:sy

May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: 2024-05-16 16:36:45 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. Error: EC2RoleRequestError: no EC2 instance role found
May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: caused by: RequestError: send request failed
May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: caused by: Get "http://169.254.169.254/latest/meta-data/iam/security-credentials/": dial tcp 169.254.169.254:80: connect: network is unreachable
May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: 2024-05-16 16:36:45 ERROR [TokenRequestService] failed to retrieve instance identity role. Error: EC2MetadataError: failed to get IMDSv2 token and fallback to IMDSv1 is disabled
May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: caused by: :
May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: #011status code: 0, request id:
May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: caused by: RequestError: send request failed
May 16 16:36:45 ip-172-31-25-82 amazon-ssm-agent[118643]: caused by: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connect: network is unreachable
May 16 16:36:46 ip-172-31-25-82 amazon-ssm-agent[118643]: 2024-05-16 16:36:45 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: unable to build RSA signature. No Authorization header in request
May 16 16:36:46 ip-172-31-25-82 amazon-ssm-agent[118643]: 2024-05-16 16:36:45 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity. Default Host Management Err: error calling RequestManagedInstanceRoleToken: unable to build RSA signature. No Authorization header in request
May 16 16:36:46 ip-172-31-25-82 amazon-ssm-agent[118643]: 2024-05-16 16:36:45 INFO [CredentialRefresher] Sleeping for 5m0s before retrying retrieve credentials
May 16 16:37:54 ip-172-31-25-82 systemd[1]: Starting refresh-policy-routes@enX0.service - Refresh policy routes for enX0...
May 16 16:37:54 ip-172-31-25-82 ec2net[256814]: Starting configuration for enX0
May 16 16:39:54 ip-172-31-25-82 systemd-networkd-wait-online[256816]: Timeout occurred while waiting for network connectivity.
May 16 16:39:55 ip-172-31-25-82 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=refresh-policy-routes@enX0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 16 16:39:55 ip-172-31-25-82 systemd[1]: refresh-policy-routes@enX0.service: Main process exited, code=exited, status=1/FAILURE
May 16 16:39:55 ip-172-31-25-82 systemd[1]: refresh-policy-routes@enX0.service: Failed with result 'exit-code'.

It appears to start with access errors to the MariaDB database sitting on an attached volume.
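The next time it happens, before rebooting I'm planning to capture a quick snapshot based on those entries; a rough sketch of the commands is below (the path, device, and interface names come from the log lines above, the rest is just my best guess at what will be useful):

# SELinux: recent AVC denials and the labels on the MariaDB data
# directory on the attached volume (path taken from the audit lines)
ausearch -m avc -ts today
ls -lZ /vol/data/mysql/studioof_wp/

# Network: state of the primary interface, its routes, and whether the
# IMDS endpoint the SSM agent failed to reach responds
networkctl status enX0
ip addr show enX0
ip route
curl -s -o /dev/null -w '%{http_code}\n' -X PUT http://169.254.169.254/latest/api/token -H 'X-aws-ec2-metadata-token-ttl-seconds: 21600'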

  • I have faced this issue myself, so can you try changing your IP once? It might solve it.

  • Please accept the answer if it was useful.

David
asked 14 days ago · 90 views
1 Answer
  • Since you've already checked the system log and SSM Agent log without finding any clues, consider looking into other logs such as /var/log/messages or /var/log/syslog depending on your Linux distribution. These logs might provide more information on what happens right before the instance becomes unresponsive.
  • Review the logs for any services that are critical to your application, such as web server logs (/var/log/httpd/ for Apache or /var/log/nginx/ for Nginx).
  • Consider the possibility of resource exhaustion, such as running out of memory or hitting file descriptor limits. Check /var/log/dmesg and kern.log for any kernel-level issues.
  • Tools like htop or vmstat can be used to monitor system resources in real time and might help identify patterns leading up to the spike; see the sketch below this list.
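If it helps, here is a minimal sketch of that kind of monitoring, assuming you just want a rolling record to look back at after the next spike. The output path and 60-second interval are arbitrary examples, and on AL2023 kernel messages go to the journal, so journalctl -k plays the role of kern.log:

#!/bin/bash
# Append a timestamped resource snapshot every 60 seconds.
OUT=/var/log/resource-snapshots.log   # example path
while true; do
  {
    echo "===== $(date -Is) ====="
    free -m                             # memory and swap usage
    vmstat 1 2 | tail -n 1              # CPU, run queue, swap activity
    df -h                               # disk usage, including the EBS mounts
    dmesg --level=err,warn | tail -n 5  # recent kernel warnings/errors (OOM kills show up here)
  } >> "$OUT"
  sleep 60
done

Running it inside tmux or as a small systemd service keeps it alive between SSH sessions.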
EXPERT
answered 14 days ago
  • I've just had another case of it, so I'm working through your advice and will update you shortly. Cheers.
