EC2 - Could not set DHCPv4 address: Connection timed out (sa-east-1a)


Our c6i.2xlarge instance (3-year reserved), in its first 5 days of operation, logged "Could not set DHCPv4 address: Connection timed out" on Jan 28 at 02:59:51 UTC, followed by "Failed" and "Configured". From then on the machine was unresponsive, and AWS eventually raised a StatusCheckFailed_Instance at 06:59 UTC. At 09:06 UTC the machine was stopped and restarted through the Console.

I found these apparently related issues, but I am still clueless:

CoreOS goes offline on DHCP failure on Amazon VPC

CoreOS on EC2 losing network connection once a day

The box is running MySQL 5.7.36 and Memcached 1.5.6 on top of Ubuntu 18.04. I would be thankful if someone could help me identify the root cause of this issue. In particular:

  1. Could this be related to ntp-systemd-netif.service?

  2. This instance type has a separate channel for EBS, but with the network down and no customers making requests (no usage logs on the application machine, except the "MySQL connection timeouts"), what would explain a surge in EBS disk reads? CloudWatch graphs below.

  3. We have an EFS file system mounted on this instance that started failing at 04:04 UTC, probably because of the network failure. No errors were reported on the status page for EFS in sa-east-1 (São Paulo).

    Jan 28 02:17:01 ip-172-xxx-xxx-xxx CRON[18179]: pam_unix(cron:session): session opened for user root by (uid=0)
    Jan 28 02:17:01 ip-172-xxx-xxx-xxx CRON[18180]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
    Jan 28 02:17:01 ip-172-xxx-xxx-xxx CRON[18179]: pam_unix(cron:session): session closed for user root
    Jan 28 02:29:11 ip-172-xxx-xxx-xxx systemd-networkd[728]: ens5: Configured
    Jan 28 02:29:11 ip-172-xxx-xxx-xxx systemd-timesyncd[623]: Network configuration changed, trying to establish connection.
    Jan 28 02:29:12 ip-172-xxx-xxx-xxx systemd-timesyncd[623]: Synchronized to time server 169.254.169.123:123 (169.254.169.123).
    Jan 28 02:29:12 ip-172-xxx-xxx-xxx systemd[1]: Started ntp-systemd-netif.service.
    Jan 28 02:59:51 ip-172-xxx-xxx-xxx systemd-timesyncd[623]: Network configuration changed, trying to establish connection.
    Jan 28 02:59:51 ip-172-xxx-xxx-xxx systemd-networkd[728]: ens5: Could not set DHCPv4 address: Connection timed out
    Jan 28 02:59:51 ip-172-xxx-xxx-xxx systemd-networkd[728]: ens5: Failed
    Jan 28 02:59:51 ip-172-xxx-xxx-xxx systemd-networkd[728]: ens5: Configured
    Jan 28 02:59:51 ip-172-xxx-xxx-xxx systemd-timesyncd[623]: Synchronized to time server 169.254.169.123:123 (169.254.169.123).
    Jan 28 02:59:51 ip-172-xxx-xxx-xxx systemd-timesyncd[623]: Network configuration changed, trying to establish connection.
    Jan 28 02:59:51 ip-172-xxx-xxx-xxx systemd-timesyncd[623]: Synchronized to time server 169.254.169.123:123 (169.254.169.123).
    Jan 28 03:00:01 ip-172-xxx-xxx-xxx systemd[1]: Started ntp-systemd-netif.service.
    Jan 28 03:01:21 ip-172-xxx-xxx-xxx systemd-udevd[503]: seq 16407 '/kernel/slab/proc_inode_cache/cgroup/proc_inode_cache(4935:ntp-systemd-netif.service)' is taking a long time
    Jan 28 03:01:28 ip-172-xxx-xxx-xxx systemd-udevd[503]: seq 16408 '/kernel/slab/:A-0000040/cgroup/pde_opener(4935:ntp-systemd-netif.service)' is taking a long time
    Jan 28 03:01:34 ip-172-xxx-xxx-xxx systemd-udevd[503]: seq 16409 '/kernel/slab/kmalloc-32/cgroup/kmalloc-32(4935:ntp-systemd-netif.service)' is taking a long time
    Jan 28 03:01:40 ip-172-xxx-xxx-xxx systemd-udevd[503]: seq 16410 '/kernel/slab/kmalloc-4k/cgroup/kmalloc-4k(4935:ntp-systemd-netif.service)' is taking a long time
    Jan 28 03:17:03 ip-172-xxx-xxx-xxx CRON[18284]: pam_unix(cron:session): session opened for user root by (uid=0)
    Jan 28 03:17:12 ip-172-xxx-xxx-xxx CRON[18285]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
    Jan 28 03:19:34 ip-172-xxx-xxx-xxx snapd[6419]: autorefresh.go:530: Cannot prepare auto-refresh change: Post https://api.snapcraft.io/v2/snaps/refresh: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    Jan 28 03:19:34 ip-172-xxx-xxx-xxx CRON[18284]: pam_unix(cron:session): session closed for user root
    Jan 28 03:28:44 ip-172-xxx-xxx-xxx snapd[6419]: stateengine.go:149: state ensure error: Post https://api.snapcraft.io/v2/snaps/refresh: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    Jan 28 03:36:35 ip-172-xxx-xxx-xxx systemd[1]: Starting Ubuntu Advantage Timer for running repeated jobs...
    Jan 28 04:01:18 ip-172-xxx-xxx-xxx systemd[1]: Started ntp-systemd-netif.service.
    Jan 28 04:03:09 ip-172-xxx-xxx-xxx systemd-udevd[503]: seq 16496 '/radix_tree_node(4961:ntp-systemd-netif.service)' is taking a long time
    Jan 28 04:04:00 ip-172-xxx-xxx-xxx kernel: nfs: server fs-0ac698ea1xxxxxxxx.efs.sa-east-1.amazonaws.com not responding, timed out
    Jan 28 04:06:13 ip-172-xxx-xxx-xxx kernel: nfs: server fs-0ac698ea1xxxxxxxx.efs.sa-east-1.amazonaws.com not responding, timed out
    Jan 28 04:06:26 ip-172-xxx-xxx-xxx kernel: nfs: server fs-0ac698ea1xxxxxxxx.efs.sa-east-1.amazonaws.com not responding, timed out
    Jan 28 04:09:14 ip-172-xxx-xxx-xxx kernel: nfs: server fs-0ac698ea1xxxxxxxx.efs.sa-east-1.amazonaws.com not responding, timed out
    Jan 28 04:09:26 ip-172-xxx-xxx-xxx kernel: nfs: server fs-0ac698ea1xxxxxxxx.efs.sa-east-1.amazonaws.com not responding, timed out
    Jan 28 04:12:15 ip-172-xxx-xxx-xxx kernel: nfs: server fs-0ac698ea1xxxxxxxx.efs.sa-east-1.amazonaws.com not responding, timed out
    Jan 28 04:12:26 ip-172-xxx-xxx-xxx kernel: nfs: server fs-0ac698ea1xxxxxxxx.efs.sa-east-1.amazonaws.com not responding, timed out
    Jan 28 04:12:36 ip-172-xxx-xxx-xxx kernel: nfs: server fs-0ac698ea1xxxxxxxx.efs.sa-east-1.amazonaws.com not responding, timed out
    Jan 28 04:15:15 ip-172-xxx-xxx-xxx kernel: nfs: server fs-0ac698ea1xxxxxxxx.efs.sa-east-1.amazonaws.com not responding, timed out
    Jan 28 04:15:26 ip-172-xxx-xxx-xxx kernel: nfs: server fs-0ac698ea1xxxxxxxx.efs.sa-east-1.amazonaws.com not responding, timed out
    Jan 28 04:15:34 ip-172-xxx-xxx-xxx kernel: nfs: server fs-0ac698ea1xxxxxxxx.efs.sa-east-1.amazonaws.com not responding, timed out
    Jan 28 04:16:39 ip-172-xxx-xxx-xxx sshd[4657]: pam_unix(sshd:session): session closed for user ubuntu
    Jan 28 04:17:30 ip-172-xxx-xxx-xxx systemd-logind[974]: Failed to abandon session scope, ignoring: Connection timed out
    Jan 28 04:18:00 ip-172-xxx-xxx-xxx systemd-logind[974]: Removed session 27.

[Figure: CloudWatch graphs (images not included)]

Thanks!

Asked 2 years ago, 1009 views
1 Answer

I just found this and hope this response helps. Instance status check failures usually occur for one of the following reasons:

  • Startup configuration
  • Exhausted memory
  • Corrupted file system
  • Incompatible kernel (Operating System itself)
  • Some script causing the issue

The error "Could not set DHCPv4 address" is not a cause but a symptom of connectivity issues: it only shows that the instance was unable to communicate outbound to renew its DHCP lease. What it does not say is what was blocking the connection. The same blockage would likely have prevented health checks from succeeding, hence the status check failure.
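As a rough way to see how the symptoms line up in time, the syslog excerpt from the question can be reduced to the first occurrence of each message of interest. This is only a minimal Python sketch: the function name first_seen, the regular expression, and the year default are my own assumptions (classic syslog timestamps carry no year), not part of any AWS tooling.

```python
import re
from datetime import datetime

# Matches classic syslog lines such as:
#   Jan 28 02:59:51 host systemd-networkd[728]: ens5: Failed
LINE_RE = re.compile(
    r"^\s*(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<proc>[^\s\[:]+)(?:\[\d+\])?: (?P<msg>.*)$"
)

def first_seen(lines, needles, year=2000):
    """Return {needle: datetime of the first line whose message contains it}.

    `year` is an assumption: the log format does not record one, and it only
    matters if the log spans a year boundary.
    """
    seen = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that are not in syslog format
        for needle in needles:
            if needle not in seen and needle in m["msg"]:
                seen[needle] = datetime.strptime(
                    f"{year} {m['ts']}", "%Y %b %d %H:%M:%S"
                )
    return seen

# Two lines copied from the log in the question.
sample = [
    "Jan 28 02:59:51 ip-172-xxx-xxx-xxx systemd-networkd[728]: ens5: Could not set DHCPv4 address: Connection timed out",
    "Jan 28 04:04:00 ip-172-xxx-xxx-xxx kernel: nfs: server fs-0ac698ea1xxxxxxxx.efs.sa-east-1.amazonaws.com not responding, timed out",
]

DHCP = "Could not set DHCPv4 address"
NFS = "not responding, timed out"
seen = first_seen(sample, [DHCP, NFS])
delta = seen[NFS] - seen[DHCP]
print(delta)  # prints 1:04:09
```

On these two lines, the first NFS timeout comes a little over an hour after the DHCP failure, which is consistent with EFS failing as a consequence of the network loss rather than independently.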

The most common cause of this issue is an internal (host-based) firewall. In my experience, I have seen a few DHCP lease renewal failures on servers running Trend DSA. You can check whether any security agent is installed on the OS.

If system logs such as syslog are still available, I would also recommend checking whether they contain any errors indicating memory exhaustion.
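One way to do that check is to search the log for the messages the Linux kernel writes when its OOM killer runs. The sketch below is a hedged example: find_oom_lines and scan_file are hypothetical helper names, and /var/log/syslog is assumed to be the log location, which is the usual default on Ubuntu.

```python
import re

# Messages the Linux kernel logs when the OOM killer runs.
OOM_RE = re.compile(r"invoked oom-killer|Out of memory:")

def find_oom_lines(lines):
    """Return the log lines that mention the kernel OOM killer."""
    return [line.rstrip("\n") for line in lines if OOM_RE.search(line)]

def scan_file(path="/var/log/syslog"):
    """Scan a log file on disk; the default path is an assumption."""
    with open(path, errors="replace") as fh:
        return find_oom_lines(fh)

# Two lines copied from the log in the question.
excerpt = [
    "Jan 28 02:59:51 ip-172-xxx-xxx-xxx systemd-networkd[728]: ens5: Failed",
    "Jan 28 04:04:00 ip-172-xxx-xxx-xxx kernel: nfs: server fs-0ac698ea1xxxxxxxx.efs.sa-east-1.amazonaws.com not responding, timed out",
]
print(find_oom_lines(excerpt))  # prints []
```

An empty result on a short excerpt does not rule out memory pressure; it only means no OOM-killer message appears in those lines, so the full log file is worth scanning with scan_file.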

The CloudWatch metrics might no longer be available after this much time; however, checking whether there was a spike in Network In/Out traffic around the time of the status check failures can also be a good indicator of why they failed.
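If the data is still retained, the check can be scripted. The following is a sketch under stated assumptions: it uses the boto3 CloudWatch client's get_metric_statistics call with the AWS/EC2 NetworkIn metric, while the helper names, the 5-minute period, the sa-east-1 default, and the "5 times the median" threshold are my own illustrative choices, not an AWS recommendation.

```python
from statistics import median

def fetch_network_sums(instance_id, start, end, metric="NetworkIn",
                       period=300, region="sa-east-1"):
    """Fetch per-period byte sums for one instance as (timestamp, value) pairs.

    Requires the third-party boto3 package and AWS credentials, so the
    import is kept local to this function.
    """
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,  # "NetworkIn" or "NetworkOut"
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=period,
        Statistics=["Sum"],
    )
    # Datapoints are not guaranteed to arrive in time order.
    return sorted((p["Timestamp"], p["Sum"]) for p in resp["Datapoints"])

def find_spikes(points, factor=5.0):
    """Return the (timestamp, value) pairs above `factor` times the median."""
    values = [value for _, value in points]
    if not values:
        return []
    baseline = median(values)
    if baseline <= 0:
        return []  # no meaningful baseline to compare against
    return [(ts, value) for ts, value in points if value > factor * baseline]
```

A typical use would be to call fetch_network_sums for a window of a few hours around the status check failure, once for NetworkIn and once for NetworkOut, and pass each result to find_spikes.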

AWS
Answered 1 year ago
