EC2 instance rebooted automatically.


I'm running an EC2 instance of type 't2.micro'. This instance rebooted twice between 19:30 and 19:36 on 21 August 2024. I have confirmed that there was no spike in resource utilization before the reboots, and I didn't find any log entries related to OOM.

However, I found the following log entries around the time of the reboots:

HTTPD error log

mmap() failed: [12] Cannot allocate memory

Kernel and ssm-agent messages

./messages-20240825:Aug 21 19:30:28 mafiree kernel: NMI watchdog: Perf event create on CPU 0 failed with -2
./messages-20240825:Aug 21 19:30:28 mafiree kernel: acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
./messages-20240825:Aug 21 19:30:28 mafiree kernel: acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
./messages-20240825:Aug 21 19:30:32 mafiree rngd: Failed to init entropy source hwrng
./messages-20240825:Aug 21 19:30:36 mafiree amazon-ssm-agent: 2024/08/21 19:30:36 Failed to load instance info from vault. RegistrationKey does not exist.
./messages-20240825:Aug 21 19:35:36 mafiree amazon-ssm-agent: 2024-08-21 19:35:36 ERROR Health ping failed with error - NoCredentialProviders: no valid providers in chain. Deprecated.
./messages-20240825:Aug 21 19:35:36 mafiree amazon-ssm-agent: 2024-08-21 19:35:36 ERROR Health ping failed with error - NoCredentialProviders: no valid providers in chain. Deprecated.
./messages-20240825:Aug 21 19:36:11 mafiree kernel: NMI watchdog: Perf event create on CPU 0 failed with -2
./messages-20240825:Aug 21 19:36:11 mafiree kernel: acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM
./messages-20240825:Aug 21 19:36:11 mafiree kernel: acpi PNP0A03:00: fail to add MMCONFIG information, can't access extended PCI configuration space under this bridge.
./messages-20240825:Aug 21 19:36:15 mafiree rngd: Failed to init entropy source hwrng
./messages-20240825:Aug 21 19:36:19 mafiree amazon-ssm-agent: 2024/08/21 19:36:19 Failed to load instance info from vault. RegistrationKey does not exist.

Also, the panic_on_oom kernel parameter is set to 1.
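
For anyone checking the same things, the relevant settings and OOM traces can be confirmed with something like the following (a minimal sketch, assuming the default Amazon Linux log locations under /var/log/messages*):

sysctl vm.panic_on_oom kernel.panic          # kernel.panic > 0 means the system reboots N seconds after a panic
sudo grep -iE 'out of memory|oom-killer|kernel panic' /var/log/messages*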

If HTTPD couldn't allocate memory, I would expect the HTTPD service to crash or be restarted, but in this case the instance itself rebooted. So I suspect a hardware or kernel problem might be responsible for the reboot. I need to know the actual reason for the reboot, with evidence, and how to avoid it in the future (if this is something a system administrator can address).

Thanks in advance.

asked a month ago · 49 views
1 Answer

Analysis of Logs

Memory Allocation Failure:

The mmap() failed: [12] Cannot allocate memory error in the HTTPD logs (errno 12 is ENOMEM) indicates that the server attempted to allocate memory and could not. On its own, this does not typically cause a system reboot; it would usually just cause the HTTPD worker or service to crash or restart.
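
To see whether the instance is simply running close to its 1 GiB of RAM, a quick check along these lines can help (a sketch; sar is only available if the sysstat package is installed and collecting data):

free -m                                    # current memory and swap headroom
ps -eo rss,comm --sort=-rss | head -15     # largest resident processes (RSS in KiB)
sar -r -f /var/log/sa/sa21                 # historical memory usage for the 21st, if sysstat was running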

Kernel Logs:

NMI watchdog: Perf event create on CPU 0 failed with -2: This means the kernel could not create the hardware performance-counter event that the NMI watchdog relies on (-2 is -ENOENT); this is common on virtualized instances whose virtual CPU does not expose a full PMU. The NMI (Non-Maskable Interrupt) watchdog itself can trigger a system reboot if it detects that the system is hung or unresponsive.

acpi PNP0A03:00: _OSC failed (AE_NOT_FOUND); disabling ASPM: The ACPI (Advanced Configuration and Power Interface) errors suggest that the kernel encountered issues with power management and PCI configuration. This could indicate a deeper issue with the hardware or a problem in the ACPI configuration within the kernel.
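
One way to judge whether these kernel messages are part of the problem or just routine boot output is to check whether they also appear at boots that were not followed by any trouble, for example (assuming rotated logs under /var/log/messages*):

sudo grep -E 'NMI watchdog|_OSC failed|MMCONFIG' /var/log/messages* | sort
last -x reboot | head                      # boot/reboot times to line the messages up against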

SSM Agent Errors:

The errors from amazon-ssm-agent about failing to load instance info and health pings failing might indicate that the instance was rebooting and the agent was unable to communicate with the AWS back-end services. However, these errors are likely a symptom of the reboot rather than the cause.

OOM (Out of Memory) Handling:

The panic_on_oom kernel parameter is set to 1, meaning that the system will panic and reboot if an out-of-memory (OOM) condition occurs. Even though there were no explicit OOM messages in the logs, the HTTPD memory allocation failure could suggest a low-memory situation that triggered a kernel panic.

Determining the Cause

To determine the root cause, you can take the following steps:

Check CloudWatch Logs and Metrics:

Review the EC2 instance status checks and CloudWatch metrics around the time of the reboots. Look for any System Status Check or Instance Status Check failures; a failed system status check points to a problem with the underlying AWS host rather than with the guest OS.
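
A minimal AWS CLI sketch for the same checks (the instance ID and time window are placeholders; the CloudTrail lookup only shows reboots that were triggered through the API):

aws ec2 describe-instance-status --instance-ids i-0123456789abcdef0 --include-all-instances
aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=RebootInstances \
    --start-time 2024-08-21T19:00:00Z --end-time 2024-08-21T20:00:00Z

describe-instance-status also returns any scheduled events for the instance, which covers the maintenance scenario discussed further down.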

Inspect System Logs (dmesg):

Check the kernel logs for messages leading up to the reboot. Note that dmesg only shows the current boot, so look in /var/log/messages (or the persistent journal, if enabled) for kernel panics, OOM killer activity, or hardware-related errors recorded before the reboots.
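
For example (a sketch, assuming Amazon Linux with rotated logs under /var/log/messages*; journald persistence may be disabled by default, in which case the previous boot is not retained):

last -x reboot shutdown | head             # clean shutdowns vs. unexpected resets
sudo grep -iE 'oom-killer|out of memory|kernel panic|hard lockup' /var/log/messages*
journalctl --list-boots                    # lists earlier boots only if persistent journaling is enabled
sudo journalctl -k -b -1                   # kernel messages from the previous boot, if retained

If the reboots keep happening, enabling kdump (kexec-tools) will capture a crash dump that shows definitively whether a kernel panic was involved.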

AWS EC2 Monitoring:

Review the AWS EC2 monitoring metrics for the instance. Look for any unusual CPU, memory, or I/O activity just before the reboot.

Potential Causes and Solutions

Kernel or Hardware Issue:

The ACPI errors and NMI watchdog failures point towards potential hardware issues or kernel bugs. You may want to update the kernel to the latest stable version available for your OS. If the problem persists, consider launching a new instance on different hardware.
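
A sketch of both steps on an Amazon Linux / RHEL-family instance (the package manager and instance ID are assumptions; for an EBS-backed instance a stop/start, unlike a reboot, moves it onto different underlying hardware):

sudo yum update -y kernel && sudo reboot
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 start-instances --instance-ids i-0123456789abcdef0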

Out of Memory (OOM):

If the instance ran out of memory, the kernel's panic_on_oom setting would cause a panic and reboot. Consider increasing the instance size or optimizing your application's memory usage. You could also set panic_on_oom back to 0, so that the OOM killer terminates the offending process instead of bringing down the whole instance.
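
If you decide to relax the setting, something like this would do it (the sysctl.d file name is just an example):

sudo sysctl -w vm.panic_on_oom=0                                   # apply immediately
echo 'vm.panic_on_oom = 0' | sudo tee /etc/sysctl.d/90-oom.conf    # persist across reboots
sudo sysctl --system                                               # reload sysctl settings

With panic_on_oom=0 the OOM killer kills the largest offending process, so HTTPD might die under memory pressure but the instance itself would stay up.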

AWS Infrastructure Maintenance:

Sometimes, AWS may automatically reboot instances for maintenance purposes. Check the Event History in the EC2 console for any scheduled maintenance activities around the time of the reboot.

Preventive Measures

Instance Size Adjustment:

If memory allocation failures are common, consider upgrading to an instance type with more memory, such as t3.small (2 GiB) or t3.medium (4 GiB). Note that t3.micro has the same 1 GiB of memory as t2.micro, so it would not help here.
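
Resizing requires a stop/start; a rough AWS CLI sequence (instance ID and target type are placeholders) would be:

aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --instance-type "{\"Value\": \"t3.small\"}"
aws ec2 start-instances --instance-ids i-0123456789abcdef0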

Monitoring and Alerts:

Set up CloudWatch alarms for CPU, memory, and disk utilization so you are alerted before a resource bottleneck leads to a reboot. Note that memory and disk metrics are only published if the CloudWatch agent is installed on the instance.
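
For example, a CPU alarm could be created roughly like this (the alarm name, threshold, and SNS topic ARN are placeholders; a memory alarm would use a metric such as mem_used_percent in the CWAgent namespace once the agent is running):

aws cloudwatch put-metric-alarm \
    --alarm-name ec2-high-cpu \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --statistic Average --period 300 --evaluation-periods 2 \
    --threshold 80 --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts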

Kernel Update:

Keep your instance's kernel and related packages up to date to pick up fixes for known kernel-level issues.

Application Optimization:

Optimize your HTTPD server configuration to handle memory more efficiently. Consider tuning the number of worker threads/processes and reducing memory usage per worker.
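
A rough way to size this on the instance itself (assumes the Apache binary is named httpd, as on Amazon Linux):

httpd -V | grep -i mpm                      # which MPM (prefork/worker/event) is in use
ps -C httpd -o rss= | awk '{sum+=$1; n++} END {if (n) printf "%d workers, avg RSS %.0f MiB\n", n, sum/n/1024}'

Multiply the average worker RSS by MaxRequestWorkers (MaxClients on older versions) and keep the result comfortably below the roughly 1 GiB available on a t2.micro, leaving headroom for the OS and any other services.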

EXPERT
answered a month ago
EXPERT
reviewed a month ago
