WordPress running on EC2 goes down

Hi

My website is hosted on an EC2 instance. I've noticed that the website becomes unreachable for 15 to 40 minutes every day. I'm unsure about the cause of this issue with the EC2 instance.

The instance is a c5a instance type.

Please review the following /var/log/messages logs from the outage window and provide your advice:

Nov 19 02:46:18 server amazon-ssm-agent[901]: 2024-11-19 02:46:18 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 143944653589
Nov 19 02:46:18 server amazon-ssm-agent[901]: #011status code: 400, request id: ff7618c6-8bfd-4259-b543-dfc8672deeb9
Nov 19 02:47:01 server systemd[1]: Created slice User Slice of UID 0.
Nov 19 02:47:01 server systemd[1]: Starting User Runtime Directory /run/user/0...
Nov 19 02:47:01 server systemd[1]: Finished User Runtime Directory /run/user/0.
Nov 19 02:47:01 server systemd[1]: Starting User Manager for UID 0...
Nov 19 02:47:02 server systemd[328041]: Queued start job for default target Main User Target.
Nov 19 02:47:02 server systemd[328041]: Created slice User Application Slice.
Nov 19 02:47:02 server systemd[328041]: Mark boot as successful after the user session has run 2 minutes was skipped because of an unmet condition check (ConditionUser=!@system).
Nov 19 02:47:02 server systemd[328041]: Started Daily Cleanup of User's Temporary Directories.
Nov 19 02:47:02 server systemd[328041]: Reached target Paths.
Nov 19 02:47:02 server systemd[328041]: Reached target Timers.
Nov 19 02:47:02 server systemd[328041]: Starting D-Bus User Message Bus Socket...


Nov 19 02:48:35 server systemd[1]: Stopped User Runtime Directory /run/user/0.
Nov 19 02:48:35 server systemd[1]: Removed slice User Slice of UID 0.
Nov 19 02:48:35 server systemd[1]: user-0.slice: Consumed 1.008s CPU time.
Nov 19 02:50:01 server systemd[1]: Starting system activity accounting tool...
Nov 19 02:50:01 server systemd[1]: sysstat-collect.service: Deactivated successfully.
Nov 19 02:50:01 server systemd[1]: Finished system activity accounting tool.
Nov 19 02:50:01 server systemd[1]: Created slice User Slice of UID 0.
Nov 19 02:50:01 server systemd[1]: Starting User Runtime Directory /run/user/0...
Nov 19 02:50:01 server systemd[1]: Finished User Runtime Directory /run/user/0.
Nov 19 02:50:01 server systemd[1]: Starting User Manager for UID 0...
Nov 19 02:50:01 server systemd[328312]: Queued start job for default target Main User Target.
Nov 19 02:50:01 server systemd[328312]: Created slice User Application Slice.
Nov 19 02:50:01 server systemd[328312]: Mark boot as successful after the user session has run 2 minutes was skipped because of an unmet condition check (ConditionUser=!@system).
Nov 19 02:50:01 server systemd[328312]: Started Daily Cleanup of User's Temporary Directories.
Nov 19 02:50:01 server systemd[328312]: Reached target Paths.
Nov 19 02:50:01 server systemd[328312]: Reached target Timers.
Nov 19 02:50:01 server systemd[328312]: Starting D-Bus User Message Bus Socket...
Nov 19 02:50:01 server systemd[328312]: PipeWire PulseAudio was skipped because of an unmet condition check (ConditionUser=!root).
Nov 19 02:50:01 server systemd[328312]: Listening on PipeWire Multimedia System Sockets.
Nov 19 02:50:01 server systemd[328312]: Starting Create User's Volatile Files and Directories...
Nov 19 02:50:01 server systemd[328312]: Finished Create User's Volatile Files and Directories.
Nov 19 02:50:01 server systemd[328312]: Listening on D-Bus User Message Bus Socket.
Nov 19 02:50:01 server systemd[328312]: Reached target Sockets.
Nov 19 02:50:01 server systemd[328312]: Reached target Basic System.
Nov 19 02:50:01 server systemd[328312]: Reached target Main User Target.
Nov 19 02:50:01 server systemd[1]: Started User Manager for UID 0.
Nov 19 02:50:01 server systemd[328312]: Startup finished in 182ms.
Nov 19 02:50:01 server systemd[1]: Started Session 3289 of User root.
Nov 19 02:50:01 server systemd[1]: Started Session 3290 of User root.
Nov 19 02:50:01 server systemd[1]: session-3290.scope: Deactivated successfully.
Nov 19 02:50:05 server systemd[1]: session-3289.scope: Deactivated successfully.
Nov 19 02:50:15 server systemd[1]: Stopping User Manager for UID 0...
Nov 19 02:50:15 server systemd[328312]: Activating special unit Exit the Session...
Nov 19 03:00:15 server systemd[328892]: Removed slice User Application Slice.
Nov 19 03:00:15 server systemd[328892]: Reached target Shutdown.
Nov 19 03:00:15 server systemd[328892]: Finished Exit the Session.
Nov 19 03:00:15 server systemd[328892]: Reached target Exit the Session.
Nov 19 03:00:15 server systemd[1]: user@0.service: Deactivated successfully.
Nov 19 03:00:15 server systemd[1]: Stopped User Manager for UID 0.
Nov 19 03:00:15 server systemd[1]: Stopping User Runtime Directory /run/user/0...
Nov 19 03:00:15 server systemd[1]: run-user-0.mount: Deactivated successfully.
Nov 19 03:00:15 server systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
Nov 19 03:00:15 server systemd[1]: Stopped User Runtime Directory /run/user/0.
Nov 19 03:00:15 server systemd[1]: Removed slice User Slice of UID 0.
Nov 19 03:02:01 server systemd[1]: Created slice User Slice of UID 0.
Nov 19 03:02:01 server systemd[1]: Starting User Runtime Directory /run/user/0...
Nov 19 03:02:01 server systemd[1]: Finished User Runtime Directory /run/user/0.
Nov 19 03:02:01 server systemd[1]: Starting User Manager for UID 0...
Nov 19 03:02:02 server systemd[328979]: Queued start job for default target Main User Target.
Nov 19 03:02:02 server systemd[328979]: Created slice User Application Slice.
Nov 19 03:02:02 server systemd[328979]: Mark boot as successful after the user session has run 2 minutes was skipped because of an unmet condition check (ConditionUser=!@system).
Nov 19 03:02:02 server systemd[328979]: Started Daily Cleanup of User's Temporary Directories.
Nov 19 03:02:02 server systemd[328979]: Reached target Paths.
Nov 19 03:02:02 server systemd[328979]: Reached target Timers.
Nov 19 03:02:02 server systemd[328979]: Starting D-Bus User Message Bus Socket...
Nov 19 03:02:02 server systemd[328979]: PipeWire PulseAudio was skipped because of an unmet condition check (ConditionUser=!root).
Nov 19 03:02:02 server systemd[328979]: Listening on PipeWire Multimedia System Sockets.
Nov 19 03:02:02 server systemd[328979]: Starting Create User's Volatile Files and Directories...
Nov 19 03:02:02 server systemd[328979]: Listening on D-Bus User Message Bus Socket.
Nov 19 03:02:02 server systemd[328979]: Reached target Sockets.
Nov 19 03:02:02 server systemd[328979]: Finished Create User's Volatile Files and Directories.
Nov 19 03:02:02 server systemd[328979]: Reached target Basic System.
Nov 19 03:02:02 server systemd[1]: Started User Manager for UID 0.
Nov 19 03:02:02 server systemd[1]: Started Session 3296 of User root.
Nov 19 03:02:02 server systemd[328979]: Reached target Main User Target.
Nov 19 03:02:02 server systemd[328979]: Startup finished in 174ms.

asked 2 months ago, 57 views
2 Answers

Based on the provided logs, there are a few issues that could be contributing to your EC2 instance's intermittent downtime:

  1. Systems Manager Role Issue: There's an error indicating that the Systems Manager's instance management role is not configured for your account. This could prevent proper management and monitoring of your instance.

  2. User Session Management: The logs show frequent starting and stopping of user sessions, particularly for the root user. This constant creation and removal of user slices could potentially impact system resources.

  3. Service Restarts: There are multiple entries showing services being started and stopped, which could indicate instability in some system components.

  4. Resource Consumption: These logs do not show resource usage. The c5a instance type is not burstable, so it is not subject to the CPU credit exhaustion that can throttle T2 or T3 instances. It is still worth monitoring your resource usage to make sure you are not hitting any limits.

To address these issues and improve your EC2 instance's stability:

  1. Configure the Systems Manager instance management role correctly for your account to ensure proper monitoring and management capabilities.

  2. Investigate why there are frequent user session creations and removals, especially for the root user. This might indicate an underlying issue or misconfiguration.

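  3. Monitor your instance's resource usage (CPU, memory, disk I/O) using Amazon CloudWatch to identify any potential bottlenecks or resource constraints. Note that memory metrics appear in CloudWatch only if the CloudWatch agent is installed on the instance.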

  4. Review your application logs (e.g., Apache/Nginx, PHP, MySQL) for any errors or issues that might be causing the web server to become unresponsive.

  5. Consider setting up auto-recovery for your EC2 instance to automatically recover it if it becomes impaired.

  6. Implement a robust monitoring and alerting system to notify you of any issues before they lead to extended downtime.
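
As a starting point for the monitoring step above, a small external probe can record exactly when the site stops answering, which makes it easier to line the outage up with server logs. The following is a minimal sketch in Python using only the standard library; the function names `probe` and `watch`, the 60-second interval, and the URL in the usage note are illustrative assumptions, not part of any AWS tooling. It should run from a machine other than the affected instance.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone


def probe(url, timeout=5.0):
    """Fetch url once and return (reachable, detail)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, and similar.
        return False, f"unreachable: {exc}"


def watch(url, interval=60.0, rounds=3):
    """Print one timestamped UP/DOWN line per check."""
    for i in range(rounds):
        ok, detail = probe(url)
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} UTC {'UP' if ok else 'DOWN'} {detail}")
        if i < rounds - 1:
            time.sleep(interval)
```

For example, `watch("https://example.com/", interval=60.0, rounds=1440)` (with your own site URL in place of the placeholder) would log one line per minute for a day. The DOWN lines then give the outage window to compare against /var/log/messages and the web server logs.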

If the problem persists after addressing these points, you may need to perform a more in-depth analysis of your application and server configuration or consider reaching out to AWS Support for further assistance.

answered 2 months ago, EXPERT reviewed 2 months ago

Based on the logs you've shared, here's a summary of potential issues and steps you can take to troubleshoot the intermittent downtime on your EC2 instance:

Key Log Entries:

  1. SSM Agent Access Denied:

    Nov 19 02:46:18 server amazon-ssm-agent[901]: ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account.
    
    • The SSM agent on your EC2 instance is failing to authenticate because the SSM role is not correctly configured. This is unlikely to cause downtime directly, but if you rely on SSM for management tasks (e.g., patching or Run Command), those tasks will not run. Ensure that the EC2 instance role has the permissions required to use SSM.

    • Action: Verify that your EC2 instance has the correct IAM role with AmazonSSMManagedInstanceCore policy attached, and ensure that Systems Manager is properly configured.

  2. Systemd Messages: The logs show repeated systemd messages about root user sessions (user@0.service) starting and stopping. These messages do not directly indicate downtime, but they show that something is repeatedly opening root sessions, and whatever runs in those sessions could cause temporary unavailability.

  3. User Session Shutdown: The log shows several instances of user sessions being stopped:

    Nov 19 02:50:15 server systemd[1]: Stopping User Manager for UID 0...
    

    This may indicate some process is stopping or restarting system services, potentially causing the web server or application to be unavailable.

    • Action: Check if any cron jobs, system updates, or other automated processes are stopping services around this time (e.g., from 2:45 AM to 3:00 AM). Review the system's cron logs and scheduled tasks to identify any automated processes that could be interfering with the web server.

  4. Verify Resource Utilization:

    • CPU/Memory: High CPU or memory usage can cause the instance to become unresponsive. Check your EC2 instance’s CPU metrics in CloudWatch to rule out resource exhaustion during the outage window. Memory metrics are not published by default; they appear in CloudWatch only if the CloudWatch agent is installed on the instance.

    • Action: Enable detailed monitoring (1-minute metrics) on the instance to check for resource spikes during the time of the issue.

  5. Web Server Logs:

    • Review your web server logs (e.g., Nginx, Apache) for any errors or restarts during the downtime period. This might give clues if the server itself is crashing or being restarted.
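
For the resource check in item 4, memory is the harder metric to get because it is not visible from outside the instance. As a stopgap, a small sampler on the instance could record available memory with a timestamp. The sketch below is in Python and reads the standard Linux /proc/meminfo file; the function names and the output format are illustrative assumptions, and it presumes a kernel recent enough to report `MemAvailable`.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: number}; most fields are in kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_line(text, now: Optional[datetime] = None):
    """Format one timestamped line with available memory and its share of total."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    available = info["MemAvailable"]
    pct = 100.0 * available / total
    now = now or datetime.now(timezone.utc)
    return (
        f"{now:%Y-%m-%d %H:%M:%S} UTC "
        f"MemAvailable={available // 1024} MiB ({pct:.1f}% of total)"
    )


def sample(path="/proc/meminfo"):
    """Read the live file and return one log line."""
    with open(path) as fh:
        return memory_line(fh.read())
```

Running `print(sample())` once a minute (for instance from a scheduled job that appends to a file) gives a record that can be compared with the outage window. A steady fall in available memory just before the site goes down would point toward memory exhaustion rather than CPU.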

Additional Steps to Take:

  • EC2 Instance Role Permissions: Double-check IAM role and permissions.
  • Automated Processes: Check for scheduled tasks (cron jobs, etc.) causing system downtime.
  • Check Web Server: Ensure the web server isn't restarting or encountering errors during that time.
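
To help with the scheduled-task check, /var/log/messages can be scanned for quiet stretches and for minutes in which root sessions start, since cron jobs running as root produce "Started Session N of User root" entries. The following is a minimal sketch in Python, assuming classic syslog timestamps. These omit the year, so `year` is an assumed parameter. The function names are illustrative, and the `SAMPLE` lines are copied from the logs in the question.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Classic syslog prefix, e.g. "Nov 19 02:50:15 server systemd[1]: ..."
LINE_RE = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ (.*)$")

SAMPLE = """\
Nov 19 02:50:01 server systemd[1]: Started Session 3289 of User root.
Nov 19 02:50:01 server systemd[1]: Started Session 3290 of User root.
Nov 19 02:50:15 server systemd[1]: Stopping User Manager for UID 0...
Nov 19 03:00:15 server systemd[328892]: Removed slice User Application Slice.
Nov 19 03:02:02 server systemd[1]: Started Session 3296 of User root.
"""


def parse(lines, year=2024):
    """Yield (timestamp, message) for each line that has a syslog prefix."""
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            stamp = " ".join(m.group(1).split())  # collapse padding in "Nov  9"
            ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            yield ts, m.group(2)


def find_gaps(entries, threshold=timedelta(minutes=5)):
    """Return (before, after) pairs where consecutive entries exceed threshold."""
    gaps = []
    prev = None
    for ts, _ in entries:
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps


def root_sessions_per_minute(entries):
    """Count 'Started Session N of User root' messages per HH:MM."""
    counts = Counter()
    for ts, msg in entries:
        if re.search(r"Started Session \d+ of User root", msg):
            counts[ts.strftime("%H:%M")] += 1
    return counts
```

With `entries = list(parse(SAMPLE.splitlines()))`, `find_gaps(entries)` reports a single gap between 02:50:15 and 03:00:15, and `root_sessions_per_minute(entries)` shows root sessions starting at 02:50 and 03:02. A gap only means nothing was logged (or that the pasted excerpt was trimmed), so treat it as a pointer for where to look in the cron and web server logs, not as proof of an outage.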

If you continue facing issues, consider investigating further into any system-level updates or patching schedules that might be affecting the instance’s stability.

answered 2 months ago
