EC2 Instance failure (about system logs and state hanging)

Hi

Recently we've been having issues with EC2 (Debian Linux) instances sometimes failing their "Instance Status Check". According to the KB (link), that should mean the issue is related to the OS. This leads to three questions:

  1. When trying to shut down the instance (since SSH can't reach it), it seems to just hang. Force stopping also doesn't seem to actually force the stop. I've tried looking at the docs, but the wording seems unclear to me. Does "If, after 10 minutes, the instance has not stopped, post a request for help" mean that a force stop is expected to take 10 minutes?
  2. Logs. After recovering the instance, the on-disk logs don't show any signs of trouble. The journald log for that boot just stops suddenly, indicating a very hard crash. Our regular monitoring does not show any signs of CPU (5% usage) or memory (70% usage) exhaustion. Of course, if it's a kernel panic, the logs on disk will not show anything, so we tried looking at the system log/console output after the instance was recovered, but the text from before the boot had already scrolled out. The command was aws ec2 get-console-output --instance-id i-0c48fd2b6b069df76. I looked at the docs and the help function, but couldn't find a way to scroll backwards. Is there a way to do this?
  3. Additionally, is there a way to log these (the console output/system log) to a CloudWatch log stream so they become permanent? Or some other way of achieving the same goal?

Additional suggestions are welcome. I'm trying to figure out both what happened in that last crash and how to persist the information if it happens again.

Thank you. Martin

Martin
asked 10 days ago, 96 views
1 Answer
Accepted Answer

Hello.

When trying to shut down the instance (since SSH can't reach it), it seems to just hang. Force stopping also doesn't seem to actually force the stop. I've tried looking at the docs, but the wording seems unclear to me. Does "If, after 10 minutes, the instance has not stopped, post a request for help" mean that a force stop is expected to take 10 minutes?

Normally, it takes less than 10 minutes to force a stop.
If it takes more than 10 minutes, something has probably gone wrong, and you should contact AWS Support or ask for help on AWS re:Post.
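
As a rough sketch of how the 10-minute guideline could be checked from a script (not an official AWS procedure), the following uses boto3's `stop_instances` with `Force=True` and the `instance_stopped` waiter. It assumes boto3 is installed and credentials are configured; the helper name `force_stop` and its timeout parameters are hypothetical.

```python
def force_stop(ec2, instance_id, timeout_s=600, poll_s=15):
    """Request a forced stop, then wait up to timeout_s for the 'stopped' state.

    `ec2` is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    # Force=True stops the instance without giving the OS a chance to
    # flush file system caches or metadata.
    ec2.stop_instances(InstanceIds=[instance_id], Force=True)

    # The waiter polls describe_instances; if the instance has not stopped
    # after MaxAttempts polls, boto3 raises botocore.exceptions.WaiterError.
    waiter = ec2.get_waiter("instance_stopped")
    waiter.wait(
        InstanceIds=[instance_id],
        WaiterConfig={"Delay": poll_s, "MaxAttempts": max(1, timeout_s // poll_s)},
    )
```

Calling `force_stop(boto3.client("ec2"), "i-0c48fd2b6b069df76")` either returns once the instance is stopped or raises after roughly 10 minutes, which is the point at which asking for help makes sense.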

Logs. After recovering the instance, the on-disk logs don't show any signs of trouble. The journald log for that boot just stops suddenly, indicating a very hard crash. Our regular monitoring does not show any signs of CPU (5% usage) or memory (70% usage) exhaustion. Of course, if it's a kernel panic, the logs on disk will not show anything, so we tried looking at the system log/console output after the instance was recovered, but the text from before the boot had already scrolled out. The command was aws ec2 get-console-output --instance-id i-0c48fd2b6b069df76. I looked at the docs and the help function, but couldn't find a way to scroll backwards. Is there a way to do this?

Even if you follow the steps in the document below from the AWS Management Console, are you still unable to see old logs?
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/troubleshoot-unreachable-instance.html#instance-console-console-output
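
For reference, the console output API keeps only the most recent 64 KB, and by default it returns buffered output that was posted shortly after an instance state transition, so there is no way to page further back. On Nitro-based instance types you can request the latest output instead. A minimal sketch with boto3, assuming installed boto3 and configured credentials (the helper name `latest_console_output` is hypothetical):

```python
def latest_console_output(ec2, instance_id):
    """Return the most recent console output as text ('' if none is available).

    `ec2` is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    # Latest=True asks for the most recent output rather than the buffered
    # copy; AWS documents it as supported only on Nitro-based instance types.
    resp = ec2.get_console_output(InstanceId=instance_id, Latest=True)
    # As far as I know, botocore already base64-decodes "Output" for you;
    # the key can be missing when no output is available yet.
    return resp.get("Output", "")
```

The CLI counterpart is the `--latest` flag of `aws ec2 get-console-output`.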

Additionally, is there a way to log these (the console output/system log) to a CloudWatch log stream so they become permanent? Or some other way of achieving the same goal?

Depending on your OS, you can install CloudWatch Agent and output logs to CloudWatch Logs.
You can install CloudWatch Agent by following the steps in the document below.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-on-EC2-Instance-fleet.html

As an example of the log settings, the following configuration sends "/var/log/messages" to CloudWatch Logs.
Please note that the system log file path varies depending on the OS.
If you are using Amazon Linux 2023, installing rsyslog will cause system logs to be written to "/var/log/messages".

{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/messages",
            "log_group_name": "syslog",
            "log_stream_name": "{instance_id}",
            "retention_in_days": 30
          }
        ]
      }
    }
  }
}
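
Since the question is about Debian, here is a small sketch that generates the same configuration for a different log path. It assumes rsyslog is installed and writing to "/var/log/syslog", which is the usual Debian location but should be verified on your image; the helper name `agent_log_config` is hypothetical, and the log group name is simply carried over from the example above.

```python
import json


def agent_log_config(file_path, log_group="syslog", retention_days=30):
    """Build a CloudWatch agent 'logs' section that ships one log file."""
    entry = {
        "file_path": file_path,
        "log_group_name": log_group,
        # {instance_id} is a placeholder that the agent expands itself.
        "log_stream_name": "{instance_id}",
        "retention_in_days": retention_days,
    }
    return {"logs": {"logs_collected": {"files": {"collect_list": [entry]}}}}


# Assumed Debian path; adjust if your image logs elsewhere.
print(json.dumps(agent_log_config("/var/log/syslog"), indent=2))
```

The printed JSON can be saved as the agent configuration file described in the installation document above.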
EXPERT
answered 10 days ago
  • Hi Riku! Thanks for chiming in. :)

    Normally, it takes less than 10 minutes to force a stop.

    It didn't take a full 10 minutes, but it was several minutes, for a server that can usually do a controlled reboot in under a minute. So I was just wondering.

    Even if you follow the steps in the document below from the AWS Management Console, are you still unable to see old logs?

    Yes, the Debian + cloud-init boot process puts too much text into the log, so if there is anything, it scrolls out of the buffer you can fetch there. I've been experimenting a bit with crashing another instance on purpose, and it looks like the panic never shows up in the system log at all. Interestingly, it does show up in the "screenshot", but there is very limited room for text there.

    EDIT: I experimented a bit more. If you let the system hang for some unspecified but long period of time without attempting to recover it, the kernel panic will eventually show up in the "system log" in AWS.

    Depending on your OS, you can install CloudWatch Agent and output logs to CloudWatch Logs.

    The CloudWatch agent runs inside the guest/instance OS, so it will have the same limitations as our current log collector. If the kernel crashes, that service won't be able to either collect or upload anything.


    I think the current conclusion has to be: check the screenshot if we catch the crash as it happens, and otherwise rely on kernel crash dumps.
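
    For anyone landing here later, a sketch of how the screenshot and console output could be captured from outside the instance, so that it still works when the guest kernel is dead. This assumes boto3 and credentials on some other machine (for example a monitoring host reacting to a failed status check); the helper name `capture_crash_evidence` and the file naming are hypothetical.

```python
import base64
import time
from pathlib import Path


def capture_crash_evidence(ec2, instance_id, out_dir):
    """Save the console screenshot (JPG) and console output (text) to out_dir.

    `ec2` is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Returns the two paths that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())

    # get_console_screenshot returns the image base64-encoded in "ImageData".
    shot = ec2.get_console_screenshot(InstanceId=instance_id)
    jpg_path = out / f"{instance_id}-{stamp}.jpg"
    jpg_path.write_bytes(base64.b64decode(shot["ImageData"]))

    # The console output may lag behind the screenshot, so save both.
    log = ec2.get_console_output(InstanceId=instance_id)
    txt_path = out / f"{instance_id}-{stamp}.txt"
    txt_path.write_text(log.get("Output", ""), encoding="utf-8")
    return jpg_path, txt_path
```

    Running this before attempting any recovery keeps whatever the console showed at the time of the hang, even though the screenshot only holds the last screenful of text.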
