AWS EC2 Instance going down unexpectedly


I have an internet-facing web application running on an AWS EC2 m5a.xlarge instance in the Mumbai region. The service runs as a daemon ( sudo systemctl start service_name ) which is executed through a unix/perl script. I have noticed that every day the EC2 instance goes down with a message in the console (1/2 health checks passed). I then have to reboot the instance manually from the console and restart the service. The security group rule attached to the EC2 instance is open on all ports. Is it because of that? Kindly suggest. Thanks!

asked 9 months ago, 401 views
2 Answers
Accepted Answer

Answering your last question first: the security group rules wouldn't cause your EC2 instance to crash, although having every port open would be helpful to a bad actor who wants to DDoS your Linux host. You shouldn't need all ports to be open; could you limit it to just those that are needed (you will know better than I do what these are)?
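
If it helps, here is a rough sketch of tightening the rules with the AWS CLI (the security group ID, ports and CIDR ranges below are placeholders, so adjust them to whatever your application actually needs):

    # Allow only the ports the application uses, e.g. HTTPS from anywhere.
    aws ec2 authorize-security-group-ingress \
        --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 443 --cidr 0.0.0.0/0

    # Restrict SSH to a known admin range rather than the whole internet.
    aws ec2 authorize-security-group-ingress \
        --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 22 --cidr 203.0.113.0/24

    # Then remove the existing allow-all-traffic rule.
    aws ec2 revoke-security-group-ingress \
        --group-id sg-0123456789abcdef0 \
        --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]'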

You mentioned that each time you've rebooted the instance you have to manually restart the service, which sounds unusual for a service that is intended to run as a daemon. Is this expected, or is there something on the Linux host that is preventing it from starting automatically (presence of a lock file perhaps, or something like that)?
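
As an illustration only (the paths and unit contents here are assumptions, and service_name is just the placeholder from your question), a unit along these lines would restart the daemon after a failure and bring it up automatically after a reboot:

    # Sketch of /etc/systemd/system/service_name.service
    sudo tee /etc/systemd/system/service_name.service > /dev/null <<'EOF'
    [Unit]
    Description=Web application daemon
    After=network.target

    [Service]
    ExecStart=/opt/app/start.sh
    Restart=on-failure
    RestartSec=5

    [Install]
    WantedBy=multi-user.target
    EOF

    # Reload systemd and enable the unit so it starts at boot.
    sudo systemctl daemon-reload
    sudo systemctl enable --now service_name.service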

I take it that as the EC2 instance doesn't actually crash or panic, there won't be a dump file that would assist with root cause analysis. When you have to go in and restart the instance in the AWS Console, can you go to the Monitoring tab to check whether there is anything of concern there, such as excessive CPU consumption or the network being saturated?
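
If the CLI is more convenient than the console, something like this (the instance ID and time window are placeholders) pulls the same CPU metric from CloudWatch:

    # Maximum CPU utilisation in 5-minute buckets over a chosen window.
    aws cloudwatch get-metric-statistics \
        --namespace AWS/EC2 \
        --metric-name CPUUtilization \
        --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
        --start-time 2023-01-01T00:00:00Z \
        --end-time 2023-01-01T06:00:00Z \
        --period 300 \
        --statistics Maximum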

I would advise setting up the CloudWatch agent to collect more detailed system metrics and logs https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html - this may show that your root cause is exhaustion of some system resource.
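
As a rough sketch of what that could look like (a minimal assumed configuration - the agent's own wizard will generate something more complete), you could collect memory and swap utilisation like so:

    # Minimal CloudWatch agent config collecting memory/swap usage every 60 seconds.
    sudo tee /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json > /dev/null <<'EOF'
    {
      "metrics": {
        "append_dimensions": { "InstanceId": "${aws:InstanceId}" },
        "metrics_collected": {
          "mem":  { "measurement": ["mem_used_percent"],  "metrics_collection_interval": 60 },
          "swap": { "measurement": ["swap_used_percent"], "metrics_collection_interval": 60 }
        }
      }
    }
    EOF

    # Load the config and (re)start the agent.
    sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
        -a fetch-config -m ec2 \
        -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json -s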

EXPERT
Steve_M
answered 9 months ago
  • Thanks Steve. I was able to get the system log and it shows an OOM, but it occurred during off-peak hours, 27-Jul-2023 02:30 to 03:30 UTC. Here is the system log which shows the OOM error.

    [29361.704649] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/my_svc.service,task=java,pid=2584,uid=0
    [29361.707472] Out of memory: Killed process 2584 (java) total-vm:20205608kB, anon-rss:9132500kB, file-rss:0kB, shmem-rss:32kB, UID:0 pgtables:18804kB oom_score_adj:0
    [242840.920963] G1 Young RemSet invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
    [242840.922929] CPU: 3 PID: 114623 Comm: G1 Young RemSet Not tainted 6.1.34-59.116.amzn2023.x86_64 #1

  • Stating the obvious here, OOM means you're out of memory. Even outside of peak time the system will still be consuming some resources just to stay up and running. Memory use may have increased in the hours leading up to this and never been freed, and/or something that you're running has a memory leak, and/or something else. I notice it's a Java process mentioned in the logfile extract, so the garbage collection and heap settings would be a good place to look (I've added a rough heap-sizing sketch at the end of this thread).

    Did you manage to install CloudWatch agent, and did this offer any more details?

    It may be as simple as the workload not being suited to an m5a.xlarge, in which case you need to look at moving up to a larger instance type.

  • That's right Steve. Appreciate your help and support. I am tuning my application's JVM config settings to match the machine config. It looks like I was assigning 15 GiB to the service whereas the total memory of my instance is 16 GiB, hence it was resulting in OOM. Thanks a lot for the pointer :)

  • You're welcome Santosh, glad I've helped.

    If this has got you to the root cause, could you accept my initial answer, as this will assist other users with the same problem who find this question in future.
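
    For future readers, here is the rough heap-sizing sketch mentioned above (these are standard HotSpot flags, but the numbers and jar name are placeholders - the right values depend on the workload): leave a few GiB of the 16 GiB host for the kernel, page cache and the JVM's own non-heap memory, for example:

    # Cap the heap well below total RAM so the kernel OOM killer isn't triggered.
    java -Xms4g -Xmx12g -jar my_app.jar

    # Or, on JDK 10+, size the heap as a fraction of the machine's RAM.
    java -XX:MaxRAMPercentage=75.0 -jar my_app.jar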


Hi, you should get more details about which of your status checks is failing.

Basically, there are 2 types: System status checks and Instance status checks.

See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html for all details

Knowing which fails will allow you to get to the root cause of your problem: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html#viewing_status
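
You can also see which of the two checks is failing from the CLI (the instance ID is a placeholder):

    # Reports SystemStatus and InstanceStatus separately for the instance.
    aws ec2 describe-instance-status \
        --instance-ids i-0123456789abcdef0 \
        --include-all-instances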

Update based on Santhosh's comment:
Since it's a reachability problem, you should follow this guidance to get to the root cause: https://repost.aws/knowledge-center/ec2-linux-status-check-failure

Also get the system logs to see whether something - for example, the service you start - is causing a problem reported there. You can retrieve them this way: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/TroubleshootingInstances.html#troubleshooting-retrieve-system-logs
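
If you prefer the CLI, something along these lines (the instance ID is a placeholder) retrieves that console/system log:

    # Fetch the instance console output; --latest requests the most recent
    # output on instance types that support it.
    aws ec2 get-console-output \
        --instance-id i-0123456789abcdef0 \
        --latest \
        --output text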

Best, Didier

AWS
EXPERT
answered 9 months ago
  • Thanks Didier. The Instance status check is failing on a daily basis. From the CPU utilization I observed that it's a bit high during some parts of the day - but not always.

  • It says instance reachability check failed.

  • see my update in answer

  • The OOM mentioned below is probably the root cause of your issue: your Java code probably has some leaks. If not, you are not allocating enough memory to the Java heap; try increasing it in that case.
