EC2 Instance stops working after some time


Hello All,

I have deployed an AI model within a Django application. It has just one REST API, which invokes the AI model. It works perfectly fine for some time, but the whole instance stops responding after a while.

I am using a c5.4xlarge instance, and the CPU utilization in CloudWatch is 11-12% at most. I have tried both with Docker and without Docker, and the result is the same.

Please help me with this....

2 Answers

I assume that when you say "the whole instance stops responding", you mean that you can't SSH to it and you receive no further logging. Other than increasing your logging to find out what's going on, I can only recommend two things:

  1. Use the EC2 Serial Console to access the instance.
  2. Raise a support case to investigate further - the support team can look deeper into the actual instances and physical machines that you are running.
AWS EXPERT, answered 2 years ago
  • Thank you for the answer, and yes, I can't even SSH to it. I tried the EC2 Serial Console, but even with that I was not able to get into the instance. Whenever this happens, I stop and start the instance to keep my server running. I will raise a support case now.


Hello @ksarpatil17,

Hope you are doing well.

I understand that your instance becomes unresponsive intermittently and becomes responsive again when you stop and start it. I hope I have your issue right; if not, please feel free to correct me.

Considering that the instance becomes responsive after a stop-start, it is quite possible that the instance is running short of a resource. It is important to note that CloudWatch only monitors these metrics by default: [+] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html#ec2-cloudwatch-metrics

However, an instance can become unresponsive not only when CPU utilisation hits 100% but also because of memory exhaustion or the EBS volume reaching its IOPS limit. Unfortunately, memory metrics are not available by default in CloudWatch. You will have to use the CloudWatch agent to send them to CloudWatch as custom metrics. More on this here: [+] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html

If you haven't already, you can try reviewing your instance's system log. The steps to access the system log are detailed here: [+] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-console.html#instance-console-console-output
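As a quick check before setting up the agent, memory usage can also be read on the instance itself. Below is a minimal Python sketch, assuming a Linux instance whose /proc/meminfo exposes the MemTotal and MemAvailable fields. The function name memory_used_percent and the sample values are made up for illustration.

```python
def memory_used_percent(meminfo_text):
    """Return used memory as a percentage, given the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        # Lines look like "MemTotal:       32000000 kB"
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


# Illustrative sample only; on the instance you would instead pass the
# contents of /proc/meminfo, e.g. open("/proc/meminfo").read().
sample_meminfo = "MemTotal:       32000000 kB\nMemAvailable:    8000000 kB\n"
sample_used = memory_used_percent(sample_meminfo)
```

Logging this value periodically from the application would show whether memory climbs steadily before the instance stops responding.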

  • If there is a memory-related issue, you will usually see error messages similar to, though not exactly like, the following:
/var/log/kern.log.3.gz:1967:Jun 13 14:41:02 cheetah kernel: [347531.586344] Killed process 98254 (chrome) total-vm:889684kB, anon-rss:113068kB, file-rss:0kB, shmem-rss:1784kB
/var/log/kern.log.3.gz:2861:Jun 13 14:41:22 cheetah kernel: [347551.430284] Killed process 102503 (chrome) total-vm:911152kB, anon-rss:104748kB, file-rss:0kB, shmem-rss:1968kB

If you see such errors, one or more processes may be consuming all of your memory. Some possible troubleshooting steps are: [+] https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/
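To spot these lines in a long log, a small script can help. This is a minimal Python sketch, assuming the kernel messages follow the "Killed process PID (name) ... anon-rss:NkB" layout shown in the sample above. The exact wording varies between kernel versions, and the function name find_oom_kills is hypothetical.

```python
import re

# Matches kernel OOM-killer lines such as:
#   "... Killed process 98254 (chrome) total-vm:889684kB, anon-rss:113068kB, ..."
# Assumption: this layout matches your kernel; adjust the pattern if it does not.
OOM_KILL = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def find_oom_kills(log_text):
    """Return one dict per OOM kill found in the given log text."""
    kills = []
    for line in log_text.splitlines():
        match = OOM_KILL.search(line)
        if match:
            kills.append({
                "pid": int(match.group("pid")),
                "name": match.group("name"),
                "anon_rss_kb": int(match.group("anon_rss_kb")),
            })
    return kills


# One of the sample lines from this answer, used as input.
sample_log = (
    "Jun 13 14:41:02 cheetah kernel: [347531.586344] Killed process 98254 "
    "(chrome) total-vm:889684kB, anon-rss:113068kB, file-rss:0kB, shmem-rss:1784kB"
)
sample_kills = find_oom_kills(sample_log)
```

The process name that shows up most often in the result is usually the first place to look.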

If you do not see any such errors, I recommend you check whether the EBS volume is hitting its maximum IOPS limit. The limit differs by volume type. Assuming that you use a gp2 EBS volume, the baseline is 3 IOPS per GiB of volume size, with a minimum of 100 IOPS and a maximum of 16,000 IOPS. Volumes smaller than 1 TiB can also burst to 3,000 IOPS while they have burst credits.

You can calculate the IOPS from the metrics that CloudWatch provides by default, summing each metric over the period you are looking at: IOPS = (VolumeReadOps + VolumeWriteOps) / (total time in minutes * 60 seconds)
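For reference, the gp2 baseline can be worked out from the volume size. This is a minimal Python sketch, assuming the documented gp2 rule of 3 IOPS per GiB with a floor of 100 IOPS and a cap of 16,000 IOPS. It ignores burst credits, and the function name gp2_baseline_iops is hypothetical.

```python
def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a gp2 volume: 3 per GiB, floor 100, cap 16,000."""
    return min(max(3 * size_gib, 100), 16_000)


# A few illustrative sizes (in GiB) and their baselines.
baselines = {size: gp2_baseline_iops(size) for size in (8, 100, 1000, 10000)}
```

Under these assumptions a small root volume sits at the 100 IOPS floor, so a workload with sustained disk activity can run out of burst credits and slow down sharply.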

More details on calculating the IOPs here: [+] https://onica.com/blog/managed-services/calculate-aws-ebs-volume-iops/
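The formula above can be sketched in a few lines of Python. The inputs are assumed to be the Sum statistic of VolumeReadOps and VolumeWriteOps over the chosen period. The function name average_iops and the sample numbers are made up for illustration.

```python
def average_iops(read_ops, write_ops, period_minutes):
    """Average IOPS over a period, from summed VolumeReadOps/VolumeWriteOps."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return (read_ops + write_ops) / (period_minutes * 60)


# Illustrative numbers: 90,000 reads and 90,000 writes in a 5-minute period.
sample_iops = average_iops(90_000, 90_000, 5)
```

Comparing this figure with the volume's IOPS limit shows how close the volume is to being throttled.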

If the EBS volume is hitting the IOPS limit, you can increase the volume size to raise the limit, switch to a different EBS volume type, or change your application's configuration to lower its IOPS.

If none of the above applies, please feel free to reach out to AWS Premium Support to dive deeper into this.

Hope it helps.

Regards, Harshavardhan Gowda

AWS SUPPORT ENGINEER, answered 2 years ago
