I assume that when you say "the whole instance stops responding" you mean that you can't SSH to it and you receive no further logging. Other than increasing your logging to find out what's going on, I can only recommend two things:
- Use the EC2 Serial Console to access the instance.
- Raise a support case to investigate further - the support team can look deeper into the actual instances and physical machines that you are running.
Hello @ksarpatil17,
Hope you are doing well.
I see that your instance becomes unresponsive intermittently, and that it becomes responsive again when you stop and start it. I hope I have understood your issue correctly; if not, please feel free to correct me.
Considering that the instance becomes responsive after a stop-start, it is quite likely that the instance is running short of some resource. It is important to note that CloudWatch monitors only these metrics by default: [+] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html#ec2-cloudwatch-metrics
However, the instance can become unresponsive not only because CPU utilisation hits 100% but also because of memory exhaustion or the EBS volume hitting its IOPS limit. Unfortunately, memory metrics are not available by default in CloudWatch. You will have to use the CloudWatch agent to send them to CloudWatch as custom metrics. More on this here: [+] https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html
If you haven't already done so, you can review your instance's system log. The steps to access the system log are detailed here: [+] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-console.html#instance-console-console-output
- If there is a memory-related issue, you will usually see error messages similar to, but not exactly like, these:
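As a rough illustration of what the agent needs, here is a minimal sketch that generates a CloudWatch agent configuration collecting memory and disk usage. The function name `build_agent_config` and the 60-second interval are my own assumptions, and the measurement names should be checked against the agent documentation linked above before use.

```python
import json


def build_agent_config(interval_seconds: int = 60) -> dict:
    """Build a minimal CloudWatch agent config (hypothetical helper).

    Collects memory and disk usage percentages, which EC2 does not
    publish to CloudWatch by default.
    """
    return {
        "agent": {"metrics_collection_interval": interval_seconds},
        "metrics": {
            "metrics_collected": {
                # Memory used, as a percentage of total memory.
                "mem": {"measurement": ["mem_used_percent"]},
                # Disk space used, for all mounted filesystems.
                "disk": {"measurement": ["used_percent"], "resources": ["*"]},
            }
        },
    }


# Render the config as JSON so it can be saved to a file for the agent.
config_json = json.dumps(build_agent_config(), indent=2)
print(config_json)
```

The printed JSON can be saved to a file and passed to the agent when it is started; where that file lives depends on how the agent was installed.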
/var/log/kern.log.3.gz:1967:Jun 13 14:41:02 cheetah kernel: [347531.586344] Killed process 98254 (chrome) total-vm:889684kB, anon-rss:113068kB, file-rss:0kB, shmem-rss:1784kB
/var/log/kern.log.3.gz:2861:Jun 13 14:41:22 cheetah kernel: [347551.430284] Killed process 102503 (chrome) total-vm:911152kB, anon-rss:104748kB, file-rss:0kB, shmem-rss:1968kB
If you see such errors, there might be processes that are consuming your memory. Some possible troubleshooting steps are described here: [+] https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/
If you do not see any such errors, I recommend checking whether the EBS volume is hitting its maximum IOPS limit. Each volume type has its own IOPS limit. Assuming that you use a gp2 EBS volume, the baseline is 3 IOPS per GiB of volume size, with a minimum of 100 IOPS and a maximum of 16,000 IOPS.
You can calculate the IOPS from the metrics that CloudWatch provides by default, using the formula: IOPS = (VolumeReadOps + VolumeWriteOps) / (total time in minutes * 60 seconds), where VolumeReadOps and VolumeWriteOps are the sums over that time.
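If you have saved a copy of the log, a short script can pick out these "Killed process" lines for you. This is only a sketch: the regular expression is based on the two sample lines above, so it may need adjusting for your kernel's exact message format, and the names `OomKill` and `parse_oom_kill` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Matches the "Killed process <pid> (<name>) ... anon-rss:<n>kB" pattern
# seen in the sample kernel log lines.
OOM_KILL = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?anon-rss:(?P<anon_rss_kb>\d+)kB"
)


class OomKill(NamedTuple):
    pid: int
    name: str
    anon_rss_kb: int


def parse_oom_kill(line: str) -> Optional[OomKill]:
    """Return details of a killed process, or None if the line does not match."""
    match = OOM_KILL.search(line)
    if match is None:
        return None
    return OomKill(int(match["pid"]), match["name"], int(match["anon_rss_kb"]))


sample = (
    "Jun 13 14:41:02 cheetah kernel: [347531.586344] Killed process 98254 "
    "(chrome) total-vm:889684kB, anon-rss:113068kB, file-rss:0kB, shmem-rss:1784kB"
)
print(parse_oom_kill(sample))
```

Running this over every line of the log and counting the results per process name shows which processes are being killed most often.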
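To estimate the limit for your own volume, here is a small sketch that assumes the documented gp2 rule of 3 IOPS per GiB with a floor of 100 IOPS and a cap of 16,000 IOPS. It ignores burst credits, and the function name is my own.

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS of a gp2 volume: 3 per GiB, at least 100, at most 16,000."""
    if size_gib <= 0:
        raise ValueError("size_gib must be positive")
    return max(100, min(3 * size_gib, 16_000))


for size in (20, 500, 10_000):
    print(size, "GiB ->", gp2_baseline_iops(size), "IOPS")
```

A 20 GiB volume therefore gets only the 100 IOPS floor, which is why small gp2 volumes run into the limit so easily once their burst credits are used up.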
More details on calculating the IOPs here: [+] https://onica.com/blog/managed-services/calculate-aws-ebs-volume-iops/
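The formula can be sketched as a small function. The inputs are assumed to be the Sum statistic of VolumeReadOps and VolumeWriteOps over the chosen period, copied from CloudWatch; the function name and the sample numbers are illustrative only.

```python
def average_iops(read_ops_sum: float, write_ops_sum: float, period_minutes: float) -> float:
    """Average IOPS over a period, from the summed read and write operation counts."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return (read_ops_sum + write_ops_sum) / (period_minutes * 60)


# Made-up example: 90,000 reads and 30,000 writes in a 5-minute period.
print(average_iops(90_000, 30_000, 5))
```

If the result stays close to the volume's IOPS limit during the periods when the instance hangs, the volume is a likely bottleneck.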
If the EBS volume is hitting the IOPS limit, please increase the volume size to raise its IOPS, switch to a different EBS volume type, or change your application's configuration to lower its IOPS.
If you do not see any of the above issues, please feel free to reach out to AWS Premium Support to dive deep on this.
Hope it helps.
Regards, Harshavardhan Gowda
Thank you for the answer, and yes, I can't even SSH to it. I tried the EC2 Serial Console, but even with that I was not able to get into the instance. Whenever this happens, I stop and start the instance to keep my server running. I will raise a support case now.