You can do so by putting load on the CPU or by exhausting the instance's memory. Several things can cause an EC2 instance status check to fail:
- Exhausted memory
- File system issues
- CPU exhaustion
If you just want to simulate a failure, pick a very small instance type, set up CloudWatch as you already did, and then stress the CPU or memory until the instance status check fails.
Here is how you can generate load on CPU.
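As a rough sketch of CPU load generation in Python: one busy loop per logical CPU, started with the standard `multiprocessing` module. The names `burn_cpu` and `burn_all_cores` and the file name `burn.py` are made up for this example. It only generates load; it does not by itself guarantee that a status check fails.

```python
import multiprocessing
import sys
import time


def burn_cpu(seconds: float) -> int:
    """Spin in a tight loop for roughly `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def burn_all_cores(seconds: float) -> None:
    """Run one busy loop per logical CPU so that every core is saturated."""
    workers = [
        multiprocessing.Process(target=burn_cpu, args=(seconds,))
        for _ in range(multiprocessing.cpu_count())
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()


# Runs only when a duration in seconds is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    burn_all_cores(float(sys.argv[1]))
```

Saved as `burn.py` (hypothetical name), `python burn.py 300` would keep all cores busy for about five minutes.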
Additional reference:
Hope this helps.
Comment here if you have additional questions, happy to help.
Abhishek
When you say you defined your threshold as "StatusCheckFailed >= 0.99 for 1 datapoints within 5 minutes", I assume that means you have set a period of 5 minutes. This implies that every minute, the alarm aggregates the 5 latest minutes of data with the statistic you have chosen, and evaluates the result against the threshold.
Let's take an example: from 9:00am to 9:04am your CPU is at 50%, then from 9:04 to 9:05 it jumps to 99.7%.
When the alarm evaluates the 9:05am datapoint, it retrieves all the data from 9:00 to 9:05 because you have set a period of 5 minutes, and it aggregates this data with the statistic you have set. If you set Maximum as the statistic, the value is 99.7%. But if you set Average as the statistic, the value is 59.94% (the average of the five one-minute values: 50, 50, 50, 50, 99.7).
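To check the arithmetic, here is a minimal sketch that aggregates the five one-minute values from the example with both statistics. The variable names are illustrative only.

```python
from statistics import mean

# One value per minute from 9:00 to 9:05, as in the example above.
samples = [50.0, 50.0, 50.0, 50.0, 99.7]

maximum = max(samples)
average = mean(samples)

print(f"Maximum: {maximum:.2f}%")  # Maximum: 99.70%
print(f"Average: {average:.2f}%")  # Average: 59.94%
```

With the Average statistic, the single high minute is diluted by the four quiet ones, so the aggregated value stays far below a 99% threshold.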
This implies that if you define your threshold as "StatusCheckFailed >= 0.99 for 1 datapoints within 5 minutes" using the average statistic, the alarm will trigger only if the CPU remains at very high usage for the whole 5 minutes (if 4 datapoints are equal to 100%, the 5th would need to be at least 95% for the average to be 99%). This alarm would detect only a very high sustained stress on your CPU. If that is not your intention, you may need to modify the definition of your alarm. There are several options, depending on what you want to do.
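The "at least 95%" figure can be reproduced by solving the average for the last datapoint. This is a small sketch; `required_last_value` is a hypothetical helper, not part of any AWS tooling.

```python
def required_last_value(threshold: float, other_values: list[float]) -> float:
    """Smallest final datapoint for which the window average reaches `threshold`."""
    window_size = len(other_values) + 1
    return threshold * window_size - sum(other_values)


# Four minutes at 100%: the fifth must be at least 95% for a 99% average.
print(required_last_value(99.0, [100.0] * 4))  # 95.0

# Four minutes at 50%: the fifth would need 295%, which is impossible.
print(required_last_value(99.0, [50.0] * 4))   # 295.0
```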
For example, secondabhi_aws suggests reviewing your period so that the alarm triggers for 1 datapoint within 1 minute. This would fail fast (as soon as the CPU is above 99% for 1 minute) and may be what you are looking for. Another option is to keep the 5-minute period but use a different statistic, e.g. Maximum in your case. Compared with the previous solution, the alarm would take longer to return to OK and would be less likely to flip between OK and ALARM if your CPU usage varies quickly around the limit.
However, both options may react to short peaks.
If you expect peaks and would rather detect a sustained increase, you could lower the 0.99 threshold and decide which average CPU usage you actually want to alarm on instead.
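The options above can be compared with a simplified simulation. It follows the model described in this answer (every minute, the last `period` minutes are aggregated and compared with the threshold) and is not a reproduction of CloudWatch's exact evaluation logic. The trace and the function name `alarm_minutes` are assumptions for illustration.

```python
from statistics import mean


def alarm_minutes(samples, period, statistic, threshold):
    """Return the minute indexes at which the aggregated window reaches the threshold."""
    breaching = []
    for minute in range(len(samples)):
        # Window covers the last `period` minutes (shorter at the start of the trace).
        window = samples[max(0, minute - period + 1): minute + 1]
        if statistic(window) >= threshold:
            breaching.append(minute)
    return breaching


# Hypothetical trace: steady 50% with a single one-minute peak at minute 4.
trace = [50.0, 50.0, 50.0, 50.0, 99.7, 50.0, 50.0, 50.0, 50.0, 50.0]

print(alarm_minutes(trace, 1, max, 99.0))   # [4]
print(alarm_minutes(trace, 5, max, 99.0))   # [4, 5, 6, 7, 8]
print(alarm_minutes(trace, 5, mean, 99.0))  # []
```

In this model the 1-minute period reacts to the peak and recovers immediately, the 5-minute Maximum stays in breach until the peak leaves the window, and the 5-minute Average never reaches the threshold.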
I generated the load on the CPU as per the document. The CPU is showing 99.7%, but I don't know why the status check is not failing. Please check whether I have set the correct threshold for the alarm to trigger:
Metric: StatusCheckFailed. Threshold: StatusCheckFailed >= 0.99 for 1 datapoints within 5 minutes
Set it to StatusCheckFailed >= 1 for 1 datapoints within 1 minute.
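For reference, that alarm definition might look like the following boto3 sketch. The alarm name, the example instance ID, and the choice of the Maximum statistic are assumptions; the thread does not say which statistic is in use. `create_alarm` needs the `boto3` package and valid AWS credentials.

```python
def status_check_alarm_params(instance_id: str) -> dict:
    """Build put_metric_alarm arguments for 'StatusCheckFailed >= 1, 1 datapoint, 1 minute'."""
    return {
        "AlarmName": f"status-check-failed-{instance_id}",  # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",  # assumption: not stated in the thread
        "Period": 60,            # seconds, i.e. 1 minute
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


def create_alarm(instance_id: str) -> None:
    """Create or update the alarm in the account's default region."""
    import boto3  # imported here so the parameter builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**status_check_alarm_params(instance_id))
```

Calling `create_alarm("i-0123456789abcdef0")` with a real instance ID in place of the placeholder would create the alarm.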