I want to force a status check failure on one of my EC2 instances to test a status check metric alarm that I recently created.

I want to test my CloudWatch alarm, which has the following properties:

  • Type: Metric alarm
  • Description: StatusCheckFailed - awsec2-i-0a61dbf476a41166f-GreaterThanOrEqualToThreshold
  • Threshold: StatusCheckFailed >= 0.99 for 1 datapoints within 5 minutes
  • Metric name: StatusCheckFailed

2 Answers
Accepted Answer

You can do this by putting load on the CPU or by exhausting the instance's memory. Several conditions can cause an EC2 instance status check to fail:

  • Exhausted memory
  • File system issues
  • CPU exhaustion

If you just want to simulate a failure, pick a very small instance type, set up the CloudWatch alarm as you did, and then stress the CPU or memory until the instance status check fails.

Here is how you can generate load on CPU.
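
One possible way to generate that load, sketched in Python under the assumption that Python 3 is installed on the instance. This is not necessarily the procedure from the linked document, and the function names `burn` and `burn_all_cores` are made up for illustration. It starts one busy loop per CPU core for a fixed duration.

```python
import multiprocessing
import time


def burn(seconds: float) -> int:
    """Spin in a tight loop for roughly `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def burn_all_cores(seconds: float, workers=None) -> list:
    """Run one busy loop per CPU core (or per `workers`) in parallel processes."""
    workers = workers or multiprocessing.cpu_count()
    with multiprocessing.Pool(workers) as pool:
        return pool.map(burn, [seconds] * workers)
```

For example, calling `burn_all_cores(600)` from a script (inside an `if __name__ == "__main__":` guard) would keep every core busy for about ten minutes; the duration is an arbitrary choice. High CPU utilization alone is not guaranteed to fail the instance status check.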

Additional reference:

Hope this helps.

Comment here if you have additional questions, happy to help.

Abhishek
AWS EXPERT, answered a year ago
Reviewed a year ago by iBehr (AWS EXPERT)
Reviewed a year ago by an AWS EXPERT
  • I generated load on the CPU as described in the document. CPU utilization is showing 99.7%, but I don't know why the status check is not failing. Please check whether I have set the correct threshold for the alarm to trigger:

    StatusCheckFailed >= 0.99 for 1 datapoints within 5 minutes

  • Set it to StatusCheckFailed >= 1 for 1 datapoint within 1 minute.
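
A minimal sketch of that alarm definition, expressed as keyword arguments for the boto3 CloudWatch client's `put_metric_alarm` call. The helper name, the alarm name, and the choice of the Maximum statistic are assumptions for illustration; they do not come from the thread.

```python
def status_check_alarm_params(instance_id: str) -> dict:
    """Build put_metric_alarm arguments for StatusCheckFailed >= 1, 1 datapoint in 1 minute."""
    return {
        "AlarmName": f"StatusCheckFailed-{instance_id}",  # hypothetical alarm name
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",   # assumption: the thread does not name a statistic
        "Period": 60,             # seconds, i.e. a 1-minute period
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


# Usage (requires boto3 and AWS credentials; not executed here):
#   import boto3
#   params = status_check_alarm_params("i-0a61dbf476a41166f")
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```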

When you say you defined your threshold as "StatusCheckFailed >= 0.99 for 1 datapoints within 5 minutes", I assume that means you have set a period of 5 minutes. This implies that every minute, the alarm aggregates the 5 latest minutes of data with the statistic you have chosen, and evaluates the result against the threshold.

Let's take an example: from 9:00am to 9:04am your CPU is at 50%, then from 9:04am to 9:05am it jumps to 99.7%.

When the alarm evaluates the 9:05am datapoint, it retrieves all the data from 9:00 to 9:05 because you have set a period of 5 minutes, and it aggregates this data with the statistic you have set. If you set MAX as a statistic, then the value should be 99.7%. But if you set average as a statistic, then the value is 59.94% (the average of the 5 values corresponding to every minute: 50, 50, 50, 50, 99.7).
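
The arithmetic in this example can be checked with a short Python sketch. The sample values are the hypothetical per-minute CPU readings from the example, the function names are illustrative, and this is not a description of how CloudWatch is implemented internally. The last helper works out how high a final reading must be for the average to reach a target value.

```python
from statistics import mean

# One hypothetical reading per minute from 9:00 to 9:05, as in the example.
samples = [50, 50, 50, 50, 99.7]


def aggregate(datapoints, statistic):
    """Collapse the readings of one period into a single value."""
    if statistic == "Average":
        return mean(datapoints)
    if statistic == "Maximum":
        return max(datapoints)
    raise ValueError(f"unsupported statistic: {statistic}")


def min_last_value(others, target_average):
    """Smallest final reading that brings the overall average up to the target."""
    return target_average * (len(others) + 1) - sum(others)


print(round(aggregate(samples, "Average"), 2))   # 59.94
print(aggregate(samples, "Maximum"))             # 99.7
print(min_last_value([100, 100, 100, 100], 99))  # 95
```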

This implies that if you define your threshold as "StatusCheckFailed >= 0.99 for 1 datapoints within 5 minutes" using the average statistic, the alarm will trigger only if the CPU remains at very high usage for the whole 5 minutes (if 4 datapoints are equal to 100%, the 5th would need to be at least 95% for the average to be 99%). This alarm would detect only a very high sustained stress on your CPU. If that is not your intention, you may need to modify the definition of your alarm. There are several options, depending on what you want to do.

For example, secondabhi_aws suggests changing the period so that the alarm triggers on 1 datapoint within 1 minute. This would fail fast (as soon as the CPU is above 99% for 1 minute) and may be what you are looking for. Another option is to keep the 5-minute period but use a different statistic, e.g. Maximum in your case. Compared with the previous option, the alarm would take longer to return to OK and would be less likely to flip between OK and ALARM if your CPU usage varies quickly around the threshold.

However, both options may react to short peaks.

If you expect peaks and prefer to detect a steady increase, you could instead lower the 0.99 threshold and decide which average CPU usage you want to alarm on.

Jsc (AWS), answered a year ago
Reviewed 5 months ago by an EXPERT
