I want to force a simulated status check failure on one of my EC2 instances to test a status check metric alarm that I recently created.


I want to test my CloudWatch alarm, which has the following properties:

  • Type: Metric alarm
  • Description: StatusCheckFailed - awsec2-i-0a61dbf476a41166f-GreaterThanOrEqualToThreshold
  • Threshold: StatusCheckFailed >= 0.99 for 1 datapoints within 5 minutes
  • Metric name: StatusCheckFailed

Asked 8 months ago, 652 views
2 Answers
Accepted Answer

You can do this by putting load on the CPU or by exhausting the instance's memory. Several conditions can cause an EC2 instance status check to fail:

  • Exhausted memory
  • File system issues
  • CPU exhaustion

If you just want to simulate a failure, pick a very small instance type, set up the CloudWatch alarm as you did, then stress the CPU or memory until the instance status check fails.

Here is how you can generate load on CPU.
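
As a rough illustration of generating CPU load, here is a minimal standard-library sketch. It is not the document referenced in this answer, only a stand-in: `burn` and `stress_cpu` are hypothetical names, and the approach (one busy-loop process per core) is an assumption about what "load" means here. Whether sustained CPU load alone makes the instance status check fail will depend on the instance.

```python
import multiprocessing as mp
import time
from typing import List, Optional


def burn(seconds: float) -> int:
    """Spin in a tight loop for roughly `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def stress_cpu(seconds: float, workers: Optional[int] = None) -> List[int]:
    """Run `burn` in `workers` processes (default: one per CPU core)."""
    workers = workers or mp.cpu_count()
    with mp.Pool(processes=workers) as pool:
        # Each worker busy-loops for the same duration, then reports back.
        return pool.map(burn, [seconds] * workers)
```

To use it, call something like `stress_cpu(300)` from under an `if __name__ == "__main__":` guard so the load lasts for one full 5-minute alarm period. The 300-second figure is only an example.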

Additional reference:

Hope this helps.

Comment here if you have additional questions, happy to help.

Abhishek

AWS Expert, answered 8 months ago
Reviewed 8 months ago by iBehr (AWS Expert)
Reviewed 8 months ago by an AWS Expert
  • I generated the load on the CPU as per the document. The CPU is showing 99.7%, but I don't know why the status check is not failing. Please check whether I have set the correct threshold for the alarm to trigger:

    StatusCheckFailed >= 0.99 for 1 datapoints within 5 minutes

  • Set it to StatusCheckFailed >= 1 for 1 datapoint within 1 minute.


When you say you defined your threshold as "StatusCheckFailed >= 0.99 for 1 datapoints within 5 minutes", I assume you have set a period of 5 minutes. This means that every minute, the alarm aggregates the latest 5 minutes of data using the statistic you chose and evaluates the result against the threshold.

Let's take an example: from 9:00am to 9:04am your CPU is at 50%, then from 9:04am to 9:05am it jumps to 99.7%.

When the alarm evaluates the 9:05am datapoint, it retrieves all the data from 9:00 to 9:05 because you have set a period of 5 minutes, and it aggregates this data with the statistic you have set. If you chose Maximum as the statistic, the value is 99.7%. But if you chose Average, the value is 59.94% (the average of the 5 per-minute values: 50, 50, 50, 50, 99.7).
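
The arithmetic in this example can be checked with a short sketch. `aggregate` is a hypothetical helper that mimics the two statistics discussed here; it is not a CloudWatch API.

```python
from statistics import mean

# Per-minute CPU readings (percent) from the 9:00-9:05 example.
datapoints = [50, 50, 50, 50, 99.7]


def aggregate(values, statistic):
    """Collapse one period's datapoints into a single value."""
    if statistic == "Average":
        return mean(values)
    if statistic == "Maximum":
        return max(values)
    raise ValueError(f"unsupported statistic: {statistic}")


for stat in ("Average", "Maximum"):
    print(stat, round(aggregate(datapoints, stat), 2))
# Average 59.94
# Maximum 99.7
```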

This implies that if you define your threshold as "StatusCheckFailed >= 0.99 for 1 datapoints within 5 minutes" using the Average statistic, the alarm will trigger only if the CPU remains at very high usage for the whole 5 minutes (if 4 datapoints are equal to 100%, the 5th would need to be at least 95% for the average to reach 99%). This alarm would detect only very high, sustained stress on your CPU. If that is not your intention, you may need to modify the definition of your alarm. There are several options, depending on what you want to do.
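
A small sketch of that reasoning follows, with the threshold written as 99 to match the CPU percentages in the example. `breaches` and `min_last_value` are hypothetical helpers. The last two lines apply the same averaging to StatusCheckFailed datapoints, on the understanding that this metric reports either 0 (passed) or 1 (failed).

```python
def breaches(values, threshold):
    """True if the average of `values` is at or above `threshold`."""
    return sum(values) / len(values) >= threshold


def min_last_value(others, threshold):
    """Smallest final datapoint that lifts the period average to `threshold`."""
    return threshold * (len(others) + 1) - sum(others)


# Four minutes at 100%: the fifth must reach 95% for an average of 99%.
print(min_last_value([100, 100, 100, 100], 99))  # 95

# The 9:00-9:05 example averages 59.94%, so the alarm stays in OK.
print(breaches([50, 50, 50, 50, 99.7], 99))  # False

# With 0/1 datapoints, one failed minute out of five averages 0.2.
print(breaches([0, 0, 0, 0, 1], 0.99))  # False
print(breaches([1, 1, 1, 1, 1], 0.99))  # True
```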

For example, secondabhi_aws suggests changing the period so that the alarm triggers on 1 datapoint within 1 minute. This would fail fast (as soon as the CPU is above 99% for 1 minute) and may be what you are looking for. Another option is to keep the 5 minutes but use a different statistic, e.g. Maximum in your case. The difference from the previous solution is that the alarm would take longer to return to OK and would be less likely to flip between OK and ALARM if your CPU usage varies quickly around the limit.
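
To make the difference concrete, here is a simplified simulation of the two options. It is only a sketch: `alarm_states` and `transitions` are hypothetical helpers, the CPU series is invented for illustration, and the model ignores details of real CloudWatch evaluation such as missing data.

```python
def alarm_states(series, window, threshold):
    """State per minute, using Maximum over the last `window` datapoints."""
    states = []
    for i in range(len(series)):
        recent = series[max(0, i - window + 1): i + 1]
        states.append("ALARM" if max(recent) >= threshold else "OK")
    return states


def transitions(states):
    """Count how many times the state changes from one minute to the next."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


# Hypothetical per-minute CPU readings hovering around the 99% limit.
cpu = [50, 99.5, 60, 99.7, 70, 99.2, 55, 50, 50, 50, 50, 50]

one_minute = alarm_states(cpu, window=1, threshold=99)
five_minute_max = alarm_states(cpu, window=5, threshold=99)

print(transitions(one_minute))       # 6
print(transitions(five_minute_max))  # 2
```

In this series the 1-minute alarm flips six times, while the 5-minute Maximum alarm enters ALARM once and returns to OK only after the last peak has left the window.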

However, both options may react to short peaks.

If you expect peaks and prefer to detect a steady increase, you could instead lower the 0.99 threshold and decide what average CPU usage you want to alarm on.

Jsc (AWS), answered 8 months ago
Reviewed a month ago by an Expert
