Best practice - log group for cluster servers - watch for an error, alert on it AND which server threw the error


Sorry for the long post, but I want to give all the info so I can get a straightforward direction. I'm getting great help from this group as I finally get this service moved to a true auto-scaling cluster; then I can use the knowledge to do the rest of the services following the same practice. The old setup was a server imaged and launched, and each had its own log, metrics, and CloudWatch alarm, so I knew when server 5 failed. But we were limited: if one failed we needed a new one, and I was manually creating AMIs, manually updating dashboards, creating alarms, etc.

So I have a server farm that is part of an auto-scaling group, and all servers run an internal process that outputs a health log in JSON format. There are 10 or so internal checks, but a basic system_ok, system_warning, etc. is all I care about for this first part so I can get this into production.

Now with the change I have one log group, and each node writes to its own stream. I have a metric filter with a pattern of just 'system_ok', a metric value of 1, and the default of 0. My thought was that when it sees a "system_ok" it will keep the metric at 1. But then I re-thought it: if I have 2 servers and only 1 says OK, things will still seem OK - or is each stream checked on each pass? My alarm was set to trigger on 1 miss.
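For reference, here is roughly what that filter looks like today, written as a boto3 sketch rather than the console steps I actually used (the log group and namespace names are just placeholders):

import boto3

logs = boto3.client("logs")

# Current setup: one metric filter on the shared log group that matches the
# JSON status field and publishes 1 per match, 0 otherwise.
logs.put_metric_filter(
    logGroupName="/myapp/health",                 # placeholder log group name
    filterName="system-ok",
    filterPattern='{ $.status = "system_ok" }',   # JSON-style pattern for the health log
    metricTransformations=[{
        "metricName": "SystemOk",
        "metricNamespace": "Custom/Health",       # placeholder namespace
        "metricValue": "1",                       # 1 for every matching event
        "defaultValue": 0.0,                      # 0 when nothing matches
    }],
)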

So my next thought, with help from other users, is to use dimensions. I think with that I can separate things and watch each node, and if so I think it can also solve the next issue: can a CloudWatch alarm know which node threw the error? The node ID is in the log as well; a small part looks like this:

{
  "time": 1689885815,
  "ver": "2.29",
  "node": "i-00afaf10b8xxxxx",
  "status": "system_ok",
  "checks":
    **large array of check data here**
}

So I thought I would make a metric with the node and status as dimensions, then somehow tie that into CloudWatch alarms. So am I on the right track for this last task?
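What I'm picturing for the dimension version is something like this (again just a boto3 sketch with placeholder names, pulling the node from the JSON field shown above):

import boto3

logs = boto3.client("logs")

# Same JSON pattern, but the node ID from each matching log event becomes a
# "Node" dimension, so every instance gets its own time series.
logs.put_metric_filter(
    logGroupName="/myapp/health",
    filterName="system-ok-per-node",
    filterPattern='{ $.status = "system_ok" }',
    metricTransformations=[{
        "metricName": "SystemOk",
        "metricNamespace": "Custom/Health",
        "metricValue": "1",
        # Pull the dimension value out of the matched JSON event.
        "dimensions": {"Node": "$.node"},
        # No defaultValue here: periods with no matching events publish no data,
        # so the alarm's missing-data setting decides how that is treated.
    }],
)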

asked 9 months ago · 189 views
1 Answer

If you want to identify which node is affected when you receive an alarm, it seems to me you have the following options:

  1. Keep your existing alarm and, when it triggers, run a query on your logs to retrieve the node. If you do that, you keep only one alarm and one metric, but you need to implement the logic to run the query, which might be complex work.
  2. Publish the node as a dimension. If you do that, this will create as many metrics as there are node values (each dimension value counts as a metric in the CloudWatch pricing model). You then have two options: create a single alarm across all dimension values using a Metrics Insights query, or create one alarm per dimension value. A single alarm using a Metrics Insights query means fewer alarms and automatically adjusts as you add or remove nodes; however, the alarm doesn't publish the dimension value that breached, so you'll need to graph the Metrics Insights query in the CloudWatch console and read the breaching dimension value from the graph. If you create one alarm per dimension value instead, you'll get the dimension value straight in the notification, but you will need to manage creating/deleting alarms when you add or remove a node. (A rough sketch of both options follows below.)
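To make both options more concrete, here is a rough sketch assuming a custom namespace Custom/Health, a metric SystemOk and a Node dimension (those names, the log group and the SNS topic are illustrative, not taken from your setup):

import time
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Option 1: keep the single alarm and, when it fires, run a Logs Insights
# query to find which node(s) stopped reporting system_ok.
query_id = logs.start_query(
    logGroupName="/myapp/health",                 # illustrative log group name
    startTime=int(time.time()) - 900,             # last 15 minutes
    endTime=int(time.time()),
    queryString=(
        "fields @timestamp, node, status "
        '| filter status != "system_ok" '
        "| sort @timestamp desc | limit 20"
    ),
)["queryId"]
# Poll logs.get_query_results(queryId=query_id) until the status is "Complete".

# Option 2, one alarm per dimension value: the notification names the node
# directly, but alarms have to be created/deleted as nodes come and go.
cloudwatch.put_metric_alarm(
    AlarmName="system-ok-missing-i-00afaf10b8xxxxx",
    Namespace="Custom/Health",
    MetricName="SystemOk",
    Dimensions=[{"Name": "Node", "Value": "i-00afaf10b8xxxxx"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",                 # no system_ok seen at all => alarm
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:ops-alerts"],  # illustrative topic
)

# Option 2, single alarm over all nodes: pass a Metrics Insights expression
# such as
#   SELECT SUM(SystemOk) FROM SCHEMA("Custom/Health", Node) GROUP BY Node
# to put_metric_alarm via its Metrics parameter instead of the
# Namespace/MetricName/Dimensions arguments shown above.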
AWS
Jsc
answered 9 months ago
