MSK Brokers going offline


We are seeing a problem with our MSK cluster where one or more brokers will become unavailable (connections are refused) until the broker is manually restarted through the MSK console. I have several questions here:

  1. Given that MSK is a "fully managed service", what level of monitoring/intervention should we expect from Amazon when part of our cluster becomes unavailable?
  2. How can we detect when a Kafka broker is "alive" (producing logs) but not accepting connections? Right now we have to rely on users telling us there are issues.
  3. The troubleshooting page says that the "ActiveControllerCount" metric should always be 1 for a healthy cluster. Looking in CloudWatch, ActiveControllerCount has been 1 for only 5 minutes in the last 4 weeks. How can we find out why?

We've tried tearing the entire cluster down and recreating it, but we're still seeing these problems with the new cluster. For reference, our cluster is a t3.small cluster with 3 brokers in 3 zones.

asked a year ago, 239 views
1 Answer
  1. Given that MSK is a "fully managed service", what level of monitoring/intervention should we expect from Amazon when part of our cluster becomes unavailable?

As a managed service, we continuously monitor your backend hardware, such as broker instances and EBS volumes. When we detect an issue with either, we trigger a workflow to replace the unhealthy instance or EBS volume. We call this a HEALING operation, and your cluster goes into the HEALING state while it runs.

See: https://docs.aws.amazon.com/msk/latest/developerguide/msk-cluster-states.html

  2. How can we detect when a Kafka broker is "alive" (producing logs) but not accepting connections? Right now we have to rely on users telling us there are issues.
  • How can we detect when a Kafka broker is "alive":

ActiveControllerCount is a cluster-level CloudWatch metric, and as such every broker node contributes a data point to it indicating whether that broker is the active controller (1 if it is the active controller, 0 if it isn't). In a 3-node cluster, the Average statistic for this metric therefore reports 0.33, because only one of the 3 brokers is the active controller. This is normal.

So if you are looking at this CloudWatch metric, the threshold on the Average statistic should be 1 divided by the number of brokers: with 3 brokers the value should be 0.33 (1/3) and not less than that. If the MSK cluster had 6 broker nodes, the value should not be lower than 1/6, or 0.167, and so on. In other words, the expected value depends on the number of brokers configured in the cluster.
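
The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not an AWS API: the function names and the tolerance value are assumptions made for this example, and the observed average is whatever you read from the Average statistic of ActiveControllerCount in CloudWatch.

```python
def expected_controller_average(broker_count: int) -> float:
    """Expected Average of ActiveControllerCount when exactly one
    broker out of broker_count reports 1 and the rest report 0."""
    if broker_count < 1:
        raise ValueError("broker_count must be at least 1")
    return 1.0 / broker_count


def controller_average_is_healthy(observed_average: float,
                                  broker_count: int,
                                  tolerance: float = 0.01) -> bool:
    """True if the observed Average is not below 1/broker_count.

    The tolerance (an assumed value) absorbs rounding, e.g. a
    displayed 0.33 versus the exact 1/3.
    """
    return observed_average >= expected_controller_average(broker_count) - tolerance
```

For a 3-broker cluster, `controller_average_is_healthy(0.33, 3)` evaluates to true, while an average of 0.0 would fail the check and suggest that no broker is currently reporting itself as controller.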

  • but not accepting connections?:

This has to be checked from your producer application logs, which will show a 'Timed out' or 'connection refused' error, depending on the issue.
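
If you want an active check in addition to reading producer logs, one option is a plain TCP connection probe run from a host that can reach the brokers. The sketch below is an illustration under stated assumptions, not an MSK feature: the function name and result labels are made up for this example, and the host and port must come from your own cluster's bootstrap broker string. A successful TCP connect only shows that the listener accepts connections; it does not prove the broker can serve Kafka requests.

```python
import socket


def probe_broker(host: str, port: int, timeout: float = 5.0) -> str:
    """Try to open a TCP connection to a broker listener.

    Returns 'ok', 'refused', 'timeout', or 'error: <details>'.
    The labels are hypothetical names chosen for this sketch.
    """
    try:
        # create_connection resolves the host and connects within timeout
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        # The host answered, but nothing is accepting on that port
        return "refused"
    except socket.timeout:
        # No answer within the timeout
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar
        return f"error: {exc}"


# Example call (hypothetical hostname; take the real host and port
# from your cluster's bootstrap broker string):
# print(probe_broker("b-1.example-cluster.example.com", 9092))
```

Running this on a schedule against each broker and alerting on anything other than 'ok' would surface the "connection refused" case without waiting for users to report it.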

  3. The troubleshooting page says that the "ActiveControllerCount" metric should always be 1 for a healthy cluster. Looking in CloudWatch, ActiveControllerCount has been 1 for 5 minutes in the last 4 weeks. How can we find out why?

Please refer to the 2nd point.

If you still have concerns about this, please reach out to us through a Support case.

AWS
SUPPORT ENGINEER
answered 7 months ago
