Should MSK security patching increase broker CPU usage once it is done?

1

We had a security patching event on our cluster on September 6th. We saw the expected spike in (user) CPU usage during the patching and immediately after it, but usage has now stabilized at a significantly higher level than before the patching.

Is this something that would be expected (in my mind, it should not)?

Details: MSK Cluster running 3.2.0 using 6 brokers of the type kafka.t3.small using IAM for authentication and authorization.

Graph of current CPU usage (colored) compared to last week's CPU usage (grey).

[Figure: last 24 hours compared to 1 week ago]

Ilon
asked 7 months ago, 158 views
3 Answers
0

Hello there,

Increased CPU utilisation after the security patch is not expected. There could be other causes that need investigation. To troubleshoot this further, please reach out to us through a Support Case.

AWS
SUPPORT ENGINEER
answered 7 months ago
  • I'll create a case for this, even though the cluster in question has since been vertically scaled in order to handle this.

  • As we only have a basic support plan, I created a support ticket as "General question" -> "General Info and Getting Started, Using AWS & Services" and was told I could not get help with this.

    I still think this looks more like a general issue, either with slow recovery of the cluster after rotating a broker (when using t3.small, regardless of the amount of data on the cluster) or with the recommended maximum number of partitions per broker (a limit we were still below in this case).

0

If the CPU utilisation stays high after the patching event, I would suggest restarting the brokers, which may help resolve the issue. However, if it settles soon after the patching, you may ignore it: the remaining brokers need more CPU to handle the load of whichever broker is down during the rolling restart.
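As a rough illustration of why a temporary rise during a rolling restart is expected, here is a minimal sketch of the arithmetic. It assumes load is spread evenly across brokers and is fully taken over by the remaining ones, which is a simplification. The function name is hypothetical and not part of any AWS or Kafka API.

```python
def load_factor_with_brokers_down(total_brokers: int, brokers_down: int = 1) -> float:
    """Relative load on each remaining broker while some brokers are offline.

    Assumes an evenly balanced cluster whose total load stays constant and
    is shared equally by the brokers that remain up (a simplification).
    """
    remaining = total_brokers - brokers_down
    if remaining <= 0:
        raise ValueError("at least one broker must remain online")
    return total_brokers / remaining


if __name__ == "__main__":
    # Six brokers, one restarting at a time, as in the cluster from the question.
    factor = load_factor_with_brokers_down(6)
    print(f"Each remaining broker carries about {factor:.2f}x its usual load")
```

Under these assumptions the extra load disappears once the restarted broker is back and leadership is rebalanced, so it would not by itself account for a level that stays elevated for days.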

AWS
SUPPORT ENGINEER
answered 6 months ago
0

I'm writing this as a follow-up response for anyone else encountering this.

The CPU usage stayed high for more than 6 days after the patching event, and in the end we decided to vertically scale the brokers from t3.small to m5.large to avoid CPU throttling when running out of credits.

[Figure: Elevated MSK CPU usage 6 days after patching]

Due to the time elapsed since this happened, the granularity of the data shown above is not high enough to show that we still had spikes exceeding 20% and were still burning credits at the time we switched to m5.large.
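For readers unfamiliar with burstable instances, the following is a minimal sketch of the credit arithmetic behind the 20% figure. It assumes the published EC2 t3.small numbers (2 vCPUs, 24 credits earned per hour, which corresponds to a 20% baseline per vCPU) and that kafka.t3.small brokers follow the same model; both should be checked against current AWS documentation. The function names are illustrative only.

```python
# Assumed t3.small figures from the EC2 burstable instance documentation;
# verify them before relying on the results.
VCPUS = 2
CREDITS_EARNED_PER_HOUR = 24.0


def credits_spent_per_hour(avg_cpu_percent: float, vcpus: int = VCPUS) -> float:
    """One CPU credit equals one vCPU running at 100% for one minute."""
    return vcpus * (avg_cpu_percent / 100.0) * 60.0


def net_credits_per_hour(avg_cpu_percent: float) -> float:
    """Positive when the balance grows, negative when credits are burned."""
    return CREDITS_EARNED_PER_HOUR - credits_spent_per_hour(avg_cpu_percent)


def hours_until_empty(balance: float, avg_cpu_percent: float) -> float:
    """Hours until the given credit balance runs out at a steady CPU level."""
    net = net_credits_per_hour(avg_cpu_percent)
    if net >= 0:
        return float("inf")  # at or below baseline the balance never drains
    return balance / -net


if __name__ == "__main__":
    for cpu in (10, 20, 30):
        print(f"{cpu}% average CPU: net {net_credits_per_hour(cpu):+.1f} credits/hour")
```

At an average of exactly 20% the broker spends what it earns. Anything sustained above that draws down the balance, which is why spikes above 20% mattered here even though the absolute load was small.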

With data stored on the cluster measured in megabytes, no more than 10 connections to any single broker (we had 6 brokers), network traffic of less than 5 b/sec, and a partition count below the recommended number per broker, the increased CPU load on the brokers is hard to explain by anything we as a customer are able to control.

Given the light overall load on the cluster, I also find it hard to accept that Kafka itself would require a week to internally redistribute data and reach a stable state.

As this is clearly within a normal use case for MSK, it is hard for me to understand why AWS would not want to investigate it (or at least not inform me, as the customer who reported it).

Ilon
answered 6 months ago
