How do I troubleshoot issues when I upgrade my Amazon MSK cluster?

I need to troubleshoot issues when I upgrade my Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster.


Considerations and best practices

Note the following before you upgrade your Amazon MSK cluster:

  • Be sure that you set the replication factor for the cluster to a value of 3 or higher. Setting a replication factor of 1 might lead to offline partitions during a rolling update. Setting a replication factor of 2 might lead to data loss.
  • Set minimum in-sync replicas (minISR) to a value of (replication factor - 1) or less. A minISR value that's equal to the replication factor might prevent you from producing to the cluster during a rolling update. A minISR of 2 allows three-way replicated topics to be available when one replica is offline.
  • Be sure that the client connection strings include at least one broker from each Availability Zone. Having multiple brokers in a client's connection string allows for failover when a specific broker is offline for an update.
  • Before you update the configuration of a cluster, make sure that the cluster is in the ACTIVE state.
  • It's a best practice to upgrade your cluster during low traffic times. The amount of time that's required to upgrade the Apache Kafka version depends on the number of brokers in your cluster.
  • You can't make other updates to the cluster while you upgrade the cluster version. You can still produce to and consume from the cluster during the upgrade.
  • There is no risk of data loss during an upgrade, except for under-replicated topics (for example, topics with a replication factor of less than 3). Even in this situation, partitions are available again after the brokers are online.
  • You can update the Apache Kafka version of your Amazon MSK cluster through the Amazon MSK console or the AWS Command Line Interface (AWS CLI).
  • You can update your Amazon MSK cluster to a newer version of Apache Kafka. You can't update it to an older version.
  • Amazon MSK updates only the server software and doesn't update your clients. Therefore, when you upgrade your cluster, confirm that you can use the features of the new Apache Kafka version with the client software version.
  • Updating the instance type doesn't upgrade the cluster version.
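
The replication factor and minISR guidance above can be expressed as a small pre-upgrade check. The following is a minimal sketch, not an AWS tool: the function name `check_topic_settings` and the example topic names are hypothetical, and you must supply each topic's replication factor and minISR yourself.

```python
def check_topic_settings(replication_factor: int, min_isr: int) -> list[str]:
    """Return warnings for one topic, based on the considerations listed above."""
    warnings = []
    if replication_factor < 3:
        warnings.append(
            f"replication factor {replication_factor} is below 3: partitions "
            "might go offline or lose data during a rolling update"
        )
    if min_isr > replication_factor - 1:
        warnings.append(
            f"minISR {min_isr} is greater than replication factor - 1 "
            f"({replication_factor - 1}): producing might fail while a broker restarts"
        )
    return warnings


# Hypothetical topic settings, for illustration only.
for name, rf, isr in [("orders", 3, 2), ("logs", 3, 3), ("scratch", 1, 1)]:
    problems = check_topic_settings(rf, isr)
    print(name, "OK" if not problems else problems)
```

A topic with a replication factor of 3 and a minISR of 2 passes both checks, which matches the recommendation in the list.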

Monitoring the upgrade

When you create an Amazon MSK cluster, you specify which Apache Kafka version you need on the cluster. You can also update the cluster to a newer version of Apache Kafka after you create the cluster by doing the following:

  1. Open the Amazon MSK console.
  2. Choose the cluster that you want to upgrade.
  3. On the Properties tab, choose Upgrade in the Apache Kafka version section.

For more information, see Updating the Apache Kafka version.
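
If you prefer to script the upgrade instead of using the console, the same request can be sketched with the AWS SDK for Python (Boto3). This is a minimal sketch under stated assumptions: the helper name `start_kafka_upgrade` is hypothetical, `client` is expected to behave like `boto3.client("kafka")`, and the ARN and target version in the usage comment are placeholders.

```python
def start_kafka_upgrade(client, cluster_arn: str, target_version: str) -> str:
    """Start a Kafka version upgrade and return the cluster operation ARN.

    `client` is assumed to behave like boto3.client("kafka").
    """
    info = client.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]

    # The cluster must be ACTIVE before it can be updated.
    if info["State"] != "ACTIVE":
        raise RuntimeError(f"cluster is {info['State']}, expected ACTIVE")

    response = client.update_cluster_kafka_version(
        ClusterArn=cluster_arn,
        # CurrentVersion is the cluster's metadata version, not the Kafka version.
        CurrentVersion=info["CurrentVersion"],
        TargetKafkaVersion=target_version,
    )
    return response["ClusterOperationArn"]


# Example usage (requires the boto3 package and AWS credentials):
#   import boto3
#   arn = "arn:aws:kafka:us-east-1:111122223333:cluster/example/abcd1234"  # placeholder
#   print(start_kafka_upgrade(boto3.client("kafka"), arn, "3.5.1"))  # placeholder version
```

Passing the client as a parameter keeps the helper easy to try against a stub before you run it against a real cluster.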

You can monitor the progress of the update on the Cluster operations tab. You can monitor each step of the upgrade, such as Initializing the upgrade, Updating Apache Kafka version, and Finalizing upgrade, from this tab. After the upgrade reaches 17%, it might take several hours for the upgrade to be completed. Note that Amazon MSK performs the upgrade on a rolling basis. One broker is taken out of the cluster at a time, and its Kafka version is upgraded. This broker rejoins as the next broker is taken out. This process is followed until the last broker is upgraded with the new Kafka version.
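
The progress that the Cluster operations tab shows can also be polled from a script. The following is a rough sketch, assuming a client that behaves like `boto3.client("kafka")` and an operation ARN returned when the upgrade was started. The helper name `wait_for_operation`, the polling defaults, and the terminal state names are assumptions, so verify them against the API reference for your SDK version.

```python
import time

# Assumed terminal values of OperationState; verify against the MSK API reference.
TERMINAL_STATES = {"UPDATE_COMPLETE", "UPDATE_FAILED"}


def wait_for_operation(client, operation_arn, poll_seconds=60, max_polls=240,
                       sleep=time.sleep):
    """Poll a cluster operation until it reaches a terminal state, then return it."""
    for _ in range(max_polls):
        info = client.describe_cluster_operation(
            ClusterOperationArn=operation_arn
        )["ClusterOperationInfo"]
        state = info["OperationState"]

        # OperationSteps, when present, lists each stage and its status.
        for step in info.get("OperationSteps", []):
            status = step.get("StepInfo", {}).get("StepStatus")
            print(f"{step.get('StepName')}: {status}")

        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"operation still running after {max_polls} polls")
```

Because a rolling upgrade can take several hours, a long polling interval is usually enough. The defaults here allow about four hours before the helper gives up.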

Troubleshooting common errors

Error: "Error updating cluster configuration. There was a problem updating cluster configuration. If the problem persists, contact AWS Support. The number of partitions per broker is above the recommended limit. Add more brokers and rearrange the partitions per broker to be below the recommended limit, then retry the request."

Upgrade is stuck in the stage 'Initializing upgrade'

You get this error when the number of partitions per broker exceeds the recommended value. Handling partitions on a broker is a resource-intensive workload. A partition count that's higher than the recommended limit might strain the available resources in the cluster. In this situation, you can't perform any of the following operations on the cluster:

  • Update the cluster configuration
  • Update the Apache Kafka version for the cluster
  • Update the cluster to a smaller broker type

To resolve this error, try the following:

  • Increase the number of brokers within the cluster. Then, reassign partitions to reduce the number of partitions per broker. Use Amazon CloudWatch metrics to find the number of partitions per broker. The partition count is the total number of topic partitions per broker, including replicas. By default, the number of partitions per topic is 1, and the replication factor is 3 for a 3-AZ cluster. Therefore, each such topic counts as 3 partitions across the cluster: the leader partition plus 2 replicas. To move partitions to different brokers on the same cluster, you can use the partition reassignment tool named
  • Reduce the number of partitions by deleting unused topics. You can use the following command to see all the topics on the cluster along with the number of partitions. Be sure to set up an Apache Kafka client on an Amazon Elastic Compute Cloud (Amazon EC2) machine before running the command.
bin/ --bootstrap-server localhost:9092 --describe --topic test
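
To turn the describe output into a per-broker partition count that includes replicas, you can tally the broker IDs in each partition's replica list. The following is a minimal sketch. It assumes that each partition line in the output contains a `Replicas:` field with comma-separated broker IDs, and the function name `partitions_per_broker` and the sample text are illustrative only.

```python
import re
from collections import Counter

# Assumed format of a partition line: "... Replicas: 1,2,3 ..."
REPLICAS = re.compile(r"Replicas:\s*([\d,]+)")


def partitions_per_broker(describe_output: str) -> Counter:
    """Count partition replicas (leaders and followers) hosted on each broker."""
    counts = Counter()
    for line in describe_output.splitlines():
        match = REPLICAS.search(line)
        if match:
            counts.update(int(b) for b in match.group(1).split(",") if b)
    return counts


# Illustrative sample, not captured from a real cluster.
SAMPLE = """Topic: test\tPartitionCount: 2\tReplicationFactor: 3\tConfigs:
\tTopic: test\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3
\tTopic: test\tPartition: 1\tLeader: 2\tReplicas: 2,3,1\tIsr: 2,3,1
"""

for broker, count in sorted(partitions_per_broker(SAMPLE).items()):
    print(f"broker {broker}: {count} partitions")
```

In the sample, a topic with 2 partitions and a replication factor of 3 places 2 partition replicas on each of the 3 brokers, for 6 in total. Compare the per-broker totals against the recommended limit for your broker type.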
AWS OFFICIAL - Updated a year ago