How do I troubleshoot issues when I upgrade my Amazon MSK cluster?

I need to troubleshoot issues when I upgrade my Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

Considerations and best practices

Before you upgrade your Amazon MSK cluster, review the following best practices:

  • Set the replication factor for the cluster to 3 or higher. A replication factor of 1 might cause offline partitions during a rolling update. A replication factor of 2 might lead to data loss.
  • Set minimum in-sync replicas (minISR) to a value of replication factor - 1, or less. If the minISR value equals the replication factor, then it might block producers from writing to the cluster during a rolling update. A minISR of 2 allows three-way replicated topics to be available when one replica is offline.
  • Before you update the configuration of a cluster, make sure that the cluster is in the ACTIVE state.
  • Use the recommended Apache Kafka version when you create new Amazon MSK clusters.
  • Include at least one broker from each Availability Zone in the client strings. Multiple brokers in a client's connection string allow failovers when a specific broker goes offline for an update.
  • Use Apache Kafka AdminClient version 2.8.0 or higher for topic management.
  • Upgrade connecting clients to the recommended version or higher. Client upgrades are not subject to the end of life (EOL) dates of your Amazon MSK cluster's Kafka version.
    Note: Apache Kafka provides a bi-directional client compatibility policy that allows older clients to work with newer clusters, and it allows newer clients to work with older clusters. For more information, see Compatibility on the Apache Kafka website.
  • Upgrade your cluster during low traffic times. The amount of time that's required to upgrade the Apache Kafka version depends on the number of brokers in your cluster.
    Note: When you upgrade the cluster version, you can't make other updates until the version upgrade is complete. However, you can still produce and consume from the cluster during the upgrade.

Note: When you update the instance type, you don't automatically upgrade the cluster version.

For more information, see Best practices for version upgrades.
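The replication and cluster state checks above can be scripted before you start an upgrade. The following is a minimal sketch, not an official tool: the function names check_replication and cluster_state are hypothetical, and the second function assumes a provisioned cluster whose state is returned under ClusterInfo.State by the describe-cluster command.

```shell
# Hypothetical helper: compare a topic's settings with the guidance above.
# Arguments: replication factor, min.insync.replicas.
check_replication() {
  rf="$1"
  min_isr="$2"
  if [ "$rf" -lt 3 ]; then
    echo "warn: replication factor $rf is below 3"
  elif [ "$min_isr" -gt $((rf - 1)) ]; then
    echo "warn: minISR $min_isr exceeds replication factor - 1"
  else
    echo "ok"
  fi
}

# Hypothetical helper: print the cluster state (for example, ACTIVE).
# Argument: the cluster ARN. Assumes a provisioned cluster.
cluster_state() {
  aws kafka describe-cluster \
    --cluster-arn "$1" \
    --query 'ClusterInfo.State' \
    --output text
}
```

For example, check_replication 3 2 prints ok, and check_replication 3 3 prints a warning because the minISR value equals the replication factor.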

Monitor the upgrade

When you create an Amazon MSK cluster, you can specify which Apache Kafka version you need on the cluster. You can also update the cluster to a newer version of Apache Kafka after you create the cluster.

You can monitor the progress of the update in the Cluster operations tab in the Amazon MSK console. After the upgrade reaches 17%, it might take several hours for the upgrade to complete.

Note: Amazon MSK performs the upgrade as a rolling restart. Amazon MSK takes one broker out of the cluster at a time and upgrades its Kafka version. The upgraded broker rejoins the cluster, and then Amazon MSK takes out the next broker. This process continues until Amazon MSK upgrades the last broker to the new Kafka version.

To monitor the upgrade progress on your cluster through the AWS CLI, run the describe-cluster-operation command:

aws kafka describe-cluster-operation --cluster-operation-arn ClusterOperationArn

If the operation is in the Incomplete or Failed state, then contact AWS Support.
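To check the operation repeatedly instead of running the command by hand, you can wrap it in a small polling loop. The following is a sketch under assumptions: the function names are hypothetical, the state is read from ClusterOperationInfo.OperationState in the command output, and the loop assumes that finished states contain the text COMPLETE or FAILED. Adjust the patterns if your output uses different state names.

```shell
# Hypothetical helper: print the current state of a cluster operation.
# Argument: the cluster operation ARN.
operation_state() {
  aws kafka describe-cluster-operation \
    --cluster-operation-arn "$1" \
    --query 'ClusterOperationInfo.OperationState' \
    --output text
}

# Hypothetical helper: print the state every INTERVAL seconds (default 60)
# until it looks finished. The COMPLETE and FAILED patterns are assumptions.
watch_operation() {
  arn="$1"
  interval="${2:-60}"
  while :; do
    state="$(operation_state "$arn")"
    echo "$state"
    case "$state" in
      *COMPLETE*|*FAILED*) break ;;
    esac
    sleep "$interval"
  done
}
```

For example, watch_operation ClusterOperationArn 300 checks the operation every five minutes.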

Troubleshoot errors

Partition operations on a broker consume large amounts of system resources. If the number of partitions per broker is higher than the recommended limit, then you might strain the available resources in the cluster. When the cluster resources are strained, you can't update the cluster configuration, upgrade the Apache Kafka version for the cluster, or update the cluster to a smaller broker type. When the number of partitions per broker exceeds the recommended value, you receive one of the following errors:

"Error updating cluster configuration There was a problem updating cluster configuration. If the problem persists, contact AWS Support. The number of partitions per broker is above the recommended limit. Add more brokers and rearrange the partitions per broker to be below the recommended limit, then retry the request."

-or-

"Upgrade is stuck in the stage 'Initializing upgrade'"

To resolve the preceding errors, take the following actions:

  • Increase the number of brokers within the cluster. Then, reassign partitions to reduce the number of partitions per broker. Use Amazon CloudWatch metrics to monitor the number of partitions per broker. For more information, see Default Amazon MSK configuration.
  • Delete unused topics. To see a topic and its number of partitions, run the following command:
    KAFKA_ROOT/bin/kafka-topics.sh --bootstrap-server BOOTSTRAP_SERVER --describe --topic Topic_name
    Note: Replace Topic_name with the topic name. To describe all topics on the cluster, omit the --topic option.
    Before you run the preceding command, set up an Apache Kafka client on an Amazon Elastic Compute Cloud (Amazon EC2) machine.
  • Change the brokers to a larger instance type.
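To see how partition replicas are spread across brokers, you can summarize the output of kafka-topics.sh --describe. The following is a rough sketch, not an Amazon MSK tool: the function name count_replicas_per_broker is hypothetical, and it assumes that each partition line contains a "Replicas:" field followed by a comma-separated list of broker IDs, which might differ between Kafka versions.

```shell
# Hypothetical helper: read `kafka-topics.sh --describe` output on stdin
# and print "<broker id> <replica count>" for each broker, sorted by ID.
count_replicas_per_broker() {
  awk '
    {
      # Find the "Replicas:" label; the next field holds the broker IDs.
      for (i = 1; i < NF; i++) {
        if ($i == "Replicas:") {
          n = split($(i + 1), ids, ",")
          for (j = 1; j <= n; j++) count[ids[j]]++
        }
      }
    }
    END { for (b in count) print b, count[b] }
  ' | sort -n
}
```

For example, pipe the describe output for all topics into the function:

```shell
KAFKA_ROOT/bin/kafka-topics.sh --bootstrap-server BOOTSTRAP_SERVER --describe | count_replicas_per_broker
```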